Project: Hermes Speed Monitor
Last Updated: 2026-05-01
Purpose: Comprehensive list of all possible errors by module with causes and remediation
API Module
Authentication & Authorization (src/api/auth.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Missing X-Api-Key header | HTTPException(401) | High | Client didn’t send required header | Add X-Api-Key header to all requests |
| Invalid API key | HTTPException(403) | High | API key doesn’t match configured value | Verify API key matches .env configuration |
| Rate limit exceeded | HTTPException(429) | Medium | Too many requests from client | Implement backoff, reduce request frequency |
Prevention:
- Always send X-Api-Key header
- Store API key securely (environment variable, secrets vault)
- Implement client-side rate limiting
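
The header and backoff advice above can be sketched on the client side as follows. This is a minimal illustration, not the project's client: `send` is a hypothetical callable that performs the HTTP request and returns `(status_code, body)`, and the delay values are assumptions.

```python
import random
import time
from typing import Callable, Dict, Tuple

# Hypothetical response shape for this sketch: (status_code, body).
Response = Tuple[int, str]


def build_headers(api_key: str) -> Dict[str, str]:
    """Return the headers every request must carry."""
    if not api_key:
        raise ValueError("API key is empty; the server would answer 401")
    return {"X-Api-Key": api_key}


def call_with_backoff(
    send: Callable[[Dict[str, str]], Response],
    api_key: str,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(headers), retrying with exponential backoff while it returns 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    headers = build_headers(api_key)
    for attempt in range(max_attempts):
        status, body = send(headers)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Exponential delay capped at max_delay, plus up to 10% jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 10))
    return status, body
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting; in production the default `time.sleep` applies.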
Configuration Routes (src/api/routes/config.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Invalid configuration payload | HTTPException(400) | Medium | Validation failed on config update | Check interval (5-1440), valid exporter names |
| Runtime config save failed | HTTPException(500) | High | Filesystem error saving config | Check permissions, disk space |
Prevention:
- Validate payload before sending (client-side validation)
- Ensure data directory is writable
- Use atomic writes (handled by backend)
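
As a rough sketch of the client-side validation suggested above, the check below mirrors the documented interval range (5-1440). The field names `interval_minutes` and `enabled_exporters` are borrowed from the runtime configuration section, and the set of exporter names is an assumption based on the exporters this document lists; the real payload schema may differ.

```python
from typing import Any, Dict, List

# Assumed exporter names, taken from the exporter sections of this document.
KNOWN_EXPORTERS = {"csv", "prometheus", "loki", "sqlite"}


def validate_config_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems: List[str] = []

    interval = payload.get("interval_minutes")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int):
        problems.append("interval_minutes must be an integer")
    elif not 5 <= interval <= 1440:
        problems.append("interval_minutes must be between 5 and 1440")

    exporters = payload.get("enabled_exporters")
    if not isinstance(exporters, list):
        problems.append("enabled_exporters must be a list")
    else:
        unknown = sorted(set(map(str, exporters)) - KNOWN_EXPORTERS)
        if unknown:
            problems.append("unknown exporters: " + ", ".join(unknown))

    return problems
```

Returning a list of problems rather than raising on the first one lets a client show every issue at once before it sends the request.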
Results Routes (src/api/routes/results.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| No database found yet | HTTPException(503) | Low | No speedtests have run | Wait for first test or trigger manually |
| Invalid pagination parameters | HTTPException(400) | Low | page/page_size out of range | Use page ≥ 1, page_size 1-500 |
| Database query failed | HTTPException(500) | High | SQLite error during query | Check database integrity, reduce page_size |
Prevention:
- Enable SQLite exporter
- Trigger first speedtest after deployment
- Use reasonable page sizes (≤100)
Trigger Routes (src/api/routes/trigger.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Failed to start manual test thread | HTTPException(500) | High | Thread creation failed | Check system resources, restart container |
| Trigger file write failed | Internal | Medium | Filesystem error | Check permissions on data directory |
Prevention:
- Monitor system resources (threads, memory)
- Ensure data directory is writable
- Don’t trigger tests too frequently (respect rate limits)
Alert Routes (src/api/routes/alerts.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Threshold must be positive integer | HTTPException(400) | Low | Invalid alert threshold | Use threshold ≥ 1 |
| Cooldown must be positive integer | HTTPException(400) | Low | Invalid cooldown | Use cooldown ≥ 1 (minutes) |
| Webhook URL invalid scheme | HTTPException(400) | Medium | URL doesn’t start with http/https | Use http:// or https:// |
| Gotify/ntfy URL required | HTTPException(400) | Medium | Provider enabled but URL missing | Provide valid URL for enabled provider |
| Gotify token required | HTTPException(400) | Medium | Token missing for Gotify | Provide app token from Gotify |
| ntfy topic required | HTTPException(400) | Medium | Topic missing for ntfy | Provide topic name for ntfy |
| SSRF protection triggered | HTTPException(400) | High | URL points to private IP | Use public URLs only (not 127.0.0.1, 10.x, etc.) |
Prevention:
- Validate alert configuration before submitting
- Use only public URLs for webhooks (no localhost, private IPs)
- Keep threshold reasonable (2-5 failures)
- Set cooldown to avoid alert storms (60+ minutes recommended)
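
To pre-check a webhook URL before submitting it, something like the following could be used. It is a simplified sketch and not the backend's SSRF filter: it only inspects the scheme, the literal name `localhost`, and IP-literal hosts, and it does not resolve hostnames, which a complete check would also need to do.

```python
import ipaddress
from urllib.parse import urlsplit


def is_safe_webhook_url(url: str) -> bool:
    """Return True if the URL uses http/https and does not name a private host."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host or host.lower() == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; a hostname would still need a DNS check in practice.
        return True
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```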
Configuration Module
Environment Configuration (src/config.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| API_KEY not set | SystemExit(1) | Critical | Missing required env var | Set API_KEY in .env file |
| API_KEY validation failed | SystemExit(1) | Critical | API key too short (<32 chars) | Generate secure key: openssl rand -hex 32 |
| Invalid PROMETHEUS_PORT | ValueError | High | Port not integer or out of range | Use port 1-65535 |
| Invalid RATE_LIMIT | ValueError | Medium | Rate limit not positive integer | Use positive integer (recommended: 60) |
| Invalid LOKI_TIMEOUT | ValueError | Medium | Timeout not positive | Use positive integer (seconds) |
Prevention:
- Use .env.example as template
- Generate secure random API key
- Validate environment variables before deployment
- Use default ports when possible (Prometheus: 9090)
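
A pre-deployment check along these lines can catch the errors in the table above before the service starts. The variable names come from the table; the function names, and the choice to skip unset optional variables, are assumptions made for this sketch.

```python
import secrets
from typing import List, Mapping


def generate_api_key() -> str:
    """Return 64 hex characters, the same shape as `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def check_environment(env: Mapping[str, str]) -> List[str]:
    """Return a list of configuration errors found in env (empty if none)."""
    errors: List[str] = []

    api_key = env.get("API_KEY", "")
    if not api_key:
        errors.append("API_KEY is not set")
    elif len(api_key) < 32:
        errors.append("API_KEY is shorter than 32 characters")

    # (variable name, highest allowed value or None for "no upper bound")
    integer_vars = (
        ("PROMETHEUS_PORT", 65535),
        ("RATE_LIMIT", None),
        ("LOKI_TIMEOUT", None),
    )
    for name, upper in integer_vars:
        raw = env.get(name)
        if raw is None:
            continue  # unset: assume the application falls back to its default
        try:
            value = int(raw)
        except ValueError:
            errors.append(f"{name} is not an integer: {raw!r}")
            continue
        if value < 1 or (upper is not None and value > upper):
            errors.append(f"{name} is out of range: {value}")
    return errors
```

It could be run as `check_environment(os.environ)` in a deployment script, failing the deploy when the returned list is non-empty.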
Runtime Configuration (src/runtime_config.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Runtime config file corrupted | Warning | Medium | Malformed JSON | Delete file, allow regeneration |
| interval_minutes out of range | Warning | Low | Value not 5-1440 | Fallback to default (60) |
| enabled_exporters not a list | Warning | Low | Wrong type | Fallback to default (all enabled) |
| Unknown exporter in list | Warning | Low | Invalid exporter name | Filtered out, valid ones kept |
| Alert threshold out of range | Warning | Low | Not 1-100 | Fallback to default (3) |
| Alert cooldown out of range | Warning | Low | Not 1-10080 | Fallback to default (60) |
| Could not save runtime config | RuntimeError | High | Filesystem error | Check permissions, disk space |
| Temp file cleanup failed | Warning | Low | Orphaned temp file | Non-fatal, temp file remains |
Prevention:
- Don’t manually edit runtime_config.json (use API)
- Ensure data directory is writable
- Monitor disk space
- Regular backups of config file
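
The atomic-write and fallback behaviour described in this section might look roughly like the sketch below. It illustrates the general write-to-temp-then-rename technique and is not the contents of `runtime_config.py`; the function names are hypothetical and only the `interval_minutes` fallback is shown.

```python
import json
import os
import tempfile
from typing import Any, Dict

# Default quoted from the fallback value in the table above.
DEFAULTS: Dict[str, Any] = {"interval_minutes": 60}


def save_config_atomically(path: str, config: Dict[str, Any]) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(config, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        # Best-effort cleanup so a failed save leaves no orphaned temp file.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise


def load_config(path: str) -> Dict[str, Any]:
    """Load the config, falling back to defaults on missing or malformed data."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return dict(DEFAULTS)
    if not isinstance(data, dict):
        return dict(DEFAULTS)
    interval = data.get("interval_minutes")
    valid = isinstance(interval, int) and not isinstance(interval, bool)
    if not valid or not 5 <= interval <= 1440:
        data["interval_minutes"] = DEFAULTS["interval_minutes"]
    return data
```

Because the rename replaces the file in one step, a reader sees either the old config or the new one, never a half-written file.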
Dispatcher Module
Result Dispatcher (src/result_dispatcher.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Invalid result type | TypeError | Critical | Non-exporter passed to register() | Programming error - fix code |
| Exporter registration duplicate | Warning | Low | Same exporter registered twice | Ignored, last registration wins |
| No exporters registered | Warning | Medium | dispatch() called with no exporters | Enable at least one exporter |
| One or more exporters failed | DispatchError | Medium | Exporter threw exception | Check exporter-specific logs |
| All exporters failed | DispatchError | High | Every exporter failed | Check system health, connectivity |
Prevention:
- Always register at least one exporter
- Monitor exporter health (Prometheus, Loki, SQLite)
- Test exporters individually before enabling
- Check network connectivity for remote exporters
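
The partial-failure behaviour in the table (one failing exporter does not stop the others) can be illustrated as below. This is a sketch under assumptions: exporters are modelled as plain callables keyed by name, and the attributes on `DispatchError` are invented for the example; the real `result_dispatcher.py` interface may differ.

```python
from typing import Callable, Dict, List

Exporter = Callable[[dict], None]  # hypothetical exporter interface for this sketch


class DispatchError(Exception):
    """Raised after all exporters were attempted and at least one failed."""

    def __init__(self, failures: Dict[str, Exception], total: int) -> None:
        self.failures = failures
        self.all_failed = len(failures) == total
        super().__init__(f"{len(failures)} of {total} exporters failed")


def dispatch(result: dict, exporters: Dict[str, Exporter]) -> List[str]:
    """Send result to every exporter; return the names of those that succeeded."""
    succeeded: List[str] = []
    failures: Dict[str, Exception] = {}
    for name, export in exporters.items():
        try:
            export(result)
            succeeded.append(name)
        except Exception as exc:  # one bad exporter must not stop the others
            failures[name] = exc
    if failures:
        raise DispatchError(failures, total=len(exporters))
    return succeeded
```

A caller can inspect `all_failed` to tell the Medium-severity case (some exporters failed) from the High-severity one (every exporter failed).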
Exporters
CSV Exporter (src/exporters/csv_exporter.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| CSV file write failed | OSError | High | Filesystem error | Check permissions, disk space |
| CSV directory creation failed | OSError | High | Parent directory not writable | Verify logs/ directory permissions |
| CSV pruning failed | Warning | Low | Error reading/writing for prune | Non-fatal, file continues to grow |
| CSV file corrupted | ValueError | Medium | Invalid CSV format | Backup and recreate file |
| Max rows exceeded (prune disabled) | Warning | Low | File growing unbounded | Enable pruning or manual cleanup |
Prevention:
- Ensure logs/ directory exists and is writable
- Enable pruning (max_rows or retention_days)
- Monitor file size: du -h logs/results.csv
- Regular backups of CSV file
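
For manual cleanup when pruning is disabled, row-count pruning could be approximated as follows. The sketch assumes the file has a header row and that rows are appended in chronological order, so the newest rows are at the end; it rewrites the file in place and is not the exporter's own prune routine.

```python
import csv
from typing import List


def prune_csv(path: str, max_rows: int) -> int:
    """Keep the header plus the newest max_rows rows; return how many were dropped."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows: List[List[str]] = list(csv.reader(handle))
    if not rows:
        return 0
    header, data = rows[0], rows[1:]
    excess = len(data) - max_rows
    if excess <= 0:
        return 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(data[excess:])  # drop the oldest rows at the top
    return excess
```

Take a backup before running it against a live file, since the rewrite is not atomic.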
Prometheus Exporter (src/exporters/prometheus_exporter.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Invalid port number | ValueError | Critical | Port not 1-65535 | Use valid port number |
| Port already in use | RuntimeError | Critical | Another process bound to port | Use different port or kill process |
| Failed to start HTTP server | RuntimeError | Critical | Server startup failed | Check network config, firewall |
| Failed to update gauges | Warning | Medium | Metric update error | Check result data validity |
| Label cardinality too high | Warning | Low | Too many unique label combinations | Disable dynamic labels if needed |
Prevention:
- Use dedicated port for Prometheus (9090-9100)
- Check port availability before starting: netstat -an | grep :9090
- Monitor metric cardinality (avoid unbounded ISP labels)
- Use the "make labels optional" feature if cardinality is an issue
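
The port-availability check above can also be done from Python before the exporter starts. This is a small sketch with a hypothetical helper name; it only reports whether a bind succeeds at that moment, so another process could still take the port afterwards.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to host:port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port must be 1-65535, got {port}")
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True
```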
Loki Exporter (src/exporters/loki_exporter.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Loki URL is required | ValueError | Critical | URL not configured | Set LOKI_URL environment variable |
| Loki URL invalid scheme | ValueError | Critical | URL not http/https | Use http:// or https:// |
| Loki URL missing hostname | ValueError | Critical | URL has no host | Provide full URL with hostname |
| Loki URL includes credentials | Warning | Medium | URL has username/password | Remove credentials, use auth headers |
| Timeout must be positive | ValueError | Medium | Invalid timeout value | Use positive integer (seconds) |
| Loki job label empty | ValueError | Medium | Job name is blank | Provide non-empty job name |
| Loki push connection error | RuntimeError | High | Network/DNS failure | Check connectivity, DNS, firewall |
| Loki push timed out | RuntimeError | High | Loki didn’t respond in time | Increase timeout, check Loki health |
| Loki push rejected (HTTP error) | RuntimeError | High | Loki returned 4xx/5xx | Check Loki logs, verify auth/tenant |
Prevention:
- Test Loki connectivity before enabling: curl http://loki:3100/ready
- Use reasonable timeout (5-10 seconds)
- Monitor Loki health and resource usage
- Verify Loki authentication/tenant configuration
- Check network latency to Loki
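
The URL-related rows in the Loki table can be checked ahead of time with the standard library. The sketch below is illustrative only: the function name and message wording are assumptions, and it separates errors from warnings in the same way the table does.

```python
from typing import List, Tuple
from urllib.parse import urlsplit


def check_loki_url(url: str) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for a Loki URL, mirroring the table above."""
    errors: List[str] = []
    warnings: List[str] = []
    if not url or not url.strip():
        return ["Loki URL is required"], warnings
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        errors.append("Loki URL must use http:// or https://")
    if not parts.hostname:
        errors.append("Loki URL is missing a hostname")
    if parts.username or parts.password:
        warnings.append("Loki URL includes credentials; use auth headers instead")
    return errors, warnings
```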
SQLite Exporter
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| SQLite write failed | RuntimeError | High | INSERT failed | Check disk space, permissions, schema |
| Database locked (timeout) | SQLiteLockTimeout | High | Lock held for >30s | Reduce page_size, check for long queries |
| Database initialization failed | RuntimeError | Critical | Schema creation failed | Check permissions, disk space |
| Database migration failed | RuntimeError | Critical | ALTER TABLE failed | Backup DB, review migration code |
| Database corrupted | sqlite3.DatabaseError | Critical | Integrity check failed | Restore from backup or recreate |
| Disk full during write | OSError | Critical | No space for database growth | Free disk space or increase volume size |
Prevention:
- Use WAL mode (enabled by default) for better concurrency
- Monitor database size: du -h data/results.db
- Regular integrity checks: sqlite3 data/results.db "PRAGMA integrity_check;"
- Keep page_size reasonable (≤100) for queries
- Monitor disk space usage
- Regular backups of database file
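
The WAL and integrity-check advice above maps onto Python's built-in `sqlite3` module as in the sketch below. The helper names are hypothetical, and the 30-second busy timeout is an assumption chosen to match the lock timeout quoted in the table.

```python
import sqlite3
from typing import List


def open_results_db(path: str, lock_timeout_s: float = 30.0) -> sqlite3.Connection:
    """Open the database with a busy timeout and switch it to WAL mode."""
    conn = sqlite3.connect(path, timeout=lock_timeout_s)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def integrity_problems(conn: sqlite3.Connection) -> List[str]:
    """Run PRAGMA integrity_check; an empty list means the database is healthy."""
    rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    return [] if rows == ["ok"] else rows
```

A scheduled job could call `integrity_problems(open_results_db("data/results.db"))` and raise an alert when the list is not empty.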
Services
Speedtest Runner (src/services/speedtest_runner.py + src/providers/)
| Error | Provider | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Ookla CLI not found | ookla | Critical | Binary not installed or not in PATH | Install from https://www.speedtest.net/apps/cli |
| Speedtest execution failed | ookla | High | CLI returned non-zero exit code | Check internet connectivity; run speedtest manually |
| Speedtest timed out (>120s) | ookla | High | Test did not complete in time | Check network speed or ISP issues |
| JSON parse failed | ookla | High | Invalid output from Ookla CLI | Update Ookla CLI; inspect stderr in logs |
| NDT7 locate request failed | ndt7 | High | Cannot reach M-Lab Locate API | Check outbound HTTPS; M-Lab may be at capacity |
| NDT7 WebSocket error | ndt7 | High | WebSocket connection dropped | Check firewall; verify WSS (port 443) is allowed |
| Custom HTTP download failed | custom | High | Request to download URL failed | Verify SPEEDTEST_CUSTOM_URL_DOWNLOAD is reachable |
| Custom HTTP upload failed | custom | Medium | POST to upload URL failed | Verify SPEEDTEST_CUSTOM_URL_UPLOAD accepts POST |
| Custom URL scheme invalid | custom | Critical | Non-HTTP/S scheme in URL | Only http:// and https:// URLs are accepted |
| All providers exhausted | all | High | Every provider in the chain failed | Check connectivity; review per-provider errors above |
Prevention:
- Verify Ookla CLI (if used): speedtest --version
- Test Ookla manually: speedtest --format=json
- For ndt7: ensure outbound WSS traffic is allowed on port 443
- For custom: confirm download/upload URLs respond before deploying
- Monitor consecutive failure count for alerts
- Configure a fallback chain, e.g., SPEEDTEST_PROVIDERS=ookla,ndt7
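
A fallback chain such as `SPEEDTEST_PROVIDERS=ookla,ndt7` could be evaluated along the lines below. This is a sketch of the general try-each-in-order pattern, not the runner's actual code: providers are modelled as zero-argument callables returning a dict, and `AllProvidersFailed` is a hypothetical exception name.

```python
from typing import Callable, Dict, List


class AllProvidersFailed(Exception):  # hypothetical name for this sketch
    """Raised when every provider in the chain has failed."""

    def __init__(self, errors: Dict[str, Exception]) -> None:
        self.errors = errors
        super().__init__("all providers exhausted: " + ", ".join(errors))


def parse_provider_chain(raw: str) -> List[str]:
    """Turn 'ookla, ndt7' into ['ookla', 'ndt7'], dropping blanks and duplicates."""
    chain: List[str] = []
    for name in raw.split(","):
        name = name.strip().lower()
        if name and name not in chain:
            chain.append(name)
    return chain


def run_with_fallback(
    chain: List[str], providers: Dict[str, Callable[[], dict]]
) -> dict:
    """Try each provider in order and return the first successful result."""
    errors: Dict[str, Exception] = {}
    for name in chain:
        runner = providers.get(name)
        if runner is None:
            errors[name] = LookupError(f"unknown provider: {name}")
            continue
        try:
            result = runner()
        except Exception as exc:  # remember the error and try the next provider
            errors[name] = exc
            continue
        return {"provider": name, **result}
    raise AllProvidersFailed(errors)
```

Keeping the per-provider errors on the exception makes it possible to review each provider's failure when the whole chain is exhausted.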
Alert Manager (src/services/alert_manager.py)
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Alert provider send failed | Warning | Medium | Provider threw exception | Check provider-specific logs |
| All alert providers failed | Warning | High | Every enabled provider failed | Check connectivity, provider health |
| Thread pool exhausted | Warning | High | Too many pending alerts | Increase thread pool size |
| Alert send timed out | Warning | Medium | Provider didn’t respond | Check provider health, network |
Prevention:
- Test alert providers before enabling
- Monitor alert provider health
- Don’t set threshold too low (avoid alert storms)
- Use cooldown to rate-limit alerts (60+ minutes)
- Verify network connectivity to alert services
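
How a failure threshold and a cooldown interact can be seen in the small model below. It is a sketch of the general idea, with a hypothetical class name and the defaults of 3 failures and 60 minutes taken from this document; the alert manager's real logic may differ.

```python
from typing import Optional


class AlertGate:
    """Decide when a run of failures should produce an alert."""

    def __init__(self, threshold: int = 3, cooldown_minutes: int = 60) -> None:
        if threshold < 1 or cooldown_minutes < 1:
            raise ValueError("threshold and cooldown must be positive integers")
        self.threshold = threshold
        self.cooldown_s = cooldown_minutes * 60
        self.consecutive_failures = 0
        self.last_alert_at: Optional[float] = None

    def record(self, success: bool, now: float) -> bool:
        """Record one test outcome at time `now` (seconds); return True to alert."""
        if success:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return False
        in_cooldown = (
            self.last_alert_at is not None
            and now - self.last_alert_at < self.cooldown_s
        )
        if in_cooldown:
            return False  # suppress repeats to avoid an alert storm
        self.last_alert_at = now
        return True
```

With these defaults, the third consecutive failure raises an alert and further failures stay silent until 60 minutes have passed.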
Alert Providers (src/services/alert_providers.py)
Webhook Provider
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Webhook connection error | Warning | Medium | Network/DNS failure | Check URL, DNS, firewall |
| Webhook timeout | Warning | Medium | Endpoint too slow | Increase timeout, optimize endpoint |
| Webhook HTTP error (4xx/5xx) | Warning | Medium | Endpoint rejected request | Check endpoint logs, auth |
| SSRF protection blocked URL | ValueError | High | URL targets private IP | Use public URLs only |
Gotify Provider
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| Gotify connection error | Warning | Medium | Service unreachable | Check Gotify service, network |
| Gotify timeout | Warning | Medium | No response in time | Check Gotify health, load |
| Gotify authentication failed | Warning | High | Invalid app token | Verify token from Gotify admin |
| Gotify HTTP error | Warning | Medium | Service returned error | Check Gotify logs |
ntfy Provider
| Error | Type | Severity | Cause | Remediation |
| --- | --- | --- | --- | --- |
| ntfy connection error | Warning | Medium | Service unreachable | Check ntfy service, network |
| ntfy timeout | Warning | Medium | No response in time | Check ntfy health, load |
| ntfy HTTP error | Warning | Medium | Service returned error | Check ntfy logs, topic name |
| ntfy topic invalid | ValueError | Medium | Topic contains invalid chars | Use alphanumeric + dash/underscore |
Error Matrix
Severity Levels
| Severity | Impact | Example | Response Time |
| --- | --- | --- | --- |
| Critical | Service cannot start or is completely broken | Missing API key, port conflict | Immediate (blocks deployment) |
| High | Core functionality fails but service continues | Exporter failure, database locked | Within hours (same day) |
| Medium | Degraded functionality or performance impact | Alert send failure, config validation | Within days (this week) |
| Low | Minor issues, warnings, edge cases | Pruning failed, label cardinality | When convenient (backlog) |
Error Categories
| Category | Examples | Typical Cause | Prevention |
| --- | --- | --- | --- |
| Configuration | Missing env var, invalid values | Deployment error, typo | Validate config before deployment |
| Network | Connection error, timeout, DNS | Infrastructure issue | Health checks, retry logic |
| Filesystem | Permission denied, disk full | Resource exhaustion | Monitor disk space, permissions |
| Database | Lock timeout, corruption, full | Concurrent access, disk issue | WAL mode, backups, monitoring |
| Validation | Out of range, wrong type | User input error | Client-side validation, schema |
| Programming | Type error, logic error | Bug in code | Unit tests, static analysis |
Recovery Strategies by Error Type
| Error Type | Strategy | Implementation |
| --- | --- | --- |
| Transient Network | Retry with backoff | speedtest_runner.py (1 retry) |
| Configuration Invalid | Fallback to default | runtime_config.py validation |
| Partial Failure | Continue with successful | result_dispatcher.py |
| Filesystem Error | Graceful degradation | CSV prune failure non-fatal |
| Resource Exhaustion | Rate limiting, cleanup | Alert cooldown, pruning |
| Corruption | Atomic operations | Runtime config atomic write |
Monitoring & Alerts
Key Metrics to Monitor
| Metric | Threshold | Action |
| --- | --- | --- |
| Consecutive speedtest failures | ≥3 | Alert configured via API |
| Exporter failure rate | >10% | Check exporter health |
| Database size | >1GB | Enable pruning or manual cleanup |
| CSV file size | >100MB | Enable pruning or rotation |
| Alert send failure rate | >50% | Check provider connectivity |
| API error rate (5xx) | >1% | Review logs, check resources |
| Disk usage | >80% | Free space or expand volume |
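
The thresholds in the table above could be encoded as data and evaluated against collected readings, as in this sketch. The metric keys are hypothetical names, rates are expressed as fractions between 0 and 1, and sizes are treated as binary units (GiB and MiB), which is an assumption about what the table means by GB and MB.

```python
import operator
from typing import Callable, Dict, List, Tuple

GIB = 1024 ** 3
MIB = 1024 ** 2

# metric key (hypothetical) -> (comparison, threshold, action from the table)
RULES: Dict[str, Tuple[Callable[[float, float], bool], float, str]] = {
    "consecutive_failures": (operator.ge, 3, "Alert configured via API"),
    "exporter_failure_rate": (operator.gt, 0.10, "Check exporter health"),
    "database_bytes": (operator.gt, 1 * GIB, "Enable pruning or manual cleanup"),
    "csv_bytes": (operator.gt, 100 * MIB, "Enable pruning or rotation"),
    "alert_failure_rate": (operator.gt, 0.50, "Check provider connectivity"),
    "api_5xx_rate": (operator.gt, 0.01, "Review logs, check resources"),
    "disk_usage": (operator.gt, 0.80, "Free space or expand volume"),
}


def actions_needed(readings: Dict[str, float]) -> List[str]:
    """Return 'metric: action' for every reading that crosses its threshold."""
    return [
        f"{name}: {action}"
        for name, (breached, limit, action) in RULES.items()
        if name in readings and breached(readings[name], limit)
    ]
```

Readings that are missing are simply skipped, so the function can be fed whatever subset of metrics a deployment actually collects.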
Log Analysis Commands
# Find all errors in last hour
docker-compose logs --since=1h | grep ERROR
# Count errors by type
docker-compose logs | grep ERROR | cut -d' ' -f5- | sort | uniq -c | sort -nr
# Find database lock errors
docker-compose logs | grep "Database locked"
# Find authentication failures
docker-compose logs | grep "Invalid API key"
# Find alert failures
docker-compose logs | grep "Alert.*failed"
Troubleshooting Flowcharts
Speedtest Failures
Speedtest fails
├─ Error: "command not found"
│  └─ Fix: Install the Ookla CLI
├─ Error: "timed out"
│  └─ Check: Internet connectivity, ISP throttling
├─ Error: "JSON parse failed"
│  └─ Fix: Update the Ookla CLI
└─ Multiple consecutive failures
   └─ Action: Investigate ISP outage, check alerts
Exporter Failures
Exporter fails
├─ CSV Exporter
│  ├─ Error: "Permission denied"
│  │  └─ Fix: Check logs/ directory permissions
│  └─ Error: "Disk full"
│     └─ Fix: Free space or enable pruning
├─ Prometheus Exporter
│  ├─ Error: "Port in use"
│  │  └─ Fix: Use different port or kill process
│  └─ Error: "Update gauges failed"
│     └─ Check: Result data validity
├─ Loki Exporter
│  ├─ Error: "Connection error"
│  │  └─ Check: Loki service, network, DNS
│  └─ Error: "Push rejected"
│     └─ Check: Loki logs, auth, tenant
└─ SQLite Exporter
   ├─ Error: "Database locked"
   │  └─ Fix: Reduce page_size, check long queries
   └─ Error: "Write failed"
      └─ Check: Disk space, permissions, integrity
API Authentication Failures
API request fails
├─ Status: 401 (Unauthorized)
│  └─ Fix: Add X-Api-Key header
├─ Status: 403 (Forbidden)
│  └─ Fix: Verify API key matches .env
└─ Status: 429 (Rate Limited)
   └─ Fix: Reduce request frequency, implement backoff
Document History
- 2026-05-01: Initial documentation (M4 from ERROR-HANDLING-REVIEW.md)