Runbook
Operational guide for diagnosing and resolving production issues using logs and metrics.
Table of Contents
- Log Interpretation
- Metric Thresholds
- Alert Investigation
- SQLite Health
- Common Failure Scenarios
- Escalation
Log Interpretation
Hermes emits structured log lines at DEBUG, INFO, WARNING, ERROR, and CRITICAL levels.
Key Log Patterns
| Pattern | Level | Meaning |
|---|---|---|
| `Speedtest complete: X.X / X.X Mbps, X.X ms` | INFO | Normal run completed |
| `Speedtest failed` | ERROR | Runner could not complete test |
| `Exporter 'X' could not be initialized` | WARNING | Exporter skipped this cycle |
| `Alert sent successfully via X` | INFO | Alert provider delivered |
| `Alert provider 'X' failed` | ERROR | Provider could not send alert |
| `Runtime config saved` | INFO | UI-driven config change persisted |
| `PRAGMA wal_checkpoint` | DEBUG | SQLite WAL checkpoint triggered |
| `Running VACUUM` | INFO | SQLite fragmentation above 20 % |
| `pool_pending_approx=N` | INFO | Approximate in-flight alert tasks |
Log Levels Summary
- INFO: Normal operational events; no action required.
- WARNING: Degraded operation (e.g., exporter skipped, retry attempted). Monitor for recurrence.
- ERROR: Failed operation; investigate root cause. App continues running.
- CRITICAL: Startup failures (e.g., missing required config). App may not function.
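To get a quick picture of what a log excerpt contains, the message fragments from the Key Log Patterns table can be matched and tallied. The following is a minimal sketch, not part of Hermes: the function names and the short event labels are hypothetical, and it assumes only that the fragments from the table appear verbatim somewhere in each line (the exact line layout is not specified here).

```python
import re
from collections import Counter

# Message fragments from the Key Log Patterns table.
# The event labels on the right are hypothetical names used only in this sketch.
PATTERNS = [
    (re.compile(r"Speedtest complete:"), "run_ok"),
    (re.compile(r"Speedtest failed"), "run_failed"),
    (re.compile(r"Exporter '.*' could not be initialized"), "exporter_skipped"),
    (re.compile(r"Alert sent successfully via"), "alert_sent"),
    (re.compile(r"Alert provider '.*' failed"), "alert_failed"),
    (re.compile(r"Runtime config saved"), "config_saved"),
    (re.compile(r"PRAGMA wal_checkpoint"), "wal_checkpoint"),
    (re.compile(r"Running VACUUM"), "vacuum"),
    (re.compile(r"pool_pending_approx=\d+"), "pool_status"),
]


def classify_line(line):
    """Return the event label for the first matching pattern, or None."""
    for regex, event in PATTERNS:
        if regex.search(line):
            return event
    return None


def tally(lines):
    """Count recognised events across an iterable of log lines."""
    return Counter(e for e in map(classify_line, lines) if e is not None)
```

For example, `tally(open("/var/log/hermes/app.log"))` would summarise the whole file; a high `run_failed` count relative to `run_ok` points to the Alert Investigation steps.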
Metric Thresholds
Prometheus metrics are exposed on PROMETHEUS_PORT (default 8000). Suggested alert thresholds:
| Metric | Label | Warning | Critical |
|---|---|---|---|
| `hermes_download_mbps` | `server` | < 50 % of baseline | < 20 % of baseline |
| `hermes_upload_mbps` | `server` | < 50 % of baseline | < 20 % of baseline |
| `hermes_ping_ms` | `server` | > 2× baseline | > 5× baseline |
| `hermes_consecutive_failures_total` | (none) | ≥ 3 | ≥ failure_threshold |
Tip: Set `PROMETHEUS_DISABLE_LABELS=true` to reduce label cardinality if using a high-cardinality server pool.
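The baseline-relative thresholds for throughput and latency can be expressed as two small functions. This is an illustrative sketch only: the function names are hypothetical, and how the baseline is measured (for example, a rolling median of healthy runs) is an assumption left to the operator.

```python
def classify_throughput(value_mbps, baseline_mbps):
    """Download/upload: warning below 50 % of baseline, critical below 20 %."""
    if baseline_mbps <= 0:
        raise ValueError("baseline must be positive")
    ratio = value_mbps / baseline_mbps
    if ratio < 0.20:
        return "critical"
    if ratio < 0.50:
        return "warning"
    return "ok"


def classify_ping(value_ms, baseline_ms):
    """Latency: warning above 2x baseline, critical above 5x."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    ratio = value_ms / baseline_ms
    if ratio > 5:
        return "critical"
    if ratio > 2:
        return "warning"
    return "ok"
```

With a 100 Mbps baseline, a reading of 40 Mbps is a warning and 19 Mbps is critical; a reading at exactly 50 % of baseline is still considered healthy because the comparison is strict.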
Alert Investigation
When an alert fires (consecutive_failures >= failure_threshold):
- Check recent logs for `Speedtest failed` entries: `grep -i "speedtest failed" /var/log/hermes/app.log | tail -20`
- Verify network connectivity from the host running Hermes.
- Check speedtest-cli is accessible and not rate-limited: `speedtest-cli --simple`
- Confirm alert provider config via `GET /api/alerts/config`; ensure `enabled: true` and all provider URLs/tokens are set.
- Trigger a manual test to rule out transient failure: `curl -X POST http://localhost:8080/api/trigger -H "X-Api-Key: <key>"`
- Check health endpoint: `curl http://localhost:8080/healthz`
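When reading back through recent runs, it helps to reproduce the alert condition by hand. The sketch below assumes, as the word "consecutive" suggests, that the counter resets to zero on any successful run; that reset behaviour and the function names are assumptions, not confirmed Hermes internals.

```python
def consecutive_failures(results):
    """Count failures at the end of `results`, oldest first (True = success)."""
    count = 0
    for ok in reversed(results):
        if ok:
            break  # a success ends the current failure streak
        count += 1
    return count


def should_alert(results, failure_threshold=3):
    """Mirror the condition consecutive_failures >= failure_threshold."""
    return consecutive_failures(results) >= failure_threshold
```

For instance, `should_alert([True, False, False, False])` is true at the default threshold of 3, while a single success in the middle of a run of failures keeps the streak below the threshold.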
SQLite Health
The SQLite database at data/hermes.db uses WAL mode. The exporter automatically:
- Runs `PRAGMA wal_checkpoint(TRUNCATE)` after any pruning cycle.
- Runs `VACUUM` when fragmentation exceeds 20 % (`freelist_count / page_count > 0.20`).
Manual Health Check
sqlite3 data/hermes.db "PRAGMA integrity_check;"
sqlite3 data/hermes.db "PRAGMA page_count; PRAGMA freelist_count;"
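The same checks can be scripted with Python's standard `sqlite3` module, which also makes the fragmentation ratio explicit. This is a minimal sketch: the function name and the returned dictionary keys are hypothetical, and the 0.20 threshold simply repeats the figure quoted above.

```python
import sqlite3


def sqlite_health(db_path, vacuum_threshold=0.20):
    """Run integrity_check and compute freelist_count / page_count."""
    # Note: sqlite3.connect creates the file if it does not already exist.
    conn = sqlite3.connect(db_path)
    try:
        integrity = conn.execute("PRAGMA integrity_check;").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count;").fetchone()[0]
        freelist_count = conn.execute("PRAGMA freelist_count;").fetchone()[0]
    finally:
        conn.close()
    # An empty database has no pages; treat it as unfragmented.
    fragmentation = freelist_count / page_count if page_count else 0.0
    return {
        "integrity_ok": integrity == "ok",
        "page_count": page_count,
        "freelist_count": freelist_count,
        "fragmentation": fragmentation,
        "needs_vacuum": fragmentation > vacuum_threshold,
    }
```

Calling `sqlite_health("data/hermes.db")` returns a small report; `needs_vacuum` being true corresponds to the condition under which the exporter would run `VACUUM` on its own.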
Signs of Problems
| Symptom | Likely Cause | Action |
|---|---|---|
| `database disk image is malformed` | Corruption | Restore from backup; re-initialise |
| WAL file growing indefinitely | Checkpoint blocked by long reader | Restart Hermes |
| Very slow queries | High fragmentation | Run VACUUM manually |
Common Failure Scenarios
Hermes fails to start
- Check `SPEEDTEST_INTERVAL_MINUTES` is a valid integer (≥ 1).
- Verify `PROMETHEUS_PORT` is not already in use.
- Confirm `.env` or environment variables are loaded.
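The first two checks above can be run as a pre-flight script before starting the service. The sketch below is illustrative rather than Hermes code: the function names are hypothetical, and probing the port on `127.0.0.1` is an assumption about the interface Hermes binds to.

```python
import socket


def validate_interval(raw):
    """Return the interval as an int, or raise ValueError if it is not an integer >= 1."""
    if raw is None:
        raise ValueError("SPEEDTEST_INTERVAL_MINUTES is not set")
    value = int(raw)  # raises ValueError for non-integer strings such as "abc" or "5.0"
    if value < 1:
        raise ValueError(f"SPEEDTEST_INTERVAL_MINUTES must be >= 1, got {value}")
    return value


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

A typical call would be `validate_interval(os.environ.get("SPEEDTEST_INTERVAL_MINUTES"))` followed by `port_is_free(8000)`, run from the same shell and user that will launch Hermes so that the same environment is inspected.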
No results being written
- Ensure at least one exporter is listed in `ENABLED_EXPORTERS`.
- Confirm `data/` directory exists and is writable.
- Check for `Exporter 'X' could not be initialized` warnings in logs.
Alerts not firing
- Confirm `ALERT_ENABLED=true`.
- Verify `ALERT_FAILURE_THRESHOLD` is set to expected value (default 3).
- Test delivery: `POST /api/alerts/test`.
- Check `pool_pending_approx` in logs; if always 0, thread pool may be exhausted.
High Prometheus label cardinality
- Set `PROMETHEUS_DISABLE_LABELS=true` to collapse all label values to empty strings.
- This reduces each metric to a single time series regardless of server.
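The effect on cardinality can be seen by counting distinct (metric, label value) pairs, since each distinct pair is one time series. This is a toy model in plain Python, not the Prometheus client; the function name and the server names are made up for illustration.

```python
def series_key(metric, server, disable_labels=False):
    """Identity of a time series: metric name plus the `server` label value."""
    label_value = "" if disable_labels else server
    return (metric, label_value)


# Hypothetical server pool, for illustration only.
servers = ["server-a", "server-b", "server-c"]

with_labels = {series_key("hermes_download_mbps", s) for s in servers}
without_labels = {
    series_key("hermes_download_mbps", s, disable_labels=True) for s in servers
}
```

Here `with_labels` holds three series and `without_labels` holds one, which is the trade-off: lower cardinality at the cost of no longer being able to break results down by server.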
Loki exporter silently skipped
- `LOKI_URL` must be set when `loki` is in `ENABLED_EXPORTERS`.
- Without it, the Loki factory returns `None` and the exporter is skipped with a WARNING.
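The skip behaviour described above follows a common factory pattern, sketched here for orientation. Everything in this block is hypothetical: the function names, the dictionary standing in for an exporter object, and the assumption that `ENABLED_EXPORTERS` is a comma-separated list are illustrations, not the actual Hermes implementation.

```python
import logging

logger = logging.getLogger("hermes.sketch")  # hypothetical logger name


def make_loki_exporter(env):
    """Hypothetical factory: return None when LOKI_URL is missing or empty."""
    url = env.get("LOKI_URL")
    if not url:
        return None
    return {"type": "loki", "url": url}  # stand-in for a real exporter object


def build_exporters(env):
    """Build every enabled exporter, skipping those whose factory returns None."""
    factories = {"loki": make_loki_exporter}
    # Assumption: ENABLED_EXPORTERS is a comma-separated list of names.
    names = [n.strip() for n in env.get("ENABLED_EXPORTERS", "").split(",") if n.strip()]
    exporters = []
    for name in names:
        factory = factories.get(name)
        exporter = factory(env) if factory else None
        if exporter is None:
            logger.warning("Exporter '%s' could not be initialized", name)
            continue
        exporters.append(exporter)
    return exporters
```

With `{"ENABLED_EXPORTERS": "loki"}` and no `LOKI_URL`, the result is an empty list plus a single WARNING, which is why the symptom is easy to miss unless logs are checked.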
Escalation
If an issue cannot be resolved via the steps above:
- Collect logs from the past 24 hours.
- Export current Prometheus metrics snapshot.
- Note the current runtime config (`GET /api/config`).
- File an issue in the project repository with the above artefacts.
See also: Error Catalog | Security Audit