Runbook

Operational guide for diagnosing and resolving production issues using logs and metrics.

Log Interpretation
Metric Thresholds
Alert Investigation
SQLite Health
Common Failure Scenarios
Escalation

Log Interpretation

Hermes emits structured log lines at DEBUG, INFO, WARNING, ERROR, and CRITICAL levels.

Key Log Patterns

| Pattern | Level | Meaning |
|---|---|---|
| `Speedtest complete: X.X / X.X Mbps, X.X ms` | INFO | Normal run completed |
| `Speedtest failed` | ERROR | Runner could not complete test |
| `Exporter 'X' could not be initialized` | WARNING | Exporter skipped this cycle |
| `Alert sent successfully via X` | INFO | Alert provider delivered |
| `Alert provider 'X' failed` | ERROR | Provider could not send alert |
| `Runtime config saved` | INFO | UI-driven config change persisted |
| `PRAGMA wal_checkpoint` | DEBUG | SQLite WAL checkpoint triggered |
| `Running VACUUM` | INFO | SQLite fragmentation above 20 % |
| `pool_pending_approx=N` | INFO | Approximate in-flight alert tasks |
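
When scanning logs in bulk, it can help to pull the numbers out of the completion line. The sketch below assumes the layout shown in the first pattern, and further assumes that the two Mbps values are download followed by upload; neither the order nor the exact spacing is confirmed by this runbook, so verify against real log lines. The names `parse_speedtest_line` and `SpeedtestResult` are illustrative.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "Speedtest complete: <download> / <upload> Mbps, <ping> ms".
# The download/upload order is an assumption, not taken from the Hermes source.
SPEEDTEST_RE = re.compile(
    r"Speedtest complete: (?P<down>\d+(?:\.\d+)?) / (?P<up>\d+(?:\.\d+)?) Mbps, "
    r"(?P<ping>\d+(?:\.\d+)?) ms"
)


class SpeedtestResult(NamedTuple):
    download_mbps: float
    upload_mbps: float
    ping_ms: float


def parse_speedtest_line(line: str) -> Optional[SpeedtestResult]:
    """Return the parsed result, or None if the line is not a completion entry."""
    match = SPEEDTEST_RE.search(line)
    if match is None:
        return None
    return SpeedtestResult(
        float(match["down"]), float(match["up"]), float(match["ping"])
    )
```

Lines that do not match (for example `Speedtest failed`) return `None`, so the function can be mapped over a whole log file and the results filtered.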

Log Levels Summary

DEBUG: Diagnostic detail (e.g., WAL checkpoints); no action required.
INFO: Normal operational events; no action required.
WARNING: Degraded operation (e.g., exporter skipped, retry attempted). Monitor for recurrence.
ERROR: Failed operation; investigate root cause. App continues running.
CRITICAL: Startup failures (e.g., missing required config). App may not function.

Metric Thresholds

Prometheus metrics are exposed on PROMETHEUS_PORT (default 8000). Suggested alert thresholds:

| Metric | Label | Warning | Critical |
|---|---|---|---|
| `hermes_download_mbps` | `server` | < 50 % of baseline | < 20 % of baseline |
| `hermes_upload_mbps` | `server` | < 50 % of baseline | < 20 % of baseline |
| `hermes_ping_ms` | `server` | > 2× baseline | > 5× baseline |
| `hermes_consecutive_failures_total` | (none) | ≥ 3 | ≥ failure_threshold |
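
The baseline-relative thresholds in the table can be expressed as two small functions. This is a sketch for sanity-checking readings by hand, not Hermes code; the function names are hypothetical and the baseline is whatever value you consider normal for the link.

```python
def classify_throughput(value_mbps: float, baseline_mbps: float) -> str:
    """Classify a download or upload reading against its baseline."""
    if value_mbps < 0.20 * baseline_mbps:
        return "critical"  # below 20 % of baseline
    if value_mbps < 0.50 * baseline_mbps:
        return "warning"   # below 50 % of baseline
    return "ok"


def classify_ping(value_ms: float, baseline_ms: float) -> str:
    """Classify a latency reading against its baseline."""
    if value_ms > 5 * baseline_ms:
        return "critical"  # more than 5x baseline
    if value_ms > 2 * baseline_ms:
        return "warning"   # more than 2x baseline
    return "ok"
```

For example, with a 100 Mbps baseline a 40 Mbps download is a warning and a 10 Mbps download is critical.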

Tip: Set PROMETHEUS_DISABLE_LABELS=true to reduce label cardinality if using a high-cardinality server pool.

Alert Investigation

When an alert fires (consecutive_failures >= failure_threshold):

Check recent logs for `Speedtest failed` entries:
```
grep -i "speedtest failed" /var/log/hermes/app.log | tail -20
```

Verify network connectivity from the host running Hermes.
Check that speedtest-cli is accessible and not rate-limited:
```
speedtest-cli --simple
```
Confirm the alert provider config via GET /api/alerts/config: enabled must be true, and all provider URLs and tokens must be set.

Trigger a manual test to rule out transient failure:
```
curl -X POST http://localhost:8080/api/trigger -H "X-Api-Key: <key>"
```

Check health endpoint:
```
curl http://localhost:8080/healthz
```
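
The two curl calls above can also be scripted. The following sketch uses only the standard library and mirrors the URLs and the `X-Api-Key` header from the commands; the helper names are hypothetical, and the response bodies are not interpreted because their format is not described in this runbook.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples


def build_trigger_request(api_key: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/trigger request with the API key header."""
    return urllib.request.Request(
        f"{base_url}/api/trigger",
        method="POST",
        headers={"X-Api-Key": api_key},
    )


def build_health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET /healthz request."""
    return urllib.request.Request(f"{base_url}/healthz", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError when the host cannot be reached.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send(build_health_request())` followed by `send(build_trigger_request("<key>"))` reproduces the manual sequence.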

SQLite Health

The SQLite database at data/hermes.db uses WAL mode. The exporter automatically:

Runs PRAGMA wal_checkpoint(TRUNCATE) after any pruning cycle.
Runs VACUUM when fragmentation exceeds 20 % (freelist_count / page_count > 0.20).

Manual Health Check

```
sqlite3 data/hermes.db "PRAGMA integrity_check;"
sqlite3 data/hermes.db "PRAGMA page_count; PRAGMA freelist_count;"
```
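
To turn the two PRAGMA values into the fragmentation figure used by the VACUUM rule, the ratio can be computed with Python's built-in `sqlite3` module. This is a minimal sketch of the stated formula (freelist_count / page_count > 0.20), not the exporter's actual code; the function names are illustrative.

```python
import sqlite3

VACUUM_THRESHOLD = 0.20  # from the rule above: freelist_count / page_count > 0.20


def fragmentation_ratio(conn: sqlite3.Connection) -> float:
    """Return freelist_count / page_count, or 0.0 for an empty database."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    freelist_count = conn.execute("PRAGMA freelist_count").fetchone()[0]
    if page_count == 0:
        return 0.0
    return freelist_count / page_count


def needs_vacuum(conn: sqlite3.Connection) -> bool:
    """True when the fragmentation ratio exceeds the 20 % threshold."""
    return fragmentation_ratio(conn) > VACUUM_THRESHOLD
```

Open the database with `sqlite3.connect("data/hermes.db")` and pass the connection to `fragmentation_ratio` to get the current value.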

Signs of Problems

| Symptom | Likely Cause | Action |
|---|---|---|
| `database disk image is malformed` | Corruption | Restore from backup; re-initialise |
| WAL file growing indefinitely | Checkpoint blocked by a long-running reader | Restart Hermes |
| Very slow queries | High fragmentation | Run `VACUUM` manually |

Common Failure Scenarios

Hermes fails to start

Check SPEEDTEST_INTERVAL_MINUTES is a valid integer (≥ 1).
Verify PROMETHEUS_PORT is not already in use.
Confirm .env or environment variables are loaded.
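
The first two checks above can be run before starting Hermes. The sketch below is an illustration under assumptions: it treats an unset SPEEDTEST_INTERVAL_MINUTES as a problem (Hermes may in fact apply a default), and it tests port availability by attempting a bind. Both helper names are hypothetical.

```python
import socket
from typing import Optional


def interval_problem(raw: Optional[str]) -> Optional[str]:
    """Return a description of what is wrong with the interval, or None if valid."""
    if raw is None:
        # Assumption: flag an unset value; Hermes may fall back to a default.
        return "SPEEDTEST_INTERVAL_MINUTES is not set"
    try:
        value = int(raw)
    except ValueError:
        return f"SPEEDTEST_INTERVAL_MINUTES={raw!r} is not an integer"
    if value < 1:
        return f"SPEEDTEST_INTERVAL_MINUTES={value} must be >= 1"
    return None


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

A typical call is `interval_problem(os.environ.get("SPEEDTEST_INTERVAL_MINUTES"))` together with `port_is_free(8000)` for the default PROMETHEUS_PORT.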

No results being written

Ensure at least one exporter is listed in ENABLED_EXPORTERS.
Confirm data/ directory exists and is writable.
Check the logs for `Exporter 'X' could not be initialized` warnings.
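
The exporter list and directory checks can be sketched as follows. The code assumes ENABLED_EXPORTERS is a comma-separated list, which this runbook does not state, and the function name is hypothetical.

```python
import os
from typing import List, Mapping


def exporter_problems(env: Mapping[str, str], data_dir: str = "data") -> List[str]:
    """Return human-readable problems that would prevent results being written."""
    problems = []
    # Assumption: ENABLED_EXPORTERS is comma-separated, e.g. "sqlite,prometheus".
    raw = env.get("ENABLED_EXPORTERS", "")
    exporters = [name.strip() for name in raw.split(",") if name.strip()]
    if not exporters:
        problems.append("ENABLED_EXPORTERS lists no exporters")
    if not os.path.isdir(data_dir):
        problems.append(f"{data_dir} does not exist")
    elif not os.access(data_dir, os.W_OK):
        problems.append(f"{data_dir} is not writable")
    return problems
```

An empty list means both checks passed; pass `os.environ` as `env` to check the live environment.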

Alerts not firing

Confirm ALERT_ENABLED=true.
Verify ALERT_FAILURE_THRESHOLD is set to the expected value (default 3).
Test delivery: POST /api/alerts/test.
Check pool_pending_approx in logs: a value that stays at 0 means no alert tasks are being submitted, while a value that keeps growing suggests the thread pool is saturated.
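
When reasoning about why an alert did not fire, it helps to model the trigger condition (consecutive_failures >= failure_threshold). The sketch below is a simplified model, not Hermes code; the class name is hypothetical, and it assumes a successful run resets the streak to zero.

```python
class FailureTracker:
    """Minimal model of the consecutive-failure alert condition."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> bool:
        """Record one run; return True when the alert condition holds."""
        if success:
            # Assumption: any successful run resets the failure streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold
```

Under this model, failures interleaved with successes never reach the threshold, which is one reason an unstable link can fail often without triggering an alert.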

High Prometheus label cardinality

Set PROMETHEUS_DISABLE_LABELS=true to collapse all label values to empty strings.
This reduces each metric to a single time series regardless of server.
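
The effect on series count can be illustrated without a Prometheus client. The sketch below simply counts the distinct (metric, label value) pairs that would exist for one metric; the function name and server names are made up for the example.

```python
from typing import Iterable, Set, Tuple


def series_keys(servers: Iterable[str], disable_labels: bool) -> Set[Tuple[str, str]]:
    """Return the distinct (metric name, server label) pairs for one metric."""
    return {
        ("hermes_download_mbps", "" if disable_labels else server)
        for server in servers
    }
```

With three servers, `series_keys(servers, False)` yields three series and `series_keys(servers, True)` yields one.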

Loki exporter silently skipped

LOKI_URL must be set when loki is in ENABLED_EXPORTERS.
Without it, the Loki factory returns None and the exporter is skipped with a WARNING.
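
The skip behaviour described above follows a common factory pattern, sketched here for illustration. The class, function, and logger names are hypothetical stand-ins and are not taken from the Hermes source; only the LOKI_URL variable and the warning text come from this runbook.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger("hermes.exporters")  # hypothetical logger name


class LokiExporter:
    """Hypothetical stand-in for the real Loki exporter."""

    def __init__(self, url: str) -> None:
        self.url = url


def create_loki_exporter(env: Mapping[str, str]) -> Optional[LokiExporter]:
    """Return an exporter, or None (with a WARNING) when LOKI_URL is missing."""
    url = env.get("LOKI_URL", "").strip()
    if not url:
        logger.warning("Exporter 'loki' could not be initialized")
        return None
    return LokiExporter(url)
```

A caller that receives `None` leaves the exporter out for that cycle, which is why the only visible symptom is the WARNING line.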

Escalation

If an issue cannot be resolved via the steps above:

Collect logs from the past 24 hours.
Export current Prometheus metrics snapshot.
Note the current runtime config (GET /api/config).
File an issue in the project repository with the above artefacts.
