Outage Detection Implementation Plan
Status: Not started
Target: v1.4
Estimated Total Effort: 10–14 hours
Overview
Three-tier passive outage detection wired into the existing scheduler loop. A new OutageDetector
service performs TCP socket probes before each speedtest to distinguish complete connectivity loss
from degraded performance. When an outage is confirmed the speedtest is skipped, an OutageEvent
is persisted to SQLite and CSV, and existing alert providers are notified via new AlertManager
methods. Optional RIPE Stat BGP enrichment and Cloudflare Radar annotation enrichment provide ISP
context in alert messages.
Detection tiers:
| Tier | Condition | Action |
|---|---|---|
| 1 | TCP probe majority-vote fails N consecutive rounds | Skip speedtest; fire CONNECTIVITY_LOST alert |
| 2 | Probe passes; speedtest SLA/anomaly flags degraded results | Enrich alert with ASN/BGP context (opt-in) |
| 3 | Probe passes; speedtest raises socket exception | Fire DNS_FAILURE or SPEEDTEST_SERVER_UNREACHABLE alert |
Design decisions (confirmed):
- Outage declared after N consecutive probe-failure rounds (OUTAGE_PROBE_FAILURE_THRESHOLD, default 2), not immediately; this mirrors the AlertManager.failure_threshold model
- RIPE Stat BGP enrichment is opt-in (OUTAGE_ISP_CHECK_ENABLED=true, default false)
- Cloudflare Radar annotation enrichment is optional behind CLOUDFLARE_API_TOKEN
- Outage suppression (SharedState.outage_in_progress) is internal state, separate from the user-controlled scan-pause toggle
- BGP enrichment is informational only and does not gate whether an alert fires
- OutageEvent records are stored in both SQLite and CSV alongside SpeedResult
- ASN is fetched once at startup via RIPE Stat network-info and cached in SharedState
- BGP and CF check results are cached for 15 minutes to avoid API hammering
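
The 15-minute caching rule for BGP and Cloudflare lookups could look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: TtlCache, get_or_fetch, and the injectable now clock are hypothetical names introduced here, and the plan itself keeps cached values in plain dictionaries on the detector.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

CACHE_TTL = timedelta(minutes=15)  # matches the 15-minute window in the plan


class TtlCache:
    """Hypothetical helper: remembers one value per key until the TTL expires."""

    def __init__(self, ttl: timedelta = CACHE_TTL,
                 now: Callable[[], datetime] = lambda: datetime.now(timezone.utc)):
        self._ttl = ttl
        self._now = now  # injectable clock so tests do not need to sleep
        self._entries = {}  # key -> (value, time the value was fetched)

    def get_or_fetch(self, key: str, fetch: Callable[[str], object]) -> object:
        now = self._now()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh: skip the remote call
        value = fetch(key)  # expired or never fetched: call the API once
        self._entries[key] = (value, now)
        return value
```

Passing the clock in as a parameter is a design choice made for this sketch so that the TTL expiry test can advance time without waiting.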
External APIs evaluated:
| API | Decision | Reason |
|---|---|---|
| ipapi.co | Rejected | Free tier explicitly “not for production use” |
| ip-api.com | Rejected | HTTP-only on free tier (OWASP A02 cleartext transmission) |
| Downdetector | Rejected | Commercial API only; no public access |
| RIPE Stat network-info | Selected | Free, no key required, HTTPS, returns ASN from public IP |
| RIPE Stat bgpupdate-activity | Selected (opt-in) | Free, no key required, HTTPS, BGP instability proxy |
| Cloudflare Radar annotations | Selected (optional) | Free with API key; curated outage events |
New Environment Variables
| Variable | Type | Default | Description |
|---|---|---|---|
| OUTAGE_PROBE_HOSTS | csv list | 1.1.1.1:53,8.8.8.8:53,9.9.9.9:53 | TCP probe endpoints |
| OUTAGE_PROBE_TIMEOUT | int | 3 | Seconds per probe attempt |
| OUTAGE_PROBE_FAILURE_THRESHOLD | int | 2 | Consecutive failure rounds to declare DOWN |
| OUTAGE_PROBE_QUORUM | int | 2 | Number of probes that must fail per round to count as a failure |
| OUTAGE_ISP_CHECK_ENABLED | bool | false | Enable RIPE Stat BGP enrichment |
| CLOUDFLARE_API_TOKEN | str/None | None | Enables Cloudflare Radar annotation enrichment |
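
As an illustration of how the OUTAGE_PROBE_HOSTS value might be turned into probe endpoints, here is a minimal sketch. parse_probe_hosts is a hypothetical helper written for this example; the project's own _get_csv_list() helper is not shown in this plan and may behave differently. The sketch assumes IPv4 addresses or hostnames in host:port form and does not handle bracketed IPv6 literals.

```python
import os

DEFAULT_PROBE_HOSTS = "1.1.1.1:53,8.8.8.8:53,9.9.9.9:53"


def parse_probe_hosts(raw: str) -> list[tuple[str, int]]:
    """Split a comma-separated "host:port" string into (host, port) pairs."""
    endpoints = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = item.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {item!r}")
        endpoints.append((host, int(port)))  # int() rejects non-numeric ports
    return endpoints


# Fall back to the documented default when the variable is absent.
probe_hosts = parse_probe_hosts(os.environ.get("OUTAGE_PROBE_HOSTS", DEFAULT_PROBE_HOSTS))
```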
Phase 1 — Data Layer
Step 1.1: OutageEventType Constants
Status: Not started
Estimated Effort: 0.5 hours
Priority: High — required by all other steps
Changes Required:
- Add OutageEventType(StrEnum) enum after ProviderType with values: CONNECTIVITY_LOST, CONNECTIVITY_RESTORED, SPEEDTEST_SERVER_UNREACHABLE, DNS_FAILURE
- Add probe default constants: DEFAULT_PROBE_HOSTS, DEFAULT_PROBE_TIMEOUT, DEFAULT_PROBE_FAILURE_THRESHOLD, DEFAULT_PROBE_QUORUM
Files to modify:
- src/constants.py: add enum after ProviderType
Test coverage:
- OutageEventType members accessible as strings (StrEnum behaviour)
- All four values present and unique
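
A possible shape for these constants is sketched below. The member names and default values come from this plan; the lowercase string values are an assumption, since the plan does not specify them. The fallback for interpreters older than Python 3.11 (where enum.StrEnum is unavailable) is also an addition made for this example.

```python
try:
    from enum import StrEnum  # standard library from Python 3.11
except ImportError:  # assumption: emulate StrEnum on older interpreters
    from enum import Enum

    class StrEnum(str, Enum):
        pass


class OutageEventType(StrEnum):
    # String values are assumed; only the member names appear in the plan.
    CONNECTIVITY_LOST = "connectivity_lost"
    CONNECTIVITY_RESTORED = "connectivity_restored"
    SPEEDTEST_SERVER_UNREACHABLE = "speedtest_server_unreachable"
    DNS_FAILURE = "dns_failure"


# Defaults mirror the New Environment Variables table.
DEFAULT_PROBE_HOSTS = ["1.1.1.1:53", "8.8.8.8:53", "9.9.9.9:53"]
DEFAULT_PROBE_TIMEOUT = 3
DEFAULT_PROBE_FAILURE_THRESHOLD = 2
DEFAULT_PROBE_QUORUM = 2
```

Because each member is also a str, comparisons such as OutageEventType.DNS_FAILURE == "dns_failure" hold, which is the behaviour the first test bullet relies on.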
Step 1.2: OutageEvent Model
Status: Not started
Estimated Effort: 0.5 hours
Priority: High
Changes Required:
- Create src/models/outage_event.py with OutageEvent dataclass
- Fields: event_type: OutageEventType, timestamp: datetime, duration_seconds: float | None, isp_name: str | None, asn: str | None, bgp_unstable: bool | None, cloudflare_outage_desc: str | None, probe_results: str
Files to modify:
- src/models/outage_event.py (new file)
Test coverage:
- OutageEvent instantiation with required fields only
- OutageEvent instantiation with all optional fields populated
Step 1.3: Config — New Env Vars
Status: Not started
Estimated Effort: 0.5 hours
Priority: High
Changes Required:
- Add the six new env vars to src/config.py after the alerting section using existing _get_int(), _get_bool(), _get_str(), and _get_csv_list() helpers
Files to modify:
- src/config.py: add six new attributes after alerting section
Test coverage:
- Default values applied when env vars absent
- OUTAGE_PROBE_FAILURE_THRESHOLD respects int coercion
- OUTAGE_PROBE_QUORUM defaults to 2
- OUTAGE_ISP_CHECK_ENABLED defaults to False
- CLOUDFLARE_API_TOKEN returns None when unset
Step 1.4: SharedState — Outage State Fields
Status: Not started
Estimated Effort: 0.5 hours
Priority: High
Changes Required:
- Add module-level globals after existing globals: _outage_in_progress: bool = False, _outage_start_time: datetime | None = None
- Add thread-safe getter/setter functions: set_outage_in_progress(), get_outage_in_progress(), set_outage_start_time(), get_outage_start_time()
Files to modify:
- src/shared_state.py: add globals and accessor functions
Test coverage:
- get_outage_in_progress() returns False by default
- set_outage_in_progress(True) / get_outage_in_progress() round-trip
- get_outage_start_time() returns None by default
- Thread-safety: concurrent reads and writes do not corrupt state
Phase 2 — Detection Service
Step 2.1: OutageDetector Service
Status: Not started
Estimated Effort: 3–4 hours
Priority: High
Changes Required:
- Create src/services/outage_detector.py with OutageDetector class
- check_connectivity() → ConnectivityStatus: performs TCP socket.create_connection() probes against OUTAGE_PROBE_HOSTS; applies majority-vote quorum; increments or resets _consecutive_probe_failures; returns UP or DOWN. Declares DOWN only after OUTAGE_PROBE_FAILURE_THRESHOLD consecutive failure rounds
- get_public_ip() → str | None: HTTPS GET https://stat.ripe.net/data/network-info/data.json?resource={local_ip}; called once at startup; result cached in SharedState
- get_isp_asn(ip: str) → str | None: parses ASN from RIPE Stat network-info response
- check_bgp_stability(asn: str) → bool: GET https://stat.ripe.net/data/bgpupdate-activity/data.json?resource=AS{asn}; result cached 15 minutes; returns True if BGP update activity is anomalously high
- check_cloudflare_outage(asn: str) → str | None: GET https://api.cloudflare.com/client/v4/radar/annotations/outages?asns={asn} with Authorization: Bearer {CLOUDFLARE_API_TOKEN}; returns annotation description or None; only called when token is set; result cached 15 minutes
- Internal state: _consecutive_probe_failures: int, _bgp_cache: dict[str, tuple[bool, datetime]], _cf_cache: dict[str, tuple[str | None, datetime]]
Files to modify:
- src/services/outage_detector.py (new file)
Test coverage:
- check_connectivity() returns DOWN when quorum of probes fail
- check_connectivity() returns UP when fewer than quorum fail
- Consecutive failure counter increments; resets on first UP round
- DOWN not declared until OUTAGE_PROBE_FAILURE_THRESHOLD consecutive failure rounds
- Single UP round restores UP state
- get_isp_asn() result cached; get_public_ip() only called once
- BGP cache honours 15-minute TTL; fresh call made after expiry
- CF Radar not called when CLOUDFLARE_API_TOKEN is unset
- CF Radar result cached 15 minutes
- All HTTP calls use HTTPS; plain HTTP URLs rejected
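
The quorum and consecutive-round logic of check_connectivity() could be sketched as below. This covers only the probe path, not the RIPE Stat or Cloudflare calls. Several details are assumptions made for the example: the ConnectivityStatus enum shape, the constructor signature, the injectable probe parameter (added so tests avoid real network traffic), and running probes sequentially rather than in parallel.

```python
import socket
from enum import Enum


class ConnectivityStatus(Enum):  # assumed shape; the plan names only UP and DOWN
    UP = "up"
    DOWN = "down"


def tcp_probe(host: str, port: int, timeout: float) -> bool:
    """Return True when a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refusals and unreachable networks
        return False


class OutageDetector:
    def __init__(self, hosts, timeout=3, quorum=2, failure_threshold=2, probe=tcp_probe):
        self._hosts = hosts  # list of (host, port) pairs
        self._timeout = timeout
        self._quorum = quorum
        self._failure_threshold = failure_threshold
        self._probe = probe  # hypothetical injection point for tests
        self._consecutive_probe_failures = 0

    def check_connectivity(self) -> ConnectivityStatus:
        failed = sum(
            1 for host, port in self._hosts
            if not self._probe(host, port, self._timeout)
        )
        if failed >= self._quorum:
            self._consecutive_probe_failures += 1  # this round counts as a failure
        else:
            self._consecutive_probe_failures = 0  # one good round resets the streak
        if self._consecutive_probe_failures >= self._failure_threshold:
            return ConnectivityStatus.DOWN
        return ConnectivityStatus.UP
```

With the defaults, a first failed round still returns UP and only the second consecutive failed round returns DOWN, which matches the threshold behaviour described in the design decisions.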
Step 2.2: AlertManager — Outage Methods
Status: Not started
Estimated Effort: 1.5 hours
Priority: High
Changes Required:
- Modify record_failure() at line 100 in src/services/alert_manager.py: add early return (no-op) when SharedState.get_outage_in_progress() is True
- Add record_outage_start(isp_name: str | None, bgp_context: str | None) -> None after reset() at line 301: sets SharedState.outage_in_progress = True, resets _consecutive_failures, sends outage-start alert via existing _send_alert_async()
- Add record_outage_recovered(duration_s: float) -> None: sets SharedState.outage_in_progress = False, sends recovery alert with duration in message
Files to modify:
- src/services/alert_manager.py: modify record_failure() at line 100; add two methods after reset() at line 301
Test coverage:
- record_failure() is a no-op when outage_in_progress = True
- record_failure() operates normally when outage_in_progress = False
- record_outage_start() sets SharedState.outage_in_progress = True
- record_outage_start() resets _consecutive_failures to zero
- record_outage_start() dispatches alert with ISP name and BGP context in message
- record_outage_recovered() sets SharedState.outage_in_progress = False
- record_outage_recovered() dispatches alert with duration in message
- Cooldown still applies to outage-start alert
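
The suppression behaviour can be illustrated with a standalone stand-in for AlertManager. Everything outside the three method names is an assumption: AlertManagerSketch is a hypothetical class, the outage flag is held on the instance instead of SharedState, the sent list replaces _send_alert_async(), the message wording and the default failure_threshold of 3 are invented for the example, and cooldown handling is omitted.

```python
class AlertManagerSketch:
    """Hypothetical stand-in showing only the outage-related control flow."""

    def __init__(self, failure_threshold: int = 3):  # default value is assumed
        self.failure_threshold = failure_threshold
        self.outage_in_progress = False  # stands in for SharedState.outage_in_progress
        self._consecutive_failures = 0
        self.sent = []  # stands in for _send_alert_async()

    def record_failure(self, error: str) -> None:
        if self.outage_in_progress:
            return  # outage already alerted; do not count speedtest failures
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self.sent.append(f"Speedtest failing: {error}")

    def record_outage_start(self, isp_name, bgp_context) -> None:
        self.outage_in_progress = True
        self._consecutive_failures = 0
        parts = ["Connectivity lost"]
        if isp_name:
            parts.append(f"ISP: {isp_name}")
        if bgp_context:
            parts.append(f"BGP: {bgp_context}")  # informational only
        self.sent.append("; ".join(parts))

    def record_outage_recovered(self, duration_s: float) -> None:
        self.outage_in_progress = False
        self.sent.append(f"Connectivity restored after {duration_s:.0f} s")
```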
Phase 3 — Integration
Step 3.1: Wire OutageDetector into run_once()
Status: Not started
Estimated Effort: 2 hours
Priority: High
Changes Required:
- Add outage_detector: OutageDetector | None = None parameter to run_once() at line 159
- At the top of run_once() body, call outage_detector.check_connectivity() when detector is not None
- If ConnectivityStatus.DOWN: dispatch OutageEvent(CONNECTIVITY_LOST, ...), call alert_manager.record_outage_start(), return early (skip speedtest)
- If state transitions from DOWN to UP: dispatch OutageEvent(CONNECTIVITY_RESTORED, duration=...), call alert_manager.record_outage_recovered()
- In the except RuntimeError block (line 211): inspect exception __cause__; socket.gaierror maps to OutageEventType.DNS_FAILURE; socket.timeout/ConnectionError maps to OutageEventType.SPEEDTEST_SERVER_UNREACHABLE; then fall through to existing alert_manager.record_failure(str(e))
- Construct OutageDetector in _poll_once() at line 264 (or in main() at line 417) and pass through; confirm via _validate_environment() at line 404 that required config is present
Files to modify:
- src/main.py: modify run_once() at line 159; update _poll_once() at line 264 and main() at line 417
Test coverage:
- run_once() returns early without calling speedtest runner when detector returns DOWN
- run_once() dispatches CONNECTIVITY_LOST event on DOWN
- run_once() dispatches CONNECTIVITY_RESTORED event with correct duration when transitioning UP
- socket.gaierror in speedtest produces DNS_FAILURE event type
- socket.timeout in speedtest produces SPEEDTEST_SERVER_UNREACHABLE event type
- outage_detector=None (default) preserves existing behaviour unchanged
Phase 4 — Persistence
Step 4.1: SQLite outage_events Table
Status: Not started
Estimated Effort: 1 hour
Priority: Medium
Changes Required:
- Add _CREATE_OUTAGE_TABLE DDL constant to src/exporters/sqlite_exporter.py after the existing _CREATE_INDEX constant at line 47:
CREATE TABLE IF NOT EXISTS outage_events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
event_type TEXT NOT NULL,
timestamp TEXT NOT NULL,
duration_seconds REAL,
isp_name TEXT,
asn TEXT,
bgp_unstable INTEGER,
cloudflare_outage_desc TEXT,
probe_results TEXT NOT NULL
)
- Append migration tuple to _MIGRATIONS list at line 105: ("add_outage_events_table", _CREATE_OUTAGE_TABLE)
- Add export_outage_event(event: OutageEvent) -> None method after export() at line 153
Files to modify:
- src/exporters/sqlite_exporter.py: add DDL constant after line 47; append to _MIGRATIONS at line 105; add method after line 153
Test coverage:
- Migration applied idempotently (running _init_db() twice does not raise)
- outage_events table created on first _init_db()
- export_outage_event() inserts all column values correctly
- duration_seconds stored as NULL when field is None
- bgp_unstable stored as 0/1 integer
Step 4.2: CSV outage_events Export
Status: Not started
Estimated Effort: 0.5 hours
Priority: Low
Changes Required:
- Add OUTAGE_FIELDNAMES list to src/exporters/csv_exporter.py after FIELDNAMES at line 13
- Add export_outage_event(event: OutageEvent) -> None method after export() at line 71; writes to a separate outage_events.csv file using the same rotation and prune logic as the main results CSV
Files to modify:
- src/exporters/csv_exporter.py: add OUTAGE_FIELDNAMES after line 13; add method after line 71
Test coverage:
- outage_events.csv created when it does not yet exist
- export_outage_event() appends a row with correct field names and values
- File rotation triggers at same size threshold as results.csv
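
The append path might look like the sketch below. append_outage_row is a hypothetical function written for this example; it takes a path and a plain dict, and it leaves out the rotation and prune logic, which the plan says should be shared with the main results CSV. The column order is assumed to follow the OutageEvent field list.

```python
import csv
from pathlib import Path

# Assumed to follow the OutageEvent field order from Step 1.2.
OUTAGE_FIELDNAMES = [
    "event_type", "timestamp", "duration_seconds", "isp_name", "asn",
    "bgp_unstable", "cloudflare_outage_desc", "probe_results",
]


def append_outage_row(path: Path, row: dict) -> None:
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=OUTAGE_FIELDNAMES)
        if is_new:
            writer.writeheader()
        # Missing optional values are written as empty cells.
        writer.writerow({name: row.get(name, "") for name in OUTAGE_FIELDNAMES})
```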
Phase 5 — API
Step 5.1: Outage API Routes
Status: Not started
Estimated Effort: 1.5 hours
Priority: Medium
Changes Required:
- Create src/api/routes/outages.py with router = APIRouter(tags=["outages"])
- GET /outages: paginated list of OutageEvent records from SQLite; mirrors the ResultsPage pattern in src/api/routes/results.py with page and page_size (max 500) query params
- GET /outage-status: returns { "outage_in_progress": bool, "outage_start_time": str | null } from SharedState
- Register router in src/api/main.py at line 156 (after analysis.router): app.include_router(outages.router, prefix="/api")
Files to modify:
- src/api/routes/outages.py (new file)
- src/api/main.py: add include_router call at line 156
Test coverage:
- GET /api/outages returns empty list when no events recorded
- GET /api/outages pagination: page=1&page_size=10 returns correct slice and correct total
- GET /api/outages rejects page_size > 500
- GET /api/outage-status returns outage_in_progress: false by default
- GET /api/outage-status reflects SharedState after record_outage_start() called
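
The pagination rule behind GET /outages can be illustrated without the web framework. paginate is a hypothetical helper, and the response keys (items, total, page, page_size) and the default page size of 50 are assumptions, since the ResultsPage pattern it should mirror is not reproduced in this plan. In the real route the bounds would more likely be enforced by query-parameter validation and the slice by a SQL LIMIT/OFFSET.

```python
MAX_PAGE_SIZE = 500  # upper bound stated in the plan


def paginate(items: list, page: int = 1, page_size: int = 50) -> dict:
    """Return one page of items plus the total count (in-memory illustration)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    start = (page - 1) * page_size  # pages are 1-indexed
    return {
        "items": items[start:start + page_size],
        "total": len(items),
        "page": page,
        "page_size": page_size,
    }
```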
Open / Deferred Questions
- Recovery threshold: Current preference is that one successful probe round restores the UP state. Alternative: require M consecutive successes before clearing DOWN to reduce flap noise. Deferred until real-world testing.
- DNS failure sub-type handling: Tier 3 DNS_FAILURE events currently fall through to the existing record_failure() alert path. A fully separate alert flow (skip record_failure(), use outage-style message) is deferred.
- BGP enrichment latency: RIPE Stat bgpupdate-activity data can lag real-time BGP events by several minutes. Acceptable given enrichment is informational only.
- CF Radar editorial lag: Cloudflare Radar annotations are curated and may lag 15–60 minutes, or not appear at all for minor outages. To be documented in .env.example.