When a Region Stops Receiving Traffic, “Resolved” Isn’t the Same as “Recovered”
Math Machine: Revert-First Recovery Corridor Machine
License: CC BY 4.0
Source: https://status.supabase.com/incidents/pqrf96m6fzxk
Facts
The source reports an incident spanning Feb 12, 2026 21:32 UTC → Feb 13, 2026 01:53 UTC, with a follow-up postmortem note posted Feb 14, 2026 03:27 UTC. It describes rising server errors across some US regions, escalating to a state where one region “wasn’t receiving any network traffic,” alongside API request errors elsewhere. The source states that a potential internal networking configuration change was identified and reverted, after which services appeared to recover; the response included requeuing failed jobs and monitoring. Scope (how many customers/projects were affected) is not specified publicly on the incident page. [4]
What we add / What’s new
We separate incident resolution (a fix/revert stops the bleeding) from recovery completion (backlogs drained, retries converged, no silent partials). Treating these as different closure states prevents false “done” declarations when the user-visible world is still catching up. [2], [3], [5]
We frame “region not receiving traffic” as a boundary failure rather than a generic outage: the system can appear healthy on average while one corridor effectively goes dark. This is where tail behavior dominates and global rollups mislead. [1], [5], [7]
We introduce a receipts mindset for operational change: when a network config change is reverted, the closure target should include checkable evidence that the pre-change routing reality is restored (not only that error rates dipped). [3], [6], [8]
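The receipts mindset can be sketched as a single post-revert verification gate. This is a minimal illustration under assumed inputs, not a real tooling API: the config snapshots, traffic ratio, and error-rate numbers are hypothetical placeholders for whatever your routing layer and monitoring actually expose.

```python
def revert_verified(live_config: dict, baseline_config: dict,
                    traffic_ratio: float, error_rate: float,
                    baseline_error_rate: float,
                    traffic_tol: float = 0.05,
                    error_tol: float = 0.01) -> bool:
    """A revert counts as verified only when the pre-change routing reality
    is demonstrably restored: the live config matches the pre-change
    snapshot AND corridor traffic and error rates are back at baseline.
    A dip in error rates alone is not a receipt.
    (All parameter names and tolerances are illustrative assumptions.)"""
    return (live_config == baseline_config
            and abs(traffic_ratio - 1.0) <= traffic_tol
            and error_rate <= baseline_error_rate + error_tol)
```

For example, a corridor still at 40% of its baseline traffic fails the gate even if its error rate has already dipped back to normal.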
Why it matters
For teams running production workloads, a regional traffic failure is not just downtime; it is uncertainty about state: partial requests, retries, queued jobs, and stale dashboards can produce “split reality” where some users see normal behavior while others experience a stalled or inconsistent system. That uncertainty is often more expensive than the headline outage window. [5], [7], [10]
Hypotheses
H1 — The dominant risk is “recovery lag masquerading as resolution”: once the revert happens, the system looks normal, but backlog draining and retries create extended user-visible inconsistency. [2] Falsifier: show that after the “resolved” timestamp, job queues, retry volume, and user-facing error rates return to baseline quickly and remain stable without secondary waves.
H2 — “Traffic not received” events concentrate in a narrow change-set boundary: a small configuration shift can invalidate an entire corridor, while aggregate health metrics still look acceptable. [1] Falsifier: show that broad health indicators reliably detected and bounded the impact early, with no corridor-specific blind spot.
H3 — The most reliable operational control is revert discipline plus post-revert verification receipts: you do not just revert; you prove the corridor is back (routes, error budgets, queue convergence). [3] Falsifier: demonstrate that revert-only response (without explicit verification artifacts) yields equal long-run stability and fewer repeat incidents than revert + receipts.
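H1’s falsifier can be made operational with a simple convergence check on post-resolution telemetry: recovery is complete only when backlog and retry metrics return to baseline and stay there, with no secondary wave. A minimal sketch, assuming queue-depth and retry-volume samples taken at fixed intervals after the “resolved” timestamp; the window size and 10% tolerance are illustrative, not from the source.

```python
def converged(samples, baseline, tolerance, window=3):
    """True if the last `window` samples all sit within `tolerance` of
    `baseline` -- i.e., the metric returned to normal and stayed there."""
    if len(samples) < window:
        return False
    return all(abs(s - baseline) <= tolerance for s in samples[-window:])

def recovery_complete(queue_depth, retry_volume, q_base, r_base):
    """Resolution != recovery: both the job backlog and the retry volume
    must converge to baseline before declaring the incident closed."""
    return (converged(queue_depth, q_base, tolerance=0.1 * max(q_base, 1))
            and converged(retry_volume, r_base, tolerance=0.1 * max(r_base, 1)))
```

A secondary wave (queue depth spiking again after an initial drain) keeps the last-window check failing, so the incident cannot be falsely declared recovered.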
Where it flips (regimes)
Conclusions can invert across four regime boundaries: (1) workloads dominated by queued/background jobs vs. real-time request/response flows; (2) systems with strong idempotency and safe retries vs. systems where retries amplify harm; (3) organizations with explicit region-isolation assumptions vs. those that treat regions as interchangeable; and (4) environments where users can quickly reroute or shift regions vs. those where they cannot. [5], [6], [7], [10]
Math behind it (without math)
The inference trap is averaging: if “most regions recovered,” you can conclude the system is “fine,” while a minority corridor remains effectively offline or inconsistent. Governance requires corridor-aware signals: you track not just error rates, but the shape of recovery—queue convergence, retry decay, and whether state becomes consistent again for the affected cohort. [1], [2], [5], [7]
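The averaging trap can be shown with toy numbers (invented for illustration, not taken from the incident): nine healthy regions and one dark corridor still produce a reassuring global success rate, while the corridor-aware minimum exposes the outage.

```python
# Ten regions: nine healthy, one corridor receiving no successful traffic.
# Each tuple is (successful_requests, total_requests); numbers are invented.
regions = {f"us-{i}": (9_900, 10_000) for i in range(9)}
regions["us-east-2"] = (0, 10_000)  # the dark corridor

ok = sum(s for s, _ in regions.values())
total = sum(t for _, t in regions.values())
global_success = ok / total  # ~0.89 overall -- the rollup looks "mostly fine"

# Corridor-aware view: the worst region, not the average, drives governance.
worst = min(s / t for s, t in regions.values())  # 0.0 -- one corridor is dark
```

Governance on `worst` (or per-region error budgets) fails loudly here; governance on `global_success` quietly passes.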
Closure target
“Settled/done” means a checkable bundle exists showing: (a) the reverted configuration state is confirmed in effect, (b) region-level traffic and error rates match pre-incident baselines, (c) backlog and retries converged (requeue complete, no long-tail stragglers), (d) customer-visible invariants hold again (no partial reads, stale state, or hidden throttling), and (e) a bounded change-control rule exists to prevent recurrence of the same corridor break. [2], [6], [7], [8]
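The closure target above can be encoded so that “settled/done” is a computed property of evidence rather than a judgment call. A sketch under assumed names: the five fields mirror items (a)–(e), but how each receipt is produced is left to your incident process.

```python
from dataclasses import dataclass

@dataclass
class ClosureBundle:
    config_confirmed: bool     # (a) reverted configuration verified in effect
    traffic_at_baseline: bool  # (b) region traffic/errors match pre-incident
    backlog_converged: bool    # (c) requeue complete, no long-tail stragglers
    invariants_hold: bool      # (d) no partial reads, stale state, throttling
    change_rule_added: bool    # (e) change-control rule to prevent recurrence

    def settled(self) -> bool:
        """'Settled/done' requires every receipt, not a quorum of them."""
        return all((self.config_confirmed, self.traffic_at_baseline,
                    self.backlog_converged, self.invariants_hold,
                    self.change_rule_added))
```

Any single missing receipt (for example, a backlog that has not yet drained) keeps the incident open, which is exactly the resolution-vs-recovery distinction this piece argues for.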
References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026.
[2] R. Figurelli, “Time-to-Okay (TTO) as an Agile Health Metric: Measuring Recovery Across Multitime Clocks,” preprint, 2026.
[3] R. Figurelli, “The Rollback Ratio: Turning Irreversibility into Managed Risk,” preprint, 2026.
[4] Supabase, “Outage in US-East-2 (Ohio),” incident report, 2026.
[5] B. Beyer, C. Jones, J. Petoff, and N. Murphy, Site Reliability Engineering, book, 2016.
[6] NIST, “Computer Security Incident Handling Guide (SP 800-61 Rev. 2),” guideline, 2012.
[7] NIST, “Information Security Continuous Monitoring (ISCM) for Federal Information Systems and Organizations (SP 800-137),” guideline, 2011.
[8] NIST, “Security and Privacy Controls for Information Systems and Organizations (SP 800-53 Rev. 5),” standard, 2020.
[9] ISO/IEC, “Information Security Incident Management,” standard, 2016.
[10] AXELOS, “ITIL 4 Foundation,” framework, 2019.
