When a Region Stops Receiving Traffic, “Resolved” Isn’t the Same as “Recovered”
Math Machine: Revert-First Recovery Corridor Machine
License: CC BY 4.0
Source: https://status.supabase.com/incidents/pqrf96m6fzxk
Facts
The source reports an incident spanning Feb 12, 2026 21:32 UTC → Feb 13, 2026 01:53 UTC, with a follow-up postmortem note posted Feb 14, 2026 03:27 UTC. It describes rising server errors across some US regions, escalating to a state where one region “wasn’t receiving any network traffic,” alongside API request errors elsewhere. The source states that a potential internal networking configuration change was identified and reverted, after which services appeared to recover; the response included requeuing failed jobs and monitoring. Scope (how many customers/projects were affected) is not specified publicly on the incident page. [4]
What we add / What’s new
We separate incident resolution (a fix/revert stops the bleeding) from recovery completion (backlogs drained, retries converged, no silent partials). Treating these as different closure states prevents false “done” declarations when the user-visible world is still catching up. [2], [3], [5]
We frame “region not receiving traffic” as a boundary failure rather than a generic outage: the system can appear healthy on average while one corridor effectively goes dark. This is where tail behavior dominates and global rollups mislead. [1], [5], [7]
We introduce a receipts mindset for operational change: when a network config change is reverted, the closure target should include checkable evidence that the pre-change routing reality is restored (not only that error rates dipped). [3], [6], [8]
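The receipts mindset can be sketched as a single post-revert verification gate. This is a minimal illustration under assumed inputs, not a real tooling API: the config snapshots, traffic ratio, and error-rate numbers are hypothetical placeholders for whatever your routing layer and monitoring actually expose.

```python
def revert_verified(live_config: dict, baseline_config: dict,
                    traffic_ratio: float, error_rate: float,
                    baseline_error_rate: float,
                    traffic_tol: float = 0.05,
                    error_tol: float = 0.01) -> bool:
    """A revert counts as verified only when the pre-change routing reality
    is demonstrably restored: the live config matches the pre-change
    snapshot AND corridor traffic and error rates are back at baseline.
    A dip in error rates alone is not a receipt.
    (All parameter names and tolerances are illustrative assumptions.)"""
    return (live_config == baseline_config
            and abs(traffic_ratio - 1.0) <= traffic_tol
            and error_rate <= baseline_error_rate + error_tol)
```

For example, a corridor still at 40% of its baseline traffic fails the gate even if its error rate has already dipped back to normal.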
Why it matters
For teams running production workloads, a regional traffic failure is not just downtime; it is uncertainty about state: partial requests, retries, queued jobs, and stale dashboards can produce “split reality” where some users see normal behavior while others experience a stalled or inconsistent system. That uncertainty is often more expensive than the headline outage window. [5], [7], [10]
Hypotheses
H1 — The dominant risk is “recovery lag masquerading as resolution”: once the revert happens, the system looks normal, but backlog draining and retries create extended user-visible inconsistency. [2] Falsifier: show that after the “resolved” timestamp, job queues, retry volume, and user-facing error rates return to baseline quickly and remain stable without secondary waves.
H2 — “Traffic not received” events concentrate in a narrow change-set boundary: a small configuration shift can invalidate an entire corridor, while aggregate health metrics still look acceptable. [1] Falsifier: show that broad health indicators reliably detected and bounded the impact early, with no corridor-specific blind spot.
H3 — The most reliable operational control is revert discipline plus post-revert verification receipts: you do not just revert; you prove the corridor is back (routes, error budgets, queue convergence). [3] Falsifier: demonstrate that revert-only response (without explicit verification artifacts) yields equal long-run stability and fewer repeat incidents than revert + receipts.
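H1’s falsifier can be made operational with a simple convergence check on post-resolution telemetry: recovery is complete only when backlog and retry metrics return to baseline and stay there, with no secondary wave. A minimal sketch, assuming queue-depth and retry-volume samples taken at fixed intervals after the “resolved” timestamp; the window size and 10% tolerance are illustrative, not from the source.

```python
def converged(samples, baseline, tolerance, window=3):
    """True if the last `window` samples all sit within `tolerance` of
    `baseline` -- i.e., the metric returned to normal and stayed there."""
    if len(samples) < window:
        return False
    return all(abs(s - baseline) <= tolerance for s in samples[-window:])

def recovery_complete(queue_depth, retry_volume, q_base, r_base):
    """Resolution != recovery: both the job backlog and the retry volume
    must converge to baseline before declaring the incident closed."""
    return (converged(queue_depth, q_base, tolerance=0.1 * max(q_base, 1))
            and converged(retry_volume, r_base, tolerance=0.1 * max(r_base, 1)))
```

A secondary wave (queue depth spiking again after an initial drain) keeps the last-window check failing, so the incident cannot be falsely declared recovered.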
Where it flips (regimes)
Conclusions can invert across four regime boundaries: (1) workloads dominated by queued/background jobs vs. real-time request/response flows; (2) systems with strong idempotency and safe retries vs. systems where retries amplify harm; (3) organizations with explicit region-isolation assumptions vs. those that treat regions as interchangeable; and (4) environments where users can quickly reroute or shift regions vs. those where they cannot. [5], [6], [7], [10]
Math behind it (without math)
The inference trap is averaging: if “most regions recovered,” you can conclude the system is “fine,” while a minority corridor remains effectively offline or inconsistent. Governance requires corridor-aware signals: you track not just error rates, but the shape of recovery—queue convergence, retry decay, and whether state becomes consistent again for the affected cohort. [1], [2], [5], [7]
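The averaging trap can be shown with toy numbers (invented for illustration, not taken from the incident): nine healthy regions and one dark corridor still produce a reassuring global success rate, while the corridor-aware minimum exposes the outage.

```python
# Ten regions: nine healthy, one corridor receiving no successful traffic.
# Each tuple is (successful_requests, total_requests); numbers are invented.
regions = {f"us-{i}": (9_900, 10_000) for i in range(9)}
regions["us-east-2"] = (0, 10_000)  # the dark corridor

ok = sum(s for s, _ in regions.values())
total = sum(t for _, t in regions.values())
global_success = ok / total  # ~0.89 overall -- the rollup looks "mostly fine"

# Corridor-aware view: the worst region, not the average, drives governance.
worst = min(s / t for s, t in regions.values())  # 0.0 -- one corridor is dark
```

Governance on `worst` (or per-region error budgets) fails loudly here; governance on `global_success` quietly passes.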
Closure target
“Settled/done” means a checkable bundle exists showing: (a) the reverted configuration state is confirmed in effect, (b) region-level traffic and error rates match pre-incident baselines, (c) backlog and retries converged (requeue complete, no long-tail stragglers), (d) customer-visible invariants hold again (no partial reads, stale state, or hidden throttling), and (e) a bounded change-control rule exists to prevent recurrence of the same corridor break. [2], [6], [7], [8]
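The closure target above can be encoded so that “settled/done” is a computed property of evidence rather than a judgment call. A sketch under assumed names: the five fields mirror items (a)–(e), but how each receipt is produced is left to your incident process.

```python
from dataclasses import dataclass

@dataclass
class ClosureBundle:
    config_confirmed: bool     # (a) reverted configuration verified in effect
    traffic_at_baseline: bool  # (b) region traffic/errors match pre-incident
    backlog_converged: bool    # (c) requeue complete, no long-tail stragglers
    invariants_hold: bool      # (d) no partial reads, stale state, throttling
    change_rule_added: bool    # (e) change-control rule to prevent recurrence

    def settled(self) -> bool:
        """'Settled/done' requires every receipt, not a quorum of them."""
        return all((self.config_confirmed, self.traffic_at_baseline,
                    self.backlog_converged, self.invariants_hold,
                    self.change_rule_added))
```

Any single missing receipt (for example, a backlog that has not yet drained) keeps the incident open, which is exactly the resolution-vs-recovery distinction this piece argues for.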
References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026.
[2] R. Figurelli, “Time-to-Okay (TTO) as an Agile Health Metric: Measuring Recovery Across Multitime Clocks,” preprint, 2026.
[3] R. Figurelli, “The Rollback Ratio: Turning Irreversibility into Managed Risk,” preprint, 2026.
[4] Supabase, “Outage in US-East-2 (Ohio),” incident report, 2026.
[5] B. Beyer, C. Jones, J. Petoff, and N. Murphy, Site Reliability Engineering, book, 2016.
[6] NIST, “Computer Security Incident Handling Guide (SP 800-61 Rev. 2),” guideline, 2012.
[7] NIST, “Information Security Continuous Monitoring (ISCM) for Federal Information Systems and Organizations (SP 800-137),” guideline, 2011.
[8] NIST, “Security and Privacy Controls for Information Systems and Organizations (SP 800-53 Rev. 5),” standard, 2020.
[9] ISO/IEC, “Information Security Incident Management,” standard, 2016.
[10] AXELOS, “ITIL 4 Foundation,” framework, 2019.
