Capability Thresholds Get Fuzzy: Publishing Risk Reports Before It’s Certain

Math Machine: Threshold Governance Gate Machine
License: CC BY 4.0
Source: https://www.anthropic.com/rsp-updates?939688b5_page=2&web=1

Facts
On Feb 10, 2026, the source updates a “responsible scaling” governance page. It states that the policy requires an affirmative risk case once a model crosses a specified capability threshold, and it reports a determination that a named frontier model (Claude Opus 4.6) does not cross that threshold. The source also states that confidently ruling out the threshold is becoming increasingly difficult and more subjective than desired. It reports a commitment to publish “sabotage risk” reports for future frontier models exceeding an earlier capability level, and it notes that an external-facing sabotage risk report was prepared for Claude Opus 4.6 and is being published. User-visible symptom: the public record shifts from “trust our internal call” to “here is an external-facing risk artifact,” with stated uncertainty about the precision of the threshold call; details beyond what is written are not publicly specified.

What we add / What’s new
This is not “a safety page update.” It’s a disclosure that the gate itself is noisy: the hardest part is no longer measuring the model, but deciding when the measurement is “confident enough” to trigger stricter governance. That is a field problem (admissibility and authority), not a model problem. [1]–[3]

We add a practical distinction: threshold calls (yes/no) versus threshold evidence (receipts). The source moves toward evidence by publishing external-facing risk documentation, which is closer to auditability than a bare determination. [2]–[3]
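The call-versus-evidence distinction can be read as a difference in data shape: a bare determination carries only a label, while an evidentiary determination carries receipts a third party can replay. A minimal Python sketch of that distinction (all class and field names here are illustrative assumptions, not from the source):

```python
from dataclasses import dataclass, field

@dataclass
class ThresholdCall:
    """A bare yes/no determination: asserts a conclusion, carries no receipts."""
    model: str
    crosses_threshold: bool

@dataclass
class ThresholdEvidence:
    """A determination plus the artifacts that make it auditable."""
    call: ThresholdCall
    tests_run: list = field(default_factory=list)          # observable tests behind the call
    uncertainty_notes: list = field(default_factory=list)  # what is hard to rule out, and why
    falsifiers: list = field(default_factory=list)         # conditions that would reverse the call

    def is_auditable(self) -> bool:
        # The call is only as trustworthy as the receipts behind it.
        return bool(self.tests_run and self.uncertainty_notes and self.falsifiers)

bare = ThresholdCall(model="frontier-model", crosses_threshold=False)
backed = ThresholdEvidence(
    call=bare,
    tests_run=["capability-eval-suite"],
    uncertainty_notes=["ruling out the threshold is partly subjective near the line"],
    falsifiers=["a replicated eval score above the documented cutoff"],
)
print(backed.is_auditable())  # True: the determination ships with replayable receipts
```

The design choice mirrors the text: the same `ThresholdCall` can exist with or without evidence, and only the evidentiary form can pass an audit.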

We also add a failure mode to watch: when thresholds become subjective, organizations tend to drift toward “plausible deniability by ambiguity,” unless they formalize what would falsify their own call and what evidence must exist before the gate can be trusted. [1]–[2]

Why it matters
If the public governance story becomes “we cannot confidently tell when we crossed the line,” then both overreaction and underreaction become likely: overreaction slows deployment unnecessarily; underreaction ships capability without the safeguards the policy is meant to trigger. And the ambiguity is greatest at the exact moment leadership needs crisp, checkable reasons to tighten controls.

Hypotheses
H1 — As capability thresholds become more subjective, the system will increasingly rely on secondary artifacts (external-facing risk reports) to maintain legitimacy, because the binary threshold label no longer carries enough trust. [2]–[3] Falsifier: If threshold determinations remain stable and consistently supported by clear, non-subjective tests without additional artifacts, this is wrong.
H2 — The highest governance risk is not the wrong threshold call; it is unclear falsifiers for the call, which turns the gate into a narrative rather than a decision rule. [1]–[2] Falsifier: If the published framework (and its updates) consistently includes concrete conditions that would reverse a determination, this is wrong.
H3 — Publishing external-facing risk reports will reduce downstream “silent disagreement” inside organizations (policy, engineering, safety) by forcing a shared evidentiary baseline—but only if the report includes explicit boundaries of uncertainty. [2]–[3] Falsifier: If internal disputes and inconsistent application increase after publication (more fragmentation, not less), this is wrong.

Where it flips (regimes)
Conclusions invert across (1) stable capability scaling versus abrupt capability jumps, (2) internal-only governance versus governance with external-facing artifacts, (3) thresholds backed by objective tests versus thresholds requiring expert judgment, and (4) low-stakes deployment contexts versus contexts where the cost of being wrong is catastrophic. In the first regime, a simple gate can be “good enough”; in the latter regimes, “good enough” becomes ungovernable without receipts.

Math behind it (without math)
The inference trap is treating a threshold as a clean line when the measurement is noisy and partly subjective. When a decision depends on a line, small uncertainty around that line can dominate outcomes: the same evidence can justify opposite actions depending on how you weigh ambiguity. Publishing a risk report is one way to compress that ambiguity into a shared artifact—but it only works if the report is explicit about what is known, what is uncertain, and what would change the conclusion. [1]–[3]
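The noise-near-a-line point can be made concrete with a toy simulation: the same fixed capability, measured repeatedly with modest noise, yields an essentially deterministic call far from the threshold but a coin-flip-like mix of calls near it. A hedged sketch (the numbers are illustrative, not from the source):

```python
import random

def fraction_above(true_capability, threshold, noise_sd, n_measurements, seed=0):
    """Simulate repeated noisy measurements of one fixed capability and
    return the fraction of measurements that land above the threshold."""
    rng = random.Random(seed)
    above = sum(
        1 for _ in range(n_measurements)
        if rng.gauss(true_capability, noise_sd) >= threshold
    )
    return above / n_measurements

# Same noise, same threshold; only the distance to the line changes.
far_below = fraction_above(true_capability=0.50, threshold=0.80, noise_sd=0.05, n_measurements=10_000)
near_line = fraction_above(true_capability=0.78, threshold=0.80, noise_sd=0.05, n_measurements=10_000)

print(far_below)  # essentially 0: six standard deviations away, the gate is deterministic
print(near_line)  # roughly a third of measurements cross: the same evidence supports opposite calls
```

This is the sense in which small uncertainty around the line dominates outcomes: nothing about the model changed between the two cases, only its distance from the threshold relative to measurement noise.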

Closure target
This topic is “settled/done” when the public governance record makes the gate auditable: (a) the threshold decision is tied to observable tests or documented expert-judgment criteria, (b) the uncertainty is bounded (what is hard to rule out, and why), (c) there are explicit falsifiers that would reverse the determination, and (d) external-facing artifacts are consistent over time (no shifting standards without a documented reason). The closure is not “a confident statement,” but a stable, replayable basis for policy-triggered safeguards.
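The four closure criteria (a)–(d) can be read as a checklist an auditor applies to the public record. A minimal encoding in Python (the record schema is a hypothetical assumption, not a published format):

```python
def gate_is_auditable(record: dict) -> bool:
    """Check the four closure criteria (a)-(d) against a governance record.
    The field names below are illustrative, not a real published schema."""
    criteria = [
        bool(record.get("tests_or_judgment_criteria")),  # (a) decision tied to documented tests/criteria
        bool(record.get("bounded_uncertainty")),         # (b) what is hard to rule out, and why
        bool(record.get("falsifiers")),                  # (c) conditions that would reverse the call
        bool(record.get("consistent_over_time")),        # (d) no undocumented shifts in standards
    ]
    return all(criteria)

record = {
    "tests_or_judgment_criteria": ["documented expert-judgment rubric"],
    "bounded_uncertainty": "threshold is hard to rule out near the line",
    "falsifiers": ["replicated result above the documented cutoff"],
    "consistent_over_time": True,
}
print(gate_is_auditable(record))  # True: all four criteria are present
```

A record missing any one criterion fails the check, which matches the closure target: a confident statement alone is not enough; the basis must be replayable.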

References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026. https://doi.org/10.5281/zenodo.18574428
[2] R. Figurelli, “Multiple Wisdoms: The Line Between Can and Should,” preprint, 2025. https://doi.org/10.5281/zenodo.18057785
[3] R. Figurelli, “Beyond DIKW: A Future-Proof Model of Computable Wisdom for Agentic AI,” preprint, 2026. https://doi.org/10.5281/zenodo.18238392

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).