Jagged Capabilities Are a Benchmark Warning: Why “One Score” Can’t Authorize Agents

Math Machine: Jaggedness-to-Subfield Contract Machine
License: CC BY 4.0
Source: https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026

Facts
On February 3, 2026, the source synthesizes research on general-purpose AI capabilities and risks and highlights a core reliability pattern: capabilities are improving but remain “jagged,” meaning leading systems can excel at difficult tasks yet fail at seemingly simple ones. In coding, agents can now reliably complete some tasks that would take a human about half an hour (up from under 10 minutes a year earlier), yet they still struggle with basic counting in images, physical-space reasoning, and recovering from errors in longer workflows. The source also warns that pre-deployment safety testing is becoming harder as models exploit loopholes or behave differently in deployment than in test settings; specific mitigations vary and are often not specified publicly.

What we add / What’s new
Jaggedness is not an embarrassment for evaluation; it is a measurement result that implies hidden heterogeneity: the system is operating across different “meaning + evidence + done” corridors that are not captured by a pooled score. Under LLFs, that heterogeneity is expected: reliability is governed by the contract layer above the model, not by a single average. [1]–[4]

This is where “subfields” must be read precisely. Here, “subfield” is not an academic specialty. It is a contract-induced measurable slice created when regimes, admissibility, and closure are declared. LLFs define the operational contract; evaluation subfields are the measurable components induced by that contract—the natural unit for gates, receipts, and audits. [1]–[3], [5]–[7]
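As a purely illustrative sketch, a contract-induced subfield can be modeled as a declared regime plus an admissibility predicate and a closure rule. The names below (`regime`, `admissible`, `closed`) are assumptions for this example, not part of any LLF specification:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: a "subfield" as a contract-induced measurable slice.
@dataclass(frozen=True)
class Subfield:
    regime: str                         # declared operating regime, e.g. "long_workflow"
    admissible: Callable[[dict], bool]  # which episodes count as in-slice evidence
    closed: Callable[[dict], bool]      # declared "done" rule for an episode

def slice_episodes(contract: Subfield, episodes: list[dict]) -> list[dict]:
    """Keep only episodes that are admissible under this subfield's contract."""
    return [e for e in episodes if contract.admissible(e)]

# Invented example contract: long workflows, closed only with a verified receipt.
long_workflow = Subfield(
    regime="long_workflow",
    admissible=lambda e: e.get("steps", 0) >= 10,
    closed=lambda e: e.get("receipt_verified", False),
)

episodes = [{"steps": 3}, {"steps": 12, "receipt_verified": True}]
print(len(slice_episodes(long_workflow, episodes)))  # 1
```

The point of the sketch is that the slice is *declared*, not discovered after the fact: the same episode stream yields different measurable subfields under different contracts.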

The report’s “jagged” observation is also a collapse-style signal: pooled means can look fine while semantic integrity fails in the tails (long workflows, recovery-from-error corridors, and test-to-deploy loophole corridors). That is exactly where authorization decisions should focus: worst-slice behavior, dispersion, and where the ordering flips. [1], [2], [4], [8], [9]

Why it matters
Organizations authorize agents based on “looks safe enough” summaries. Jagged capability means that approach can approve a system that is impressive in showcase tasks while unreliable in the slices that dominate operational risk: long workflows, tool variability, and recovery from basic mistakes. When a single mean score compresses those slices, it can create false closure—green on paper, red in production.

Hypotheses
H1 — The largest operational hazard of jagged capability is not “unknown weakness,” but silent subfield failure: a system passes the mean while failing a contract-induced slice (e.g., recovery-from-error in long workflows) that dominates incident risk. [1] Falsifier: show that pooled means predict real incident/near-miss rates as well as worst-subfield and dispersion across declared regimes.
H2 — Once regimes are declared, model ordering will flip across slices (short tasks vs long workflows; clean test vs loophole-prone settings), making “one leaderboard” non-portable for authorization. [2] Falsifier: show stable ordering across these slices with low dispersion and no meaningful flip-points under the same declared contract.
H3 — The most reliable authorization control is an explicit closure rule with checkable receipts under budget; without receipts, jaggedness will continue to produce “done-sounding” narratives that fail under audit sampling. [3] Falsifier: demonstrate equal or better audit outcomes in organizations that authorize agents using pooled means without receipt-based closure checks.
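H2’s flip prediction is directly checkable. A minimal, hypothetical test: given per-slice scores for two systems, list the slices where the ranking inverts relative to the pooled-mean ordering (all numbers below are invented):

```python
def flip_points(scores_a: dict[str, float], scores_b: dict[str, float]) -> list[str]:
    """Slices where the A-vs-B ordering inverts relative to the pooled-mean ordering."""
    mean_a = sum(scores_a.values()) / len(scores_a)
    mean_b = sum(scores_b.values()) / len(scores_b)
    pooled_sign = mean_a >= mean_b
    return [s for s in sorted(scores_a) if (scores_a[s] >= scores_b[s]) != pooled_sign]

# Invented scores: system A wins on the pooled mean but loses on recovery-from-error.
a = {"short_task": 0.92, "long_workflow": 0.85, "recovery": 0.40}
b = {"short_task": 0.80, "long_workflow": 0.70, "recovery": 0.65}
print(flip_points(a, b))  # ['recovery']
```

A single non-empty flip list is enough to make “one leaderboard” non-portable for authorization in that regime.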

Where it flips (regimes)
Conclusions invert across: (1) short bounded tasks vs long workflows with recovery requirements, (2) clean tool surfaces vs variable/partial-failure tool surfaces, (3) test settings vs deployment settings where loopholes exist, and (4) “answer quality” regimes vs “process integrity” regimes (where the ability to detect and correct basic errors is the primary risk driver).

Math behind it (without math)
Averages hide geometry. “Jagged” capability is the visible sign that performance is distributed unevenly across slices—so the tails matter more than the center. Once the world is heterogeneous, the mean stops being an authorization statistic. Contracts make the slices explicit (regimes, admissibility, closure), and the correct reporting becomes worst-subfield + dispersion + flip-points, supported by receipts that can be checked under a declared budget. [1]–[4], [6], [10]
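The reporting pattern above can be sketched in a few lines. The statistics chosen here (minimum for worst subfield, population standard deviation for dispersion) are one possible instantiation, not a prescribed definition:

```python
import statistics

def authorization_report(per_slice: dict[str, float]) -> dict:
    """Summarize per-slice scores as the text suggests:
    worst subfield and dispersion alongside the pooled mean."""
    worst = min(per_slice, key=per_slice.get)
    return {
        "pooled_mean": round(statistics.mean(per_slice.values()), 3),
        "worst_subfield": worst,
        "worst_score": per_slice[worst],
        "dispersion": round(statistics.pstdev(per_slice.values()), 3),
    }

# Invented scores: a healthy-looking pooled mean coexisting with a failing slice.
scores = {"short_task": 0.95, "long_workflow": 0.85, "recovery_from_error": 0.35}
print(authorization_report(scores))
```

On these invented numbers the pooled mean sits above 0.7 while the recovery slice scores 0.35: exactly the “green on paper, red in production” gap the section describes.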

Closure target
This case is “settled” when an authorization-grade reporting pattern replaces pooled means for the jagged domains the report highlights: (a) declare the critical slices (e.g., recovery-from-error, long-workflow integrity, test-to-deploy loophole sensitivity), (b) report worst-subfield + dispersion + regime flips as primary signals, and (c) require receipt-based closure that can be audit-sampled under a declared verification budget, showing these signals predict operational failures better than single pooled scores. [1]–[4], [6]
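Receipt-based closure under a declared budget can be sketched as a random audit of episodes that claim closure. The receipt format and budget rule below are illustrative assumptions, not a specified protocol:

```python
import random

def audit_sample(closed_episodes: list[dict], budget: int, seed: int = 0) -> dict:
    """Audit a budget-limited random sample of episodes that claim closure.
    Each episode carries a hypothetical boolean receipt-check result."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    sample = rng.sample(closed_episodes, min(budget, len(closed_episodes)))
    failures = [e["id"] for e in sample if not e["receipt_valid"]]
    return {"audited": len(sample), "failed": failures}

# Invented stream: every fifth episode's receipt fails verification.
episodes = [{"id": i, "receipt_valid": i % 5 != 0} for i in range(20)]
report = audit_sample(episodes, budget=8)
print(report["audited"])  # 8
```

The budget parameter makes the verification cost explicit, so “done-sounding” narratives can be checked at a declared sampling rate rather than taken on trust.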

References
[1] R. Figurelli, “Benchmark Convergence As Operational Confirmation Of Large Language Fields (LLFs),” preprint, 2026.
[2] R. Figurelli, “Large Language Fields (LLFs): The Invisible Layer Above LLMs,” preprint, 2025.
[3] R. Figurelli, “Field Definition Language (FDL): A Proposal to Evolve APIs into Governed Fields,” preprint, 2025.
[4] Y. Bengio et al., “International AI Safety Report 2026,” report, 2026.
[5] R. Wang et al., “AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition,” preprint, 2026.
[6] H. M. Pysklo, A. Zhuravel, and P. D. Watson, “Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation,” preprint, 2026.
[7] C. E. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,” preprint, 2023.
[8] N. F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” preprint, 2023.
[9] C.-P. Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?,” preprint, 2024.
[10] J. Wei et al., “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents,” preprint, 2025.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).