State-Diff Benchmarks: When Agents “Sound Done” but the World Didn’t Change


Math Machine: State-Diff Outcome Contract Machine
License: CC BY 4.0
Source: https://arxiv.org/abs/2602.11224

Facts
The source reports a preprint submitted on February 11, 2026 that introduces a benchmark framework for evaluating tool-using AI agents on enterprise software workflows. It describes a “state-diff contract” approach that defines success by whether the expected change in the environment state occurred (rather than by matching a specific action trace), and a sandboxed scripting layer used to execute code against real API interfaces, including Slack, Box, Linear, and Google Calendar. The source reports results for nine models across 224 tasks, plus ablation experiments on access to API documentation. The user-visible failure mode in this setting is typically that the agent says it completed the task while the underlying workspace state is unchanged or only partially changed; details beyond what is written are not specified publicly. (arXiv)

What we add / What’s new
We treat “agent success” as a contract about the world, not a story about the agent. If success is defined by state change, then verification becomes simpler, cheaper, and more audit-friendly: you check the diff, not the narrative. [1], [2], [4]
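
“Check the diff, not the narrative” can be made concrete with a small sketch. The snapshot shapes, key names, and the ticket example below are hypothetical illustrations, not the source paper’s implementation: success is decided by comparing a pre-action and post-action state snapshot against an expected post-state, ignoring anything the agent says.

```python
from typing import Any


def state_diff(before: dict[str, Any], after: dict[str, Any]) -> dict[str, Any]:
    """Keys whose values changed (or appeared/disappeared) between snapshots."""
    keys = set(before) | set(after)
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


def contract_satisfied(before: dict[str, Any], after: dict[str, Any],
                       expected: dict[str, Any]) -> bool:
    """Success iff every expected post-state value is observed in the diff."""
    diff = state_diff(before, after)
    return all(k in diff and diff[k][1] == v for k, v in expected.items())


# Hypothetical task: "close ticket T-7". The agent may narrate success
# either way; only the workspace snapshots decide the verdict.
before = {"T-7.status": "open"}
after_real = {"T-7.status": "closed"}   # the world actually moved
after_noop = {"T-7.status": "open"}     # "sounds done", nothing changed
expected = {"T-7.status": "closed"}
```

Under this contract, `contract_satisfied(before, after_real, expected)` holds while the no-op run fails, regardless of how convincing the agent’s transcript is.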

We separate process plausibility from outcome truth. Many evaluations reward clean-looking steps (or convincing explanations) even when the environment didn’t move; state-diff contracts force the score to follow the world. [1], [3], [4]

We add a governance boundary: whenever actions touch shared systems (tickets, documents, calendars, messages), the admissible unit is not “a helpful attempt,” but “a verified change or an explicit non-change.” This reduces silent partials and false closure. [2], [3], [8]
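
The governance boundary above can be sketched as a verdict taxonomy. The names and decision rule are assumptions for illustration, not from the source: an action on a shared system is admissible only as a verified change or an explicit non-change; a narrated “done” without a matching diff is classified as inadmissible.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED_CHANGE = "verified_change"          # diff observed and matches
    EXPLICIT_NON_CHANGE = "explicit_non_change"  # agent declared no action; state unchanged
    INADMISSIBLE = "inadmissible"                # narrated success without a matching diff


def admit(claimed_done: bool, diff_matches: bool, state_unchanged: bool) -> Verdict:
    """Admissibility rule: only verified changes or explicit non-changes pass."""
    if diff_matches:
        return Verdict.VERIFIED_CHANGE
    if not claimed_done and state_unchanged:
        return Verdict.EXPLICIT_NON_CHANGE
    return Verdict.INADMISSIBLE  # includes silent partials and false closure
```

For example, an agent that claims completion while the state is unchanged (`admit(True, False, True)`) lands in `INADMISSIBLE`, which is exactly the “silent partial / false closure” class the boundary is meant to eliminate.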

Why it matters
In enterprise settings, the cost of an agent mistake is often not one wrong message—it’s time lost to rework, duplicated actions, and trust erosion when humans must repeatedly confirm what actually happened. Outcome-defined evaluation pushes systems toward behavior that is verifiable and dependable, not merely fluent. [8], [10]

Hypotheses
H1 — Outcome contracts (state-diff success) will expose a large hidden failure class: agents that narrate completion while producing no verified state change. [1] Falsifier: show that narrative-based success metrics predict real state change with near-perfect precision in the same tasks.
H2 — Benchmarks that score process rather than state systematically overestimate reliability in multi-step tool use, especially when partial success looks “close enough” in text but is operationally wrong. [2] Falsifier: show that process-based scoring matches state-based scoring across tasks with negligible divergence and no concentrated tail failures.
H3 — The most robust deployment control is “fail-closed on missing diff”: if the expected state change is not observable, the agent must escalate or abstain rather than claiming success. [3] Falsifier: show that allowing “soft success” (no diff, but plausible narrative) yields equal or better operational outcomes under audit sampling.
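
H3’s fail-closed control can be sketched as a resolution policy. The function and outcome labels below are hypothetical, assumed for illustration: an unobservable state escalates, a partial or mismatched diff abstains, and only a fully matching diff may be reported as success.

```python
from typing import Any, Optional


def resolve(observed_diff: Optional[dict[str, Any]],
            expected: dict[str, Any]) -> str:
    """Fail-closed policy: a missing or partial diff never becomes 'success'."""
    if observed_diff is None:
        return "escalate"   # state unobservable -> route to human review
    if all(observed_diff.get(k) == v for k, v in expected.items()):
        return "success"    # expected post-state fully confirmed
    return "abstain"        # partial or mismatched diff -> no claim of done
```

Note the asymmetry: a plausible narrative with no observable diff resolves to `escalate`, never `success`, which is the opposite of the “soft success” policy H3’s falsifier would have to vindicate.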

Where it flips (regimes)
Conclusions invert across (1) tasks where state is fully observable vs tasks where state is delayed, cached, or permission-gated, (2) systems where actions are idempotent vs systems where retries create duplicates, (3) workflows with single-owner artifacts vs shared artifacts with concurrent edits, and (4) low-stakes personal tasks vs high-stakes organizational tasks where “saying done” without proof is unacceptable. [8], [10]

Math behind it (without math)
If you score agents by how reasonable their steps sound, you reward the ability to produce a convincing path—especially when the environment is noisy and the truth is hard to observe. A state-diff contract changes the incentive: the only thing that counts is whether the world reached the target state. That single shift reduces a common evaluation illusion—high average “looks-correct” behavior masking a tail of “no-change” failures that dominate real operational cost. [1], [4], [8]

Closure target
“Settled/done” means you can show: (a) a clear definition of the target state for each task, (b) an observable diff that confirms the target state was reached (or a recorded reason it cannot be observed), (c) explicit handling for partial diffs (what counts as incomplete and what happens next), and (d) audit sampling that reproduces the same verdicts without trusting the agent’s narrative. [2], [3], [4], [8]
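
Closure condition (d), audit sampling, can be sketched as re-deriving verdicts from snapshots alone. The record layout and sampling scheme below are assumptions for illustration: if any recorded verdict leaned on the agent’s narrative rather than the state, re-verification on a sample will diverge from the record.

```python
import random
from typing import Any


def verifier(before: dict[str, Any], after: dict[str, Any],
             expected: dict[str, Any]) -> bool:
    """Outcome verdict from state alone: every expected post-value must hold."""
    return all(after.get(k) == v for k, v in expected.items())


def audit(tasks: list[dict], sample_rate: float = 0.2, seed: int = 0) -> bool:
    """Re-derive verdicts on a random sample of task records from snapshots only.

    Returns True iff every sampled recorded verdict is reproduced without
    consulting the agent's narrative.
    """
    rng = random.Random(seed)
    sample = [t for t in tasks if rng.random() < sample_rate]
    return all(
        verifier(t["before"], t["after"], t["expected"]) == t["recorded_verdict"]
        for t in sample
    )


# Hypothetical records: the second one recorded "done" with no state change.
tasks = [
    {"before": {"s": "open"}, "after": {"s": "closed"},
     "expected": {"s": "closed"}, "recorded_verdict": True},
    {"before": {"s": "open"}, "after": {"s": "open"},
     "expected": {"s": "closed"}, "recorded_verdict": True},  # false closure
]
```

Auditing the full set (`audit(tasks, sample_rate=1.0)`) flags the second record, because the narrative-backed verdict cannot be reproduced from the diff.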

References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026.
[2] R. Figurelli, “A Unified Field Theory (UFT) for SLMs and LLMs: From Latent Capability to Governed Subfields,” preprint, 2026.
[3] R. Figurelli, “The Agency Collapse Hypothesis: Decoupling Narrative from Execution in LLM Agents,” preprint, 2026.
[4] H. M. Pysklo, A. Zhuravel, and P. D. Watson, “Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation,” preprint, 2026. (arXiv)
[5] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” paper, 2023.
[6] T. Schick et al., “Toolformer: Language Models Can Teach Themselves to Use Tools,” paper, 2023.
[7] C. E. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” benchmark, 2024.
[8] NIST, “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” framework, 2023.
[9] ISO/IEC, “Systems and Software Quality Requirements and Evaluation (SQuaRE) — ISO/IEC 25010,” standard, 2011.
[10] B. Beyer, C. Jones, J. Petoff, and N. Murphy, Site Reliability Engineering, book, 2016.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).