State-Diff Beats Narratives: A Zero-Trust Closure Pattern for Tool-Using Agents
Math Machine: Zero-Trust State-Diff Closure Machine
Source: https://arxiv.org/pdf/2602.11224
Facts
The source reports (submitted February 11, 2026) a benchmark/framework for tool-using agents that execute code against real enterprise APIs inside a controlled sandbox. It defines task success via a state-diff contract: completion is whether the expected environment-state change occurred between start and end snapshots. It reports an evaluation across nine models and 224 tasks (525 samples via randomized environments) and highlights a recurring symptom: agents can sound correct while failing to produce the correct final state. It also reports that agent-side context management strategies can materially improve outcomes. Details beyond what the source states are not publicly specified.
Thesis
This is not “just a better benchmark.” It is a closure architecture: a way to force an agent system to prove it changed the world correctly, not that it told a convincing story. In regulated and high-stakes workflows, that is the difference between “peer review of narratives” and “zero-trust closure with receipts.” [1], [3]
What we add / What’s new
- Field Network (subfield→field→metafield→overfield→metaoverfield): subfield = tool calls + state deltas; field = workflow contract; metafield = evaluation regime + admissibility; overfield = production reliability; metaoverfield = institutional trust in automation. [3], [7]
- GeoIT: the Circle of Realization only closes when stakeholders share the same checkable artifact: state change receipts, not explanations. [1]
- TTOkay: “okay-to-operate” becomes a lower-bounded, worst-slice state-diff guarantee, not a mean score or a judge preference. [1], [2]
- Multitime: “done” depends on clocks (tool latency, retries, audit replay, vendor drift); state-diff contracts can bind “done” across clocks. [1], [7]
- ReceiptBench / LLF / LSF link: language is one signal; tool traces + state deltas + invariants form a multi-signal contract that resists “fluent but wrong.” [3]
- W = I ^ C: intelligence (I) produces plans; consciousness (C) is governance that refuses closure without checkable predicates and replay. [1]
- State-diff is a verification surface you can put behind a zero-trust tribunal (who decides, what evidence counts, how replay works). [1]
- Agent improvements that don’t improve replayable state-diff closure are “evaluation theater,” not reliability. [2], [3]
Zero-Trust closure stack
Zero-trust for this agent benchmark means we do not trust the agent’s narrative or “looks correct” traces; we trust only checkable closure. “Done” is decided by a tribunal (the agreed closure authority), using a replayable verifier that evaluates a state predicate, supported by minimal receipts (start/end state hashes, diff summary, invariant checks, and replay inputs). The system must also preserve identity under drift (what still counts as “the same task” as tools and policies evolve), and it must provide a typed HOLD outcome when closure cannot be certified (e.g., missing state surface, nondeterminism, insufficient observability), so non-closure is actionable rather than rhetorical. State-diff contracts are powerful here because they give verification something objective to run: an outcome state predicate. [1]
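The closure stack above can be sketched as a minimal verifier. This is an illustrative sketch under assumptions, not the source's implementation: the function names (`verify`, `state_hash`), the flat-dict state model, and the receipt fields are mine; the PASS/FAIL/typed-HOLD structure follows the text.

```python
import hashlib
import json

def state_hash(state):
    # Canonical JSON hash so two replays of the same snapshot agree byte-for-byte.
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def diff(start, end):
    # Keys whose values were added, removed, or modified between snapshots.
    keys = set(start) | set(end)
    return {k: (start.get(k), end.get(k)) for k in keys if start.get(k) != end.get(k)}

def verify(start, end, required, forbidden):
    """Return (verdict, receipt). Verdicts: PASS, FAIL, or a typed HOLD."""
    if start is None or end is None:
        # Closure cannot be certified without both snapshots; HOLD is actionable, not rhetorical.
        return "HOLD-OBSERVABILITY", {}
    d = diff(start, end)
    receipt = {
        "s_start": state_hash(start),
        "s_end": state_hash(end),
        "diff_summary": sorted(d),
    }
    touched = sorted(set(forbidden) & set(d))           # forbidden set B was touched
    missing = sorted(k for k, v in required.items() if end.get(k) != v)  # ΔS* not achieved
    if touched:
        return "FAIL", {**receipt, "forbidden_touched": touched}
    if missing:
        return "FAIL", {**receipt, "required_missing": missing}
    return "PASS", receipt
```

The narrative never enters `verify`: the agent's explanation is not an input, only the two snapshots and the contract are.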
cMth lens: survivability under collapse
Under collapse pressure (more tools, longer runs, more vendor drift), many evaluation claims lose semantic integrity: the benchmark “means” something different than what production needs. The cMth question is: what portion of the claim survives regime flips? A survivable core here is “ΔS* happened and forbidden changes did not happen,” because it is structurally checkable; what collapses is any narrative-only notion of success (“the trace looked good”). [2], [3]
hPhy lens: heuristic physics, not universal law
Treat “state-diff success” as a heuristic law that holds inside declared regimes: instrumented sandboxes, known state surfaces, and stable diff functions. Outside that regime (hidden side effects, unmodeled state, non-deterministic APIs), the heuristic weakens unless you expand the state surface and invariant set. The method stays honest by saying where the heuristic breaks and what receipts would be required to restore it. [4]
Cub³ lens: cross-domain convergence check
Cub³ asks for three projections and checks coherence:
- Compute projection: can we replay the verifier cheaply and deterministically under budget?
- Math projection: are closure predicates explicit (ΔS*, forbidden set B, dispersion) and stable under regime flips?
- Physics projection: do small perturbations (latency, retries, partial failures) create “turbulence” that breaks closure unless invariants constrain the system?
If these projections disagree, the claim is not converged and must be bounded or placed on HOLD. [3], [4]
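The three-projection gate can be sketched as a tiny decision function; the names (`Projections`, `converge`) and the three verdict strings are illustrative assumptions layered on the rule stated above.

```python
from typing import NamedTuple

class Projections(NamedTuple):
    compute_ok: bool   # verifier replays cheaply and deterministically under budget
    math_ok: bool      # closure predicates explicit and stable under regime flips
    physics_ok: bool   # invariants damp perturbation-driven divergence

def converge(p: Projections) -> str:
    # Cub3 gate: all three projections must agree before a claim closes.
    votes = [p.compute_ok, p.math_ok, p.physics_ok]
    if all(votes):
        return "CONVERGED"
    if any(votes):
        return "BOUNDED"   # partial agreement: the claim must be scoped down
    return "HOLD"          # no projection passes: do not certify
```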
Why it matters
Enterprise agents fail in the worst way when they are persuasive but wrong—especially when “wrong” means touching the wrong resource, leaking side effects, or leaving a workflow half-applied. Outcome verification (state-diff + invariants) turns that risk into something that can be measured, budgeted, and governed—so deployment decisions can be defended with receipts rather than stories. [1], [7]
Where it flips (regimes)
Conclusions invert across: (1) mocked tasks vs real interfaces (even in sandboxes), (2) trace/judge scoring vs state-predicate verification, (3) single-step tasks vs multi-step workflows where side effects accumulate, and (4) low-drift environments vs high-drift environments where tools and policies evolve between runs. [2], [7]
Math behind it (without math)
The trap is “narrative closure”: if the agent’s explanation is treated as evidence, the system will optimize for sounding right. State-diff closure flips the incentive: it makes correctness a property of the world’s end state (plus invariants), so fluency cannot substitute for verified outcome. This is exactly the zero-trust instinct applied to agents: never trust, always verify—under an explicit budget. [1], [3]
Math behind it (with math)
Success = 𝟙[ Diff(S_start, S_end) ⊇ ΔS* ∧ Diff(S_start, S_end) ∩ B = ∅ ] [3], [7]
- S_start, S_end: environment state at start and end.
- Diff(·): the computed state difference (what changed).
- ΔS*: required state changes for the task (the contract).
- B: forbidden side-effects (must-not-change invariants).
Rationale: this matches operational truth. You only "did the work" if required changes happened and forbidden changes did not. It also scales: you can expand ΔS* and B as regimes get harder, and you can sample verification under budget.
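The predicate above can be stated directly over change sets. Representing the diff as a set of changed state elements is a simplifying assumption; the source's actual diff representation is not specified.

```python
def success(changed, required, forbidden):
    # Success = 1[ Diff(S_start, S_end) ⊇ ΔS*  ∧  Diff(S_start, S_end) ∩ B = ∅ ]
    # changed:   the computed diff, as a set of state elements that changed
    # required:  ΔS*, the contracted changes
    # forbidden: B, the must-not-change invariant set
    return required <= changed and not (changed & forbidden)
```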
Hypotheses
H1 — A zero-trust tribunal that uses state-diff + invariants will reduce false closure versus narrative- or judge-based scoring for tool agents. [1] Falsifier: Narrative/judge scoring predicts production wrong-touch incidents and rollback events as well as state-diff invariants across comparable tasks.
H2 — Under collapse pressure (longer runs, more tools, more drift), survivability comes from explicit closure predicates (ΔS*, B) and replay receipts, not from larger context windows. [2] Falsifier: Increasing context alone (without stronger predicates/receipts) maintains worst-slice outcome correctness under the same drift regimes.
H3 — Cub³-style convergence (compute + math + physics projections) predicts which benchmarks will generalize to production reliability better than a single headline score. [3] Falsifier: Single headline scores predict cross-regime production reliability as well as multi-projection coherence checks.
Millennium-problem alignment (and why it matters here)
Operationally, the "verify under budget" reality aligns with P vs NP as an analogy: it is easy to claim success, harder to certify it across many tasks and regimes without scalable verifiers; we do not claim formal reductions. A second lens is Navier–Stokes existence and smoothness: agent workflows can behave like turbulent systems, where small perturbations (retries, latency, partial failures) amplify into large outcome divergence unless invariants constrain the dynamics. In coevolution terms, as agents scale, governance must evolve from narratives to ledgers. P + NP = 1 becomes a disclosure rule: either you pay the verification cost (P-like, replayable receipts) or you accept unverified space (NP-like assumptions), but you must record where that trade was made, by regime and time. [1], [7], [9]
Multitime + TTOkay (when ‘done’ depends on which clock you trust)
Key clocks include: user clock (time-to-outcome), agent clock (steps/iterations), tool clock (latency, rate limits), retry/backlog clock (queued actions, partial failures), vendor clock (API behavior changes), and audit clock (replay and evidence). TTOkay fails when closure follows the agent clock (“finished”) while the audit clock cannot replay state-diff and invariants, or when retries silently change outcomes after a “success” label was emitted. [1], [7]
Closure target (zero-trust, expanded)
"Settled/done" requires:
- Declared subfields: state surface, diff function, invariant set B, sandbox assumptions, drift assumptions.
- Explicit closure predicates: ΔS* achieved; B untouched; dispersion bounded across seeds; worst-slice passes.
- A receipt schema: task ID, tribunal policy, verifier version, ΔS*, B, S_start/S_end hashes, diff summary, tool-call log pointer, timestamps, and replay instructions.
- Declared identity + drift: what makes two runs "the same task" as tools evolve.
- A typed HOLD path (e.g., HOLD-STATE-SURFACE, HOLD-DRIFT, HOLD-NONDETERMINISM, HOLD-OBSERVABILITY) with the next evidence needed.
- Budgeted, worst-slice verification: a sampling plan per task family, with dispersion reported (not just means). [1], [2], [3]
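The receipt schema can be sketched as a typed record. Field names follow the schema listed in the closure target; the types, the example values used below, and the `hold` field's encoding are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClosureReceipt:
    # Field names follow the schema in the text; types are illustrative assumptions.
    task_id: str
    tribunal_policy: str
    verifier_version: str
    delta_s_star: dict            # required state changes (ΔS*)
    forbidden: list               # must-not-change invariants (B)
    s_start_hash: str
    s_end_hash: str
    diff_summary: list
    tool_call_log: str            # pointer to the tool-call trace (hypothetical URI)
    timestamps: dict
    replay_instructions: str
    hold: Optional[str] = None    # e.g. "HOLD-DRIFT" when closure is not certified
```

A receipt with `hold` set is a non-closure outcome that names the next evidence needed, rather than a silent failure.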
References
[1] R. Figurelli, “Zero-Trust Science: A New Architecture for Scientific Closure (Beyond Peer Review),” Preprint, 2026.
[2] R. Figurelli, “Collapse Mathematics (cMth): A New Frontier in Symbolic Structural Survivability,” Preprint, 2026.
[3] R. Figurelli, “Cub³: A New Heuristic Architecture for Cross-Domain Convergence,” Preprint, 2026.
[4] R. Figurelli, “Heuristic Physics: Foundations for a Semantic and Computational Architecture of Physics,” Preprint, 2026.
[5] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” Proc. ICLR, 2023.
[6] C. Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv preprint, 2024.
[7] B. Beyer, C. Jones, J. Petoff, and N. R. Murphy, Site Reliability Engineering, O’Reilly Media, 2016.
[8] M. Kleppmann, Designing Data-Intensive Applications, O’Reilly Media, 2017.
[9] C. Fefferman, “Existence and Smoothness of the Navier–Stokes Equation,” Clay Mathematics Institute, 2000.
[10] NIST, “AI Risk Management Framework (AI RMF 1.0),” 2023.
