When Agents Are Graded on the Trace, They Learn to Perform for the Judge
Math Machine: State-Diff Contract Machine
Release: Open (no DOI)
License: CC BY 4.0
On February 11, 2026, an arXiv preprint introduced Agent-Diff, a benchmarking framework for evaluating LLM agents on enterprise API tasks: the agent executes code against real service interfaces, but inside a sandboxed environment that standardizes execution and scoring. The paper’s key evaluation move is a “state-diff contract”: success is defined by whether the expected change in environment state occurs, not by matching the agent’s step-by-step trace. It reports results for nine LLMs on 224 tasks spanning enterprise workflows.
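The core idea can be sketched as a pure comparison between environment snapshots. This is an illustrative reconstruction, not the paper's actual API: the function names, the flat key-value state model, and the ticket example are all assumptions for the sketch.

```python
# Illustrative sketch of a state-diff contract (hypothetical names,
# not the Agent-Diff paper's implementation). State is modeled as a
# flat dict; the contract is the exact set of expected changes.

def state_diff(before: dict, after: dict) -> dict:
    """Map each changed, added, or removed key to its (old, new) pair."""
    return {
        key: (before.get(key), after.get(key))
        for key in before.keys() | after.keys()
        if before.get(key) != after.get(key)
    }

def contract_satisfied(before: dict, after: dict, expected: dict) -> bool:
    """Success iff the observed state change matches the contract exactly,
    regardless of how many steps or which tools the agent used."""
    return state_diff(before, after) == expected

# Example: a ticket-closing task judged only on outcome.
before   = {"ticket_42.status": "open",   "ticket_42.assignee": "alice"}
after    = {"ticket_42.status": "closed", "ticket_42.assignee": "alice"}
expected = {"ticket_42.status": ("open", "closed")}
```

Under this contract, a one-line solution and a ten-tool-call solution score identically if they land the same diff; a long, plausible-looking trace that leaves the ticket open scores zero.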
Why this matters: many agent benchmarks unintentionally reward “looking busy”—long tool traces, plausible rationales, or near-miss sequences that feel correct but don’t actually complete the task. If you’re trying to use agents for real workflows—tickets, files, messages, calendar actions—what you care about is the outcome landing cleanly, not the narrative of how it got there.
What’s new here (as a Mathine technology demo) is the contract shift: the benchmark tries to separate process from outcome so that agents can’t win by imitating the shape of a solution. That makes the evaluation closer to how reliability is experienced in production: a workflow either changes state correctly, or it doesn’t.
Where this flips (regimes): whether the sandbox truly controls for tooling and environment differences; whether “state change” captures the user’s real intent versus a shallow proxy; whether tasks reward minimal, correct actions or allow “accidental success”; and whether access to documentation or scaffolding changes agent behavior in ways that don’t generalize to real deployments. In these regimes, the same agent can look strong on paper and still feel brittle in the wild.
Closure target: this becomes settled when agent benchmarks routinely publish outcome contracts that are hard to game, report failure modes where agents achieve partial but misleading state changes, and demonstrate that leaderboard improvements persist when environments, prompts, and documentation access are varied—so “better agent” means better outcomes, not better performances.
