Context Rot Creates Hidden Subfields: Why Long-Context Benchmarks Become Field Contracts
Math Machine: Context-Rot Regime Boundary Machine
License: CC BY 4.0
Source: https://arxiv.org/abs/2602.07962
Facts
The source (February 8, 2026) introduces LOCA-bench, a benchmark for long-running language agents whose context grows dynamically as the agent explores and acts. It argues that existing long-context tests mostly measure single-step retrieval, while real agent work suffers “context rot” as histories expand. LOCA-bench controls environment state growth to regulate context length (in principle extending indefinitely while keeping task meaning fixed), evaluates agents as “model + scaffold” (including context-management strategies), and reports that performance generally degrades as the environment grows more complex, while stronger context management can substantially improve success rates.
What we add / What’s new
LOCA-bench is not only a “long context” benchmark. It is a regime generator: as context grows, the benchmark induces distinct measurable slices (subfields) where “done” becomes harder to certify and errors become harder to recover from. [1]–[10]
This is exactly where pooled means create false closure: an agent can look better on average while collapsing in the long-horizon corridors that dominate real operational risk. Under the LLF lens, those corridors are field contracts becoming visible through instrumentation. [1]–[5]
In other words, “context length” is not a scalar stress test; it is a contract axis that should force reporting around worst-subfield + dispersion + flip points, not a single leaderboard mean. [1], [2], [4], [7], [8]
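A minimal sketch of what reporting along that contract axis could look like. All agent names, corridor labels, and success rates below are hypothetical; the point is the shape of the report: worst-subfield, dispersion, and flip points, computed per growth corridor instead of pooled.

```python
from statistics import mean, pstdev

# Hypothetical per-corridor success rates, corridors ordered by context growth.
corridors = ["short", "medium", "long", "extreme"]
scores = {
    "agent_a": [0.99, 0.97, 0.80, 0.30],  # strong early, collapses late
    "agent_b": [0.76, 0.75, 0.74, 0.73],  # flat across corridors
}

def report(agent):
    s = scores[agent]
    return {
        "pooled_mean": round(mean(s), 3),   # the usual leaderboard number
        "worst_subfield": min(s),           # the corridor that dominates risk
        "dispersion": round(pstdev(s), 3),  # spread across corridors
    }

def flip_points(a, b):
    """Corridors where the ordering of two agents inverts vs the previous corridor."""
    sa, sb = scores[a], scores[b]
    order = [sa[i] > sb[i] for i in range(len(corridors))]
    return [corridors[i] for i in range(1, len(order)) if order[i] != order[i - 1]]

for agent in scores:
    print(agent, report(agent))
print("flips:", flip_points("agent_a", "agent_b"))
```

In this toy data, agent_a wins the pooled mean (0.765 vs 0.745) yet loses the ordering at the “extreme” corridor and bottoms out at 0.30, which is exactly the signal a single leaderboard mean hides.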
Why it matters
The most expensive agent failures happen late: after many turns, many tool calls, and many partial decisions. If benchmarks don’t force explicit regimes for growth, admissibility, and closure, teams can ship agents that look reliable on short runs but drift into compounding mistakes, stale assumptions, and unverifiable “done” states as context expands.
Hypotheses
H1 — Long-context agent benchmarks will predict deployment failures better when they report worst-subfield under growth corridors (early vs late horizon) rather than pooled means. [1] Falsifier: pooled means predict operational failures as well as worst-subfield across declared growth regimes.
H2 — Model ordering will flip across growth corridors because different scaffolds and memory rules dominate late-horizon behavior, making “one ranking” non-portable. [2] Falsifier: stable ranking across short vs long corridors with low dispersion and no meaningful flips.
H3 — The dominant driver of false closure under extreme context growth will be weak receipt discipline under budget (what was checked vs assumed), not raw model capability. [3] Falsifier: audit sampling shows receipt completeness stays high under long horizons and does not explain false closure rates.
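H3 is testable with a cheap audit of the kind its falsifier presupposes: sample closure claims from run logs under a fixed budget and measure what fraction carried a verifiable receipt. The sketch below is an assumption, not LOCA-bench’s schema; the log structure and field names (`claim`, `evidence`) are invented for illustration.

```python
import random

# Hypothetical run log: each closure claim is either backed by a checked
# receipt ("checked") or silently assumed ("assumed").
run_log = (
    [{"claim": f"step_{i}", "evidence": "checked"} for i in range(70)]
    + [{"claim": f"step_{i}", "evidence": "assumed"} for i in range(70, 100)]
)

def receipt_completeness(log, budget, seed=0):
    """Estimate receipt completeness by auditing a random sample under a budget."""
    rng = random.Random(seed)
    sample = rng.sample(log, min(budget, len(log)))
    checked = sum(1 for claim in sample if claim["evidence"] == "checked")
    return checked / len(sample)

print(receipt_completeness(run_log, budget=20))
```

If this estimate stays high on long-horizon runs while false-closure rates climb anyway, H3 is falsified; if it degrades with horizon length, receipt discipline is the suspect rather than raw capability.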
Where it flips (regimes)
Conclusions invert across: (1) short-horizon vs long-horizon task segments, (2) clean histories vs histories containing stale tool results and partial failures, (3) memory/scaffold choices that compress context vs those that preserve it, and (4) tasks where “done” is obvious vs tasks where “done” requires explicit checkable closure under limited verification effort.
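The four inversion axes above can be declared explicitly, so every reported result is tagged with the regime cell it came from. The axis and value names below are illustrative labels for those four axes, not identifiers from the benchmark.

```python
from itertools import product

# Illustrative regime axes mirroring the four inversion axes above.
regime_axes = {
    "horizon": ["short", "long"],
    "history": ["clean", "stale_and_partial_failures"],
    "scaffold": ["compressing", "preserving"],
    "closure": ["obvious_done", "checkable_under_budget"],
}

# Results should be reported per cell of this grid, never pooled across it.
regimes = [dict(zip(regime_axes, combo)) for combo in product(*regime_axes.values())]
print(len(regimes), "declared regimes")  # 2 * 2 * 2 * 2 = 16
```

Declaring the grid up front is what makes a later claim like “the ranking flips on long horizons with stale histories” checkable rather than anecdotal.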
Math behind it (without math)
As context grows, the agent’s decision surface becomes a moving target: it must choose what to keep, what to forget, and what to treat as evidence. That creates hidden slices where a system is effectively solving different problems. A pooled score averages over those slices, masking late-horizon collapse. Contract-first evaluation makes the slices explicit and forces the benchmark to reveal dispersion, worst corridors, and where the ordering flips—signals that matter for authorization. [1], [2], [4], [7], [8], [10]
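The masking effect can be shown with two invented score profiles that pool to the same number: one uniform across corridors, one collapsing late. The corridors and values are hypothetical.

```python
from statistics import mean

# Hypothetical success rates over growth corridors, short -> extreme.
uniform    = [0.70, 0.70, 0.70, 0.70]
collapsing = [0.95, 0.95, 0.70, 0.20]

# Identical pooled score (up to float noise)...
assert abs(mean(uniform) - mean(collapsing)) < 1e-9
# ...but very different worst corridors, which is what authorization cares about:
print("worst corridor, uniform:   ", min(uniform))
print("worst corridor, collapsing:", min(collapsing))
```

Both profiles post a 0.70 leaderboard mean; only the worst-subfield number separates the system that holds up on long horizons from the one that collapses there.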
Closure target
This case is “settled” when LOCA-bench-style results are promotion-grade: (a) growth corridors are declared as first-class regimes, (b) reporting centers worst-subfield + dispersion (not just means), (c) flip points across corridors are published, and (d) receipt completeness is audit-sampled under a declared verification budget, showing these signals predict late-horizon operational failures better than pooled averages. [1]–[4], [10]
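Conditions (a)–(d) can be expressed as a mechanical gate over a benchmark report. The report structure and field names below are hypothetical; the sketch only shows that the closure target is checkable, not how LOCA-bench serializes results.

```python
# Hypothetical report structure; the gate encodes conditions (a)-(d) above.
def promotion_grade(report: dict) -> bool:
    return (
        bool(report.get("declared_corridors"))           # (a) growth corridors declared
        and "worst_subfield" in report                   # (b) worst-subfield reported...
        and "dispersion" in report                       #     ...with dispersion, not just means
        and "flip_points" in report                      # (c) flip points published
        and report.get("receipt_audit", {}).get("budget") is not None  # (d) audited under a budget
    )

report = {
    "declared_corridors": ["short", "long", "extreme"],
    "worst_subfield": 0.41,
    "dispersion": 0.18,
    "flip_points": ["long"],
    "receipt_audit": {"budget": 50, "completeness": 0.83},
}
print(promotion_grade(report))  # True
```

A report missing any of the four fields fails the gate, which is the intended behavior: pooled means alone should never be promotion-grade.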
References
[1] R. Figurelli, “Benchmark Convergence As Operational Confirmation Of Large Language Fields (LLFs),” preprint, 2026.
[2] R. Figurelli, “Large Language Fields (LLFs): The Invisible Layer Above LLMs,” preprint, 2025.
[3] R. Figurelli, “Field Definition Language (FDL): A Proposal to Evolve APIs into Governed Fields,” preprint, 2025.
[4] W. Zeng, Y. Huang, and J. He, “LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth,” preprint, 2026.
[5] R. Wang et al., “AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition,” preprint, 2026.
[6] H. M. Pysklo, A. Zhuravel, and P. D. Watson, “Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation,” preprint, 2026.
[7] N. F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts,” preprint, 2023.
[8] C.-P. Hsieh et al., “RULER: What’s the Real Context Size of Your Long-Context Language Models?,” preprint, 2024.
[9] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” preprint, 2023.
[10] J. Wei et al., “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents,” preprint, 2025.
