Long-Horizon Agents Fail Quietly: Constraint Debt in Multi-Turn Tool Use
Math Machine: Constraint Corridor Ledger Machine
License: CC BY 4.0
Source: https://arxiv.org/abs/2602.01675

Facts
On Feb 2, 2026, the source describes a long-horizon benchmark for interactive agents in realistic travel-planning conversations, aimed at testing whether agents can enforce global requirements, coordinate many tools, and adapt to changing user inputs across long dialogues. It reports 18 curated tools, 40+ travel requirements, dialogues up to 15 user turns, and contexts that can exceed 200k tokens, with “hard” subsets designed to stress ambiguity, shifting style, feasibility changes, and iterative revisions. It reports that even advanced models reach at most 50% success on the easy split and drop below 10% on hard subsets, and it also proposes an online reinforcement-learning method (GTPO) that improves constraint satisfaction and interaction robustness in their evaluation; further operational limitations beyond what’s in the abstract are not specified publicly.

What we add / What’s new
The source is easy to misread as “models are weak at travel.” Our addition is a governance-relevant translation: this is a stress test for global constraint custody—the ability to keep the user’s non-negotiables intact while the conversation evolves, tools return messy outputs, and the plan is revised. That custody problem sits above “answer quality” and behaves like a field contract. [1]–[3]

We also add a failure taxonomy that is audit-friendly: not “wrong answer,” but “constraint debt”—small, locally plausible deviations that accumulate until the final plan violates something the user said was mandatory (budget, timing, feasibility, or policy-like requirements). This aligns with subfield collapse patterns where the system stays fluent while silently exiting the admissible region. [1]–[2]
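The constraint-debt idea can be made concrete with a minimal sketch. The names below (`Constraint`, `ConstraintLedger`, `borrow`) are illustrative assumptions, not part of the benchmark or the source: a ledger registers each non-negotiable and records every small, locally plausible deviation, flagging the moment accumulated debt crosses a hard bound.

```python
from dataclasses import dataclass, field

@dataclass
class Constraint:
    name: str
    limit: float        # hard bound the final plan must respect
    hard: bool = True   # hard constraints must never be violated

@dataclass
class ConstraintLedger:
    """Tracks 'constraint debt': small deviations accepted at each revision."""
    constraints: dict = field(default_factory=dict)
    debt: dict = field(default_factory=dict)

    def register(self, c: Constraint) -> None:
        self.constraints[c.name] = c
        self.debt[c.name] = 0.0

    def borrow(self, name: str, amount: float) -> bool:
        """Record a locally plausible deviation; return False once the
        accumulated debt violates a hard constraint."""
        self.debt[name] += amount
        c = self.constraints[name]
        return not (c.hard and self.debt[name] > c.limit)

ledger = ConstraintLedger()
ledger.register(Constraint("budget_overrun_usd", limit=100.0))
# Three revisions, each a small, individually reasonable substitution:
steps = [40.0, 35.0, 45.0]
ok = [ledger.borrow("budget_overrun_usd", s) for s in steps]
print(ok)  # the third borrowing crosses the limit: [True, True, False]
```

No single step looks like a violation; only the ledger, which persists across revisions, reveals that the plan has exited the admissible region.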

Finally, we add an interpretation discipline for the reported score drops: the hard split isn’t just “harder tasks,” it is a regime shift where the user’s rules change over time and the agent must re-justify the plan under new constraints—an environment where narrative continuity can conflict with constraint truth. [1]–[3]

Why it matters
In production, long-horizon agent failures rarely look like dramatic crashes; they look like a plan that seems coherent until the last step, when a hidden violation surfaces (or never surfaces). When success depends on keeping “global requirements” intact, the operational risk is not only wasted time—it is misplaced trust: users start delegating decisions to a system that is optimized for local plausibility rather than long-horizon admissibility.

Hypotheses
H1 — Most “hard split” failures concentrate in late-stage revisions, where the agent must preserve earlier non-negotiables while incorporating new constraints, and the plan quietly exits the admissible corridor. [1]–[2] Falsifier: If failure rates are uniform across early, mid, and late turns (no late-stage concentration), this hypothesis is wrong.
H2 — The biggest reliability gains come from making constraints first-class and persistent (explicit custody), not from improving tool-calling skill alone; otherwise, the agent will remain fluent-but-invalid. [1]–[3] Falsifier: If improving tool-call correctness alone raises constraint satisfaction as much as adding constraint-custody mechanisms does, this hypothesis is wrong.
H3 — Reported improvements from online training will be regime-bounded: they will help on stable requirement families but degrade when requirements shift style or feasibility midstream, indicating over-specialization to a fixed interaction rhythm. [2]–[3] Falsifier: If improvements remain stable (or improve) under deliberate style shifts and feasibility flips, this hypothesis is wrong.
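H1's falsifier is directly checkable from an evaluation log. The sketch below assumes only a hypothetical list of turn indices at which failures occurred; it bins them into early/mid/late thirds of the dialogue and reports the late-stage share, which H1 predicts will dominate.

```python
from collections import Counter

def phase(turn: int, total_turns: int) -> str:
    """Bucket a failure's turn index into the early, mid, or late third."""
    frac = turn / total_turns
    if frac <= 1 / 3:
        return "early"
    if frac <= 2 / 3:
        return "mid"
    return "late"

def late_stage_concentration(failure_turns, total_turns=15) -> float:
    """Share of failures that occur in the late third of the dialogue."""
    counts = Counter(phase(t, total_turns) for t in failure_turns)
    return counts["late"] / max(1, len(failure_turns))

# Hypothetical failure turn indices from an evaluation log (15-turn dialogues):
failure_turns = [3, 11, 12, 13, 14, 15, 9, 14]
share = late_stage_concentration(failure_turns)
print(round(share, 2))  # 0.75 — failures concentrate in late revisions
```

If the three buckets come out roughly uniform instead, H1 is rejected by its own falsifier.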

Where it flips (regimes)
Conclusions invert across (1) short vs long interactions (few turns vs many revisions), (2) stable vs evolving requirements (fixed preferences vs midstream changes), (3) low-tool vs high-tool settings (few external calls vs dense tool-mediated planning), and (4) “soft preferences” vs “hard constraints” (nice-to-have vs must-not-violate). In the first regime, fluency can approximate correctness; in the latter regimes, fluency becomes a liability unless constraints are actively guarded.

Math behind it (without math)
The trap is simple: local steps can each look reasonable while the global plan becomes invalid. Every revision is a chance to “borrow” against constraints—substituting a near-match for an exact requirement, or letting one tool output override an earlier promise—because the agent is rewarded for coherence and progress. Over many turns, those small borrowings compound into a constraint violation that still reads smoothly. That compounding is what makes long-horizon evaluation qualitatively different from single-shot benchmarks. [1]–[2]
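The compounding trap can be stated in a few lines. In this sketch (tolerances and function names are illustrative assumptions, not from the source), every revision passes a local plausibility check, yet the plan as a whole is invalid because the deviations all point the same way.

```python
def locally_plausible(deltas, step_tol=0.05):
    """Each revision looks fine in isolation: deviation under 5%."""
    return all(abs(d) <= step_tol for d in deltas)

def globally_valid(deltas, global_tol=0.10):
    """But the plan must stay within 10% of the original requirement overall."""
    return abs(sum(deltas)) <= global_tol

# Five revisions, each a 4% 'near-match' substitution in the same direction:
deltas = [0.04] * 5
print(locally_plausible(deltas), globally_valid(deltas))  # True False
```

A single-shot benchmark only ever sees one delta, so the local check suffices; a long-horizon benchmark sums them, which is why the two settings are qualitatively different.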

Closure target
This line of work is “settled” (for publication-grade operational use) when we can show, with checkable evidence, that an agent (a) preserves non-negotiables across revisions, (b) surfaces conflicts explicitly when constraints cannot be met, (c) maintains performance when user requirements shift midstream, and (d) produces an audit trail that lets an evaluator replay why each constraint was satisfied or knowingly violated. The closure is not “higher average score,” but a stable corridor: predictable constraint custody across the key regimes above.
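Closure condition (d), a replayable audit trail, can be sketched as an append-only log of custody decisions. The schema below (`log_decision`, `replay`, the status vocabulary) is a hypothetical illustration of what such a trail might record, not a format proposed by the source.

```python
import json

def log_decision(trail, turn, constraint, status, reason):
    """Append an audit entry recording why a constraint was satisfied,
    knowingly relaxed, or violated at this revision."""
    trail.append({"turn": turn, "constraint": constraint,
                  "status": status, "reason": reason})

def replay(trail, constraint):
    """Let an evaluator reconstruct the custody history of one constraint."""
    return [e for e in trail if e["constraint"] == constraint]

trail = []
log_decision(trail, 2, "max_budget", "satisfied", "hotel under cap")
log_decision(trail, 7, "max_budget", "relaxed", "user raised cap by $200")
log_decision(trail, 12, "max_budget", "violated",
             "flight rebooking exceeded new cap")
print(json.dumps(replay(trail, "max_budget"), indent=2))
```

The point of the replay is exactly the closure criterion: an evaluator should be able to distinguish a knowing, user-sanctioned relaxation at turn 7 from a silent violation at turn 12.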

References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026, https://doi.org/10.5281/zenodo.18574427.
[2] R. Figurelli, “Beyond DIKW: A Future-Proof Model of Computable Wisdom for Agentic AI,” preprint, 2026, https://doi.org/10.5281/zenodo.18238392.
[3] R. Figurelli, “Compositional Clock Theory: Temporal State Machines and Multitime,” Zenodo, 2026, doi: https://doi.org/10.5281/zenodo.18342870.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).