Perfect Scores Can Still Teach the Wrong Lesson
Math Machine: Task-Decomposition Reliability Machine
License: CC BY 4.0
On February 8, 2026, an arXiv preprint presented Minitap, a multi-agent system reported to achieve 100% success on the AndroidWorld benchmark (116 tasks), exceeding a cited human baseline (80%). The paper attributes single-agent failures to issues like context pollution, silent text input failures, and repetitive action loops, and describes a six-agent decomposition plus deterministic post-validation and “meta-cognitive” cycle detection; it also reports ablation gains for decomposition, verified execution, and meta-cognition.
Why this matters: "perfect accuracy" can be read as "agents are solved," when the deeper message is about failure detection and recovery. If text entry can fail silently, or loops can repeat without an escape hatch, then reliability is not primarily about clever planning; it is about building mechanisms that notice when the world did not change as expected and then force a state correction, rather than letting the agent keep narrating progress.
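The "notice and correct" pattern can be sketched as a wrapper that refuses to move on until the observed state confirms the action landed. This is a minimal illustration, not the paper's implementation; `perform`, `read_state`, and `expected` are hypothetical names.

```python
# Sketch of post-action state verification: retry, then fail loudly,
# instead of letting a silently dropped action pass as progress.
def verified_step(perform, read_state, expected, max_retries=2):
    """Run an action and proceed only once the observed state confirms
    it took effect; otherwise retry, then surface a hard failure."""
    for _ in range(max_retries + 1):
        perform()
        state = read_state()
        if expected(state):
            return state
    raise RuntimeError("action did not take effect; state unchanged")

# Toy usage: a text field that silently drops the first input attempt.
field = {"text": "", "drops": 1}

def type_hello():
    if field["drops"] > 0:
        field["drops"] -= 1          # simulate a silent input failure
    else:
        field["text"] = "hello"

result = verified_step(type_hello,
                       lambda: dict(field),
                       lambda s: s["text"] == "hello")
print(result["text"])
```

The point of the sketch is the contract, not the retry count: the caller never sees a "success" that the state does not corroborate.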
What’s new here (as a Math Machine technology demo) is the shift from a single agent trying to do everything to a system that treats mistakes as structured events: isolate cognition, verify the effect of each action against device state, and explicitly detect cycles. That is a different reliability contract than "the model will reason its way out"; it is "the system will not proceed unless the state confirms the step landed."
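Explicit cycle detection can be as simple as remembering recent (state, action) signatures and flagging a repeat, since an agent that re-issues the same action from the same state is looping without changing the world. A minimal sketch, assuming device state can be serialized to a string; the interface is hypothetical, not the paper's:

```python
# Sketch of loop detection over a sliding window of (state, action) pairs.
from collections import deque

class CycleDetector:
    def __init__(self, window=8):
        self.recent = deque(maxlen=window)

    def observe(self, state_sig: str, action: str) -> bool:
        """Return True if this exact state/action pair recurred recently,
        i.e. the agent is repeating itself without changing the world."""
        key = (state_sig, action)
        looping = key in self.recent
        self.recent.append(key)
        return looping

det = CycleDetector()
first = det.observe("home_screen", "tap_settings")    # novel pair
repeat = det.observe("home_screen", "tap_settings")   # same pair again
print(first, repeat)
```

A real system would hash a richer state representation and escalate (replan, abort, ask for help) when `observe` fires, rather than merely logging it.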
Where this flips (regimes): whether the benchmark permits silent failures that a real UI would also permit; whether decomposition improves robustness or merely adds budgeted redundancy; whether the verification checks generalize beyond AndroidWorld’s task distribution; and whether the system remains stable under latency, partial observability, or UI variation. In those regimes, “perfect on the benchmark” can coexist with fragility in the wild.
Closure target: this becomes settled when the reported reliability gains replicate under controlled perturbations—UI variation, injected silent failures, delayed state propagation, and adversarial loop triggers—and when the success profile is reported not only as pass/fail, but as a breakdown of which failure classes were prevented by verification and cycle-detection rather than by added compute or orchestration.
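The injected-silent-failure perturbation described above can be prototyped with a tiny harness: wrap an environment so a fraction of actions silently do nothing, then compare completion rates with and without verified execution. Everything here is a hypothetical sketch, not an evaluation of the actual system.

```python
# Sketch of a perturbation test: actions are silently dropped with some
# probability; the "verified" mode retries until the action lands.
import random

def run_episode(actions, apply_action, verified=False,
                drop_rate=0.3, rng=None):
    """Execute a scripted action list; each attempt is silently dropped
    with probability drop_rate. Verified mode retries up to 3 times."""
    rng = rng or random.Random(0)
    applied = 0
    for act in actions:
        done = False
        tries = 3 if verified else 1
        for _ in range(tries):
            if rng.random() >= drop_rate:
                apply_action(act)
                done = True
                break
        applied += done
    return applied / len(actions)

log = []
rate_naive = run_episode(list(range(100)), log.append, verified=False)
log.clear()
rate_verified = run_episode(list(range(100)), log.append, verified=True)
print(rate_naive, rate_verified)
```

Reporting the gap between the two rates, per failure class, is exactly the kind of breakdown that would show whether verification, rather than extra compute, is doing the work.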
