Context Retrieval Is the Real Bottleneck for Coding Agents

Mathine: Context Retrieval Bottleneck Machine
Release: Open (no DOI)
License: CC BY 4.0

On February 5, 2026 (revised February 11), an arXiv preprint introduced ContextBench, a benchmark designed to evaluate how coding agents retrieve and use code context during issue resolution—not just whether they eventually pass tests. It reports 1,136 issue-resolution tasks from 66 repositories across 8 programming languages, each paired with human-annotated “gold contexts,” and evaluates multiple frontier LLMs and coding-agent scaffolds using recall/precision-style measures over what the agent actually looked at.
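To make the recall/precision framing concrete, here is a minimal sketch of what such measures could look like, assuming gold context and the agent's retrieved context are both modeled as sets of code locations. The function names, the (file, function) representation, and the edge-case conventions are illustrative assumptions, not ContextBench's actual definitions.

```python
# Hypothetical context-retrieval metrics, assuming both the human-annotated
# "gold context" and the agent's retrieved context are sets of
# (file, function) locations. ContextBench's exact definitions may differ.

def context_recall(retrieved: set, gold: set) -> float:
    """Fraction of gold-context locations the agent actually inspected."""
    return len(retrieved & gold) / len(gold) if gold else 1.0

def context_precision(retrieved: set, gold: set) -> float:
    """Fraction of inspected locations that were actually relevant."""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0

# Toy example: the agent finds one of two gold locations,
# and two of its three lookups are noise.
gold = {("src/parser.py", "parse_expr"), ("src/lexer.py", "next_token")}
retrieved = {("src/parser.py", "parse_expr"),
             ("src/utils.py", "log"),
             ("README.md", None)}

print(context_recall(retrieved, gold))     # 0.5
print(context_precision(retrieved, gold))  # ~0.33
```

An agent that "retrieves everything" drives recall toward 1.0 while precision collapses, which is exactly the noise-drowning failure mode the benchmark is meant to surface.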

This matters because many “agent wins” in software look like magic until you ask the uncomfortable question: did the agent find the right parts of the codebase, or did it stumble into a passing patch? In real engineering, time is lost not in typing code but in locating the one file, function, or dependency that makes the bug make sense.

What’s new here (as a Mathine technology demo) is the separation of outcome from process. ContextBench makes visible a failure mode that end-to-end success rates can hide: agents may retrieve lots of relevant material but drown it in noise, or they may inspect the right code and still fail to consolidate it into the final change. The result is a kind of “looks busy, learns little” behavior that feels productive in traces but brittle in practice.

Where this flips (regimes): whether the codebase is small and familiar versus large and modular; whether the agent’s tools encourage broad scraping or targeted lookup; whether evaluation rewards final pass rates or intermediate retrieval quality; and whether cost constraints penalize “retrieve everything” strategies that inflate tokens and latency. In some regimes, high recall is a feature; in others, it becomes a fog machine.

Closure target: this becomes settled when agent reports routinely include retrieval quality alongside task success—showing that improvements persist across repo sizes and languages, and demonstrating measurable gains in both precision (less noise) and utilization (retrieved context actually used), rather than only higher pass rates.

https://arxiv.org/abs/2602.05892

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).