Budget Growth Can Break LLM Routing
Math Machine: Routing Collapse Machine
Release: Open (no DOI)
License: CC BY 4.0
On February 3, 2026, an arXiv preprint described a failure mode in LLM routers: as a user’s cost budget increases, routers can start defaulting to the most capable (and most expensive) model—even in cases where cheaper models would already suffice. The paper calls this “routing collapse,” and attributes it to an objective mismatch: routers may learn to predict scalar scores, while routing requires discrete model comparisons that can flip with small errors. It proposes a decision-aware router that learns rankings and reports cost reductions at a fixed performance level.
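The objective mismatch can be made concrete with a toy sketch (not from the paper; the model names and numbers are illustrative). A router that regresses scalar quality scores can have small absolute error on both candidates, yet the discrete comparison between them flips whenever the errors straddle a narrow quality gap:

```python
# Hypothetical two-model router that compares predicted quality scores.
def route_by_scores(pred_cheap: float, pred_expensive: float) -> str:
    """Pick the model with the higher predicted quality score."""
    return "cheap" if pred_cheap >= pred_expensive else "expensive"

# Illustrative true qualities: the cheap model already suffices (0.80 vs 0.82).
true_cheap, true_expensive = 0.80, 0.82

# A regression error of only 0.03 per score, small in absolute terms,
# is enough to flip the discrete comparison in either direction.
print(route_by_scores(true_cheap + 0.03, true_expensive - 0.03))  # → cheap
print(route_by_scores(true_cheap - 0.03, true_expensive + 0.03))  # → expensive
```

This is why a router can have good average score-prediction accuracy while still making unstable decisions: the decision depends only on the sign of a small difference, which the regression objective never directly optimizes.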
This matters because routing is supposed to be the invisible efficiency layer that makes AI cheaper without making it worse. If the router collapses upward as budgets rise, the system stops being “cost-aware” and becomes a spend escalator: users pay more and often don’t notice until invoices or latency patterns make it obvious.
What’s new here (as a Math Machine technology demo) is the shift from “routing is optimization” to “routing is a stability problem.” A router can look strong in average quality metrics while failing at the exact contract it was hired to enforce: keep cheaper models in play whenever they are good enough. In other words, the breakdown isn’t in the models—it’s in the decision boundary that chooses between them.
Where this flips depends on the regime: how the router is trained (predicting scores versus learning comparisons), how tight the performance differences are between candidate models, how budgets are expressed and enforced, and how much distribution shift exists between training prompts and real traffic. In these regimes, a router can appear to “improve” on aggregate metrics while quietly losing its ability to discriminate, and therefore over-select the expensive option.
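A minimal simulation makes the budget interaction visible. This is not the paper’s setup; the costs, qualities, and the naive routing rule below are assumptions chosen for illustration. Once the budget covers the expensive model, a score-predicting router with small noise starts selecting it on a large fraction of prompts, even though the cheap model suffices throughout:

```python
import random

random.seed(0)

# Illustrative per-query costs for a hypothetical two-model pool.
COST = {"cheap": 1.0, "expensive": 10.0}

def naive_route(pred_cheap: float, pred_expensive: float, budget: float) -> str:
    """Naive budget-aware rule: if the budget covers the expensive model,
    take whichever scores higher; otherwise fall back to the cheap one."""
    if budget >= COST["expensive"] and pred_expensive > pred_cheap:
        return "expensive"
    return "cheap"

def expensive_share(budget: float, n: int = 10_000) -> float:
    """Fraction of simulated queries routed to the expensive model."""
    true_cheap, true_expensive = 0.80, 0.82  # cheap model is good enough
    picks = 0
    for _ in range(n):
        # Small Gaussian score-prediction error on each candidate.
        pc = true_cheap + random.gauss(0, 0.05)
        pe = true_expensive + random.gauss(0, 0.05)
        if naive_route(pc, pe, budget) == "expensive":
            picks += 1
    return picks / n

for b in (1, 5, 10, 50):
    print(b, round(expensive_share(b), 2))
```

Below the expensive model’s cost, its share is zero by construction; at or above it, the share jumps to well over half, and raising the budget further never brings it back down. That step-and-stick pattern is the toy version of routing collapse.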
Closure target: this becomes settled when routing evaluations routinely report not only aggregate quality and cost, but also the router’s utilization curve across budgets—showing that smaller models remain meaningfully used when they suffice, and that improvements persist across prompt distributions and model updates rather than collapsing into “always pick the biggest model.”
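The utilization-curve diagnostic described above is straightforward to compute from routing logs. A hedged sketch, with a fabricated log and a hypothetical `utilization_curve` helper: for each budget level, report the fraction of traffic each model actually served.

```python
from collections import Counter

def utilization_curve(routing_log):
    """routing_log: list of (budget, chosen_model) pairs.
    Returns {budget: {model: fraction_of_requests_at_that_budget}}."""
    by_budget = {}
    for budget, model in routing_log:
        by_budget.setdefault(budget, []).append(model)
    curve = {}
    for budget, models in sorted(by_budget.items()):
        counts = Counter(models)
        total = len(models)
        curve[budget] = {m: c / total for m, c in counts.items()}
    return curve

# Fabricated log for illustration: (budget, model chosen by the router).
log = [(1, "small"), (1, "small"), (5, "small"), (5, "large"),
       (20, "large"), (20, "large"), (20, "large"), (20, "small")]
print(utilization_curve(log))
```

A healthy router keeps the small model’s share meaningfully above zero at high budgets whenever it suffices; collapse shows up as that share trending to zero as the budget grows.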
