When Citations Don’t Stop Hallucinations, They Just Move the Failure
Multi-Turn Grounding Cascades
License: CC BY 4.0
On February 1, 2026, an arXiv preprint introduced HalluHard, a multi-turn hallucination benchmark designed to measure groundedness as conversations grow and early errors cascade. It includes 950 seed questions across high-stakes domains (legal cases, research questions, medical guidelines, and coding) and operationalizes groundedness by requiring inline citations for factual assertions. The paper also describes a judging pipeline that uses web search to retrieve evidence and assess whether the cited material actually supports the generated content, including when sources are full-text documents like PDFs. It reports that hallucinations remain substantial even with web search, with the strongest configuration still showing around a 30% hallucination rate.
This matters because “adding citations” is often treated as a universal fix for reliability. The benchmark reframes the problem: citations can be present while grounding is still broken, because the failure mode shifts from “no evidence” to “evidence that doesn’t actually support what was said,” especially as dialogues lengthen and earlier mistakes propagate.
What’s new here is the focus on multi-turn compounding and citation verification as first-class evaluation objects. Instead of asking whether a model can produce plausible answers with references, the benchmark asks whether the system can keep the evidence contract intact over time, as context grows, turns accumulate, and the temptation to paper over uncertainty with confident prose increases.
Where this flips depends on the regime: whether the task domain has crisp, easily retrievable sources or ambiguous, interpretive material; whether later turns depend tightly on earlier claims (amplifying cascades); whether retrieval returns authoritative sources or noisy, partial ones; and whether the judge can reliably parse and validate full-text evidence at scale. In these regimes, “with web search” can look like progress while still leaving a large surface for unsupported claims to slip through.
Closure target: this becomes settled when models are evaluated not only on “having citations,” but on citation support quality across turns—reporting breakdowns by turn position and domain, and showing that improvements persist when retrieval conditions and source formats vary. The decisive outcome is a stable reduction in unsupported claims across long dialogues, not a cosmetic increase in citation density.
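The breakdowns the closure target asks for (unsupported-claim rate by turn position and by domain) reduce to grouping per-claim judgments by a key and computing a rate per group. A minimal sketch, assuming a per-claim record shape (`turn`, `domain`, `supported`) that is an invention here, not HalluHard’s actual schema:

```python
from collections import defaultdict

def breakdown(judgments: list[dict], key) -> dict:
    """Unsupported-claim rate per group, where `key` extracts the group
    (e.g. turn index or domain) from a per-claim judgment record."""
    totals = defaultdict(int)  # claims per group
    bad = defaultdict(int)     # unsupported claims per group
    for j in judgments:
        k = key(j)
        totals[k] += 1
        if not j["supported"]:
            bad[k] += 1
    return {k: bad[k] / totals[k] for k in totals}
```

Usage: `breakdown(judgments, lambda j: j["turn"])` gives the per-turn curve, where a rate that climbs with turn index is the cascade signature, and `breakdown(judgments, lambda j: j["domain"])` gives the per-domain table; a real reduction in hallucinations should flatten the first curve, not just raise citation counts.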
