When Citations Don’t Stop Hallucinations, They Just Move the Failure
Multi-Turn Grounding Cascades
License: CC BY 4.0
On February 1, 2026, an arXiv preprint introduced HalluHard, a multi-turn hallucination benchmark designed to measure groundedness as conversations grow and early errors cascade. It includes 950 seed questions across high-stakes domains (legal cases, research questions, medical guidelines, and coding) and operationalizes groundedness by requiring inline citations for factual assertions. The paper also describes a judging pipeline that uses web search to retrieve evidence and assess whether the cited material actually supports the generated content, including when sources are full-text documents like PDFs. It reports that hallucinations remain substantial even with web search, with the strongest configuration still showing around a 30% hallucination rate.
This matters because “adding citations” is often treated as a universal fix for reliability. The benchmark reframes the problem: citations can be present while grounding is still broken, because the failure mode shifts from “no evidence” to “evidence that doesn’t actually support what was said,” especially as dialogues lengthen and earlier mistakes propagate.
What’s new here is the focus on multi-turn compounding and citation verification as first-class evaluation objects. Instead of asking whether a model can produce plausible answers with references, the benchmark asks whether the system can keep the evidence contract intact over time, as context grows, turns accumulate, and the temptation to paper over uncertainty with confident prose increases.
Where this flips depends on the regime: whether the task domain has crisp, easily retrievable sources or ambiguous, interpretive material; whether later turns depend tightly on earlier claims (amplifying cascades); whether retrieval returns authoritative sources or noisy, partial ones; and whether the judge can reliably parse and validate full-text evidence at scale. In these regimes, “with web search” can look like progress while still leaving a large surface for unsupported claims to slip through.
Closure target: this becomes settled when models are evaluated not only on “having citations,” but on citation support quality across turns—reporting breakdowns by turn position and domain, and showing that improvements persist when retrieval conditions and source formats vary. The decisive outcome is a stable reduction in unsupported claims across long dialogues, not a cosmetic increase in citation density.
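The breakdowns the closure target asks for (unsupported-claim rate by turn position and by domain) reduce to grouping per-claim judgments by a key and computing a rate per group. A minimal sketch, assuming a per-claim record shape (`turn`, `domain`, `supported`) that is an invention here, not HalluHard’s actual schema:

```python
from collections import defaultdict

def breakdown(judgments: list[dict], key) -> dict:
    """Unsupported-claim rate per group, where `key` extracts the group
    (e.g. turn index or domain) from a per-claim judgment record."""
    totals = defaultdict(int)  # claims per group
    bad = defaultdict(int)     # unsupported claims per group
    for j in judgments:
        k = key(j)
        totals[k] += 1
        if not j["supported"]:
            bad[k] += 1
    return {k: bad[k] / totals[k] for k in totals}
```

Usage: `breakdown(judgments, lambda j: j["turn"])` gives the per-turn curve, where a rate that climbs with turn index is the cascade signature, and `breakdown(judgments, lambda j: j["domain"])` gives the per-domain table; a real reduction in hallucinations should flatten the first curve, not just raise citation counts.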
