When Citations Don’t Stop Hallucinations, They Just Move the Failure

Math Machine: Cascade Groundedness Audit Machine
License: CC BY 4.0
Source: https://arxiv.org/abs/2602.01031

Facts
The source (dated Feb 1, 2026) introduces a multi-turn hallucination benchmark designed to test whether models keep factual claims grounded as conversations grow and early mistakes compound. It reports 950 seed questions across four high-stakes domains (legal cases, research questions, medical guidelines, and coding), and it operationalizes “groundedness” by requiring inline citations for factual assertions. The source also describes an evaluation pipeline that uses web retrieval to check whether cited material actually supports claims (including when full-text sources are PDFs). It reports that hallucinations remain substantial even with retrieval, with the strongest configuration still showing a hallucination rate of roughly 30%; additional operational details (such as exact production mitigations) are not specified publicly.

What we add / What’s new
This is not mainly about “adding citations.” Our addition is a governance translation: citations shift the failure from “no source” to “source mismatch,” where a model can cite something while still making a claim the source does not support. That is a contract problem—what counts as admissible evidence—not a formatting problem. [1]–[3]

We add a practical audit lens: multi-turn settings create “grounding cascade,” where one early unsupported claim becomes a premise that later turns treat as settled. Retrieval can then become a polishing tool (finding plausible documents) rather than a truth tool (preventing the first bad premise). [1]–[2]

We also add an operational reading of the reported ~30%: it implies that “web-enabled” does not mean “audit-ready.” If you cannot replay the chain from claim → quoted support → correct interpretation, citations can increase confidence while leaving correctness unchanged. [2]–[3]

Why it matters
Teams often treat citations as a universal fix for reliability. In reality, citations can produce a more dangerous failure: confident-looking answers that appear documented but are still wrong. This raises the cost of review (you must verify support, not just presence), and it increases the risk of silent adoption in workflows where people assume “cited” means “checked.”

Hypotheses
H1 — The dominant error class under citation requirements is “support drift”: sources are relevant to the topic but do not actually justify the specific claim being made. [1]–[2] Falsifier: If most errors are instead “no citation provided” (rather than mismatch between claim and cited support), this is wrong.
H2 — Multi-turn groundedness fails primarily through early-premise injection: one unsupported statement becomes a stable anchor that later turns rationalize with retrieval. [1]–[3] Falsifier: If removing the first-turn error does not materially improve later-turn groundedness, this is wrong.
H3 — The strongest improvements will come from evidence-custody rules (what kinds of claims require what kinds of support) rather than better retrieval alone, because retrieval without custody still permits “plausible but non-supporting” citations. [2]–[3] Falsifier: If better retrieval alone (without custody rules) eliminates claim-support mismatches, this is wrong.

Where it flips (regimes)
Conclusions invert across (1) single-turn Q&A vs multi-turn dialogue, (2) low-stakes trivia vs high-stakes domains, (3) “citation present” scoring vs “citation supports claim” scoring, and (4) short context vs long context where earlier errors can persist. In one regime, citations are a helpful hint; in another, they become a confidence amplifier for unsupported chains.
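The third regime split, “citation present” scoring versus “citation supports claim” scoring, can be made concrete with a toy comparison. This is our own construction, not the source’s metric: the claims and the boolean `cited`/`supported` labels below are invented for illustration.

```python
# Toy illustration of how the two scoring regimes diverge on the same
# set of answers. "Citation present" passes any claim that carries a
# citation marker; "citation supports claim" additionally requires the
# cited text to actually justify the assertion.
answers = [
    {"claim": "X holds", "cited": True,  "supported": True},   # cited and supported
    {"claim": "Y holds", "cited": True,  "supported": False},  # cited but unsupported
    {"claim": "Z holds", "cited": False, "supported": False},  # no citation at all
]

presence_score = sum(a["cited"] for a in answers) / len(answers)
support_score = sum(a["cited"] and a["supported"] for a in answers) / len(answers)

print(round(presence_score, 2), round(support_score, 2))  # → 0.67 0.33
```

The gap between the two numbers is exactly the “source mismatch” mass: answers that look documented under presence scoring but fail support scoring.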

Math behind it (without math)
The trap is compounding: if each turn has a small chance of inserting an unsupported premise, then over multiple turns the chance that the conversation contains at least one bad premise rises quickly. Once a bad premise exists, later turns can appear increasingly consistent while remaining wrong, because consistency is easier than re-validation. Citations don’t automatically break this loop; they can simply attach plausible paperwork to a drifting narrative. [1]–[2]
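The compounding intuition above can be written down directly: if each turn independently inserts an unsupported premise with probability p, then a conversation of n turns contains at least one bad premise with probability 1 − (1 − p)^n. The 5% per-turn rate below is an illustrative assumption, not a figure from the source.

```python
# Sketch of the compounding trap: even a small per-turn error rate
# accumulates quickly over a multi-turn conversation.
def p_any_bad_premise(p_per_turn: float, n_turns: int) -> float:
    """Probability that at least one unsupported premise appears
    somewhere in n turns, assuming independent per-turn errors."""
    return 1.0 - (1.0 - p_per_turn) ** n_turns

for n in (1, 5, 10, 20):
    print(n, round(p_any_bad_premise(0.05, n), 3))
# → 1 0.05
#   5 0.226
#   10 0.401
#   20 0.642
```

At a 5% per-turn rate, a 20-turn conversation carries a bad premise almost two times out of three, which is why later-turn consistency is such a weak signal of correctness.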

Closure target
This is “settled/done” when an evaluation (and product behavior) can demonstrate: (a) claim-level support checking (not just citation presence), (b) explicit handling of uncertainty when support is missing or ambiguous, (c) resistance to early-premise cascades (later turns do not inherit unsupported premises as facts), and (d) a replayable receipt format that lets auditors verify that each key claim is actually supported by what was cited.
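Requirement (d), a replayable receipt format, could look something like the following minimal sketch. All field names, the `Receipt` structure, and the corpus are our assumptions for illustration, not the source’s format; note that this checks only quote custody (the cited text verbatim contains the quoted support), not semantic support, which is the harder half of the problem.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    claim: str      # the factual assertion made in the answer
    source_id: str  # identifier of the cited document
    quote: str      # verbatim span said to support the claim


def replay(receipts, corpus):
    """Return receipts whose quoted support cannot be found verbatim
    in the cited document: "citation present" but not verifiable."""
    failures = []
    for r in receipts:
        text = corpus.get(r.source_id, "")
        if r.quote not in text:
            failures.append(r)
    return failures


# Hypothetical corpus and receipts: one verifiable, one fabricated quote.
corpus = {"doc-1": "The guideline recommends 5 mg daily for adults."}
receipts = [
    Receipt("Adults should take 5 mg daily.", "doc-1",
            "recommends 5 mg daily for adults"),
    Receipt("Children should take 10 mg daily.", "doc-1",
            "recommends 10 mg daily for children"),
]

print(len(replay(receipts, corpus)))  # → 1
```

Even this crude check separates “has paperwork” from “paperwork replays”: the second receipt cites a real document but quotes text that is not in it.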

References
[1] R. Figurelli, “The Subfield Collapse Hypothesis: A Unified Explanation for OOD Inversions and Syntactic Shortcuts,” preprint, 2026.
[2] R. Figurelli, “Beyond DIKW: A Future-Proof Model of Computable Wisdom for Agentic AI,” preprint, 2026.
[3] R. Figurelli, “Multiple Wisdoms: The Line Between Can and Should,” preprint, 2025.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).