When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

Shaowen Wang, Yiqi Dong, Ruinian Chang, Tansheng Zhu, Yuebo Sun, Kaifeng Lyu, Jian Li
Institute for Interdisciplinary Information Sciences, Tsinghua University
arXiv (2025)
Factuality, Benchmark, Pretraining, RL

📝 Paper Summary

Hallucination mitigation, Robustness to spurious correlations
The paper demonstrates that spurious correlations (e.g., surname-nationality links) cause confident hallucinations that evade standard detection methods and resist mitigation strategies like refusal fine-tuning and model scaling.
Core Problem
LLMs often hallucinate by overfitting to superficial statistical associations (spurious correlations) rather than learning causal facts, producing errors that are both highly confident and consistent.
Why it matters:
  • Existing defenses rely on uncertainty (low confidence) or inconsistency to detect errors, but spurious correlations create 'confident' errors that bypass these checks
  • Common mitigation strategies like refusal fine-tuning fail when models rely on strong shortcut associations
  • These biases persist even in frontier models (GPT-5, DeepSeek-V3), threatening reliability in high-stakes domains
Concrete Example: A model might hallucinate that an individual named 'Ivanov' was born in Russia solely because of the surname suffix '-ov', ignoring the actual ground truth in its training data. Under high spurious correlation, the model consistently and confidently outputs 'Russia' instead of the correct answer.
Key Novelty
Systematic evaluation of Spurious-Correlation-Induced Hallucinations
  • Introduces a controlled synthetic framework where the correlation strength (ρ) between features (e.g., surname) and attributes (e.g., birthplace) is precisely manipulated to measure impact on hallucinations
  • Uses 'entity co-occurrence' in Wikipedia as a real-world proxy for spurious correlation to validate findings on frontier models like GPT-5 and DeepSeek-V3
  • Theoretically proves that models generalizing well via kernel learning inevitably rely on these correlations, making confidence-based detection fundamentally difficult
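To make the role of ρ concrete, here is a minimal Python sketch of one way such a controlled dataset could be generated. It is an illustration under stated assumptions, not the paper's construction: the surname table, the function names `sample_profiles` and `match_rate`, and the mixture definition of ρ (with probability ρ the birthplace is the country associated with the surname, otherwise it is drawn uniformly from all countries) are all hypothetical.

```python
import random

# Hypothetical surname -> associated country table (illustrative only,
# not taken from the paper).
SURNAME_TO_COUNTRY = {
    "Ivanov": "Russia",
    "Tanaka": "Japan",
    "Rossi": "Italy",
    "Schmidt": "Germany",
}
COUNTRIES = sorted(set(SURNAME_TO_COUNTRY.values()))


def sample_profiles(n, rho, seed=0):
    """Return n (surname, birthplace) pairs.

    Assumed definition of rho: with probability rho the birthplace is the
    country associated with the surname; otherwise it is drawn uniformly
    from all countries, so rho = 0 makes surname and birthplace independent.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    rng = random.Random(seed)
    surnames = sorted(SURNAME_TO_COUNTRY)
    profiles = []
    for _ in range(n):
        surname = rng.choice(surnames)
        if rng.random() < rho:
            birthplace = SURNAME_TO_COUNTRY[surname]  # follows the shortcut
        else:
            birthplace = rng.choice(COUNTRIES)  # independent of the surname
        profiles.append((surname, birthplace))
    return profiles


def match_rate(profiles):
    """Fraction of profiles whose birthplace matches the surname's country."""
    hits = sum(SURNAME_TO_COUNTRY[s] == b for s, b in profiles)
    return hits / len(profiles)
```

Under this assumed definition, the expected match rate is ρ + (1 − ρ)/K for K countries, so a model that predicts birthplace from the surname alone is right more often as ρ grows, while the individuals who break the pattern become the natural candidates for confident errors.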
Evaluation Highlights
  • Hallucination detection methods (e.g., perplexity, linear probing) degrade to near-random performance as spurious correlation strength (ρ) increases from 0 to 0.9
  • Refusal fine-tuning fails to mitigate these errors; models refuse less often and recall fewer facts as spurious correlations strengthen, even at 1B parameter scale
  • Real-world validation on SimpleQA shows that higher entity co-occurrence (proxy for spurious correlation) consistently increases model confidence in incorrect answers across GPT-5 and DeepSeek-V3
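The phrase "near-random performance" can be read in terms of AUROC: a detector that flags low-confidence answers as hallucinations scores about 0.5 when wrong answers are as confident as correct ones. The sketch below is a hedged illustration of that reading, not the paper's evaluation code. The helper `auroc` is a hypothetical plain-Python implementation, and the confidence values are toy numbers invented for the example, not results from the paper.

```python
def auroc(scores, labels):
    """AUROC of `scores` for detecting labels == 1 (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Label 1 marks a hallucinated answer, 0 a correct one.
labels = [0, 0, 0, 1, 1, 1]

# Toy model confidences (made up for illustration).
# Weak shortcut: hallucinated answers come with low confidence.
conf_weak = [0.9, 0.8, 0.95, 0.3, 0.4, 0.2]
# Strong shortcut: hallucinated answers are as confident as correct ones.
conf_strong = [0.9, 0.8, 0.95, 0.9, 0.95, 0.8]

# Detector score = uncertainty; higher should mean "more likely hallucinated".
auroc_weak = auroc([1 - c for c in conf_weak], labels)
auroc_strong = auroc([1 - c for c in conf_strong], labels)

print(auroc_weak)    # 1.0: uncertainty separates errors perfectly
print(auroc_strong)  # 0.5: uncertainty carries no signal
```

In the second toy case the hallucinated and correct answers have identical confidence distributions, so no threshold on confidence can separate them, which is the failure pattern the highlights describe for large ρ.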
Breakthrough Assessment
8/10
Identifies a fundamental failure mode in current hallucination detection paradigms. The controlled synthetic setup provides clear causal evidence that spurious correlations break confidence-based defenses, a significant insight for safety research.