When Bias Pretends to Be Truth: How Spurious Correlations Undermine Hallucination Detection in LLMs

📝 Paper Summary

Hallucination mitigation
Robustness to spurious correlations

The paper demonstrates that spurious correlations (e.g., surname-nationality links) cause confident hallucinations that evade standard detection methods and resist mitigation strategies like refusal fine-tuning and model scaling.

Core Problem

LLMs often hallucinate by overfitting to superficial statistical associations (spurious correlations) rather than learning causal facts, generating errors that are highly confident and consistent.

Why it matters:

Existing defenses rely on uncertainty (low confidence) or inconsistency to detect errors, but spurious correlations create 'confident' errors that bypass these checks
Common mitigation strategies like refusal fine-tuning fail when models rely on strong shortcut associations
These biases persist even in frontier models (GPT-5, DeepSeek-V3), threatening reliability in high-stakes domains

Concrete Example: A model might hallucinate that an individual named 'Ivanov' was born in Russia solely because of the surname suffix '-ov', ignoring the actual ground truth in its training data. Under high spurious correlation, the model consistently and confidently outputs 'Russia' instead of the correct answer.

Key Novelty

Systematic evaluation of Spurious-Correlation-Induced Hallucinations

Introduces a controlled synthetic framework where the correlation strength (ρ) between features (e.g., surname) and attributes (e.g., birthplace) is precisely manipulated to measure impact on hallucinations
Uses 'entity co-occurrence' in Wikipedia as a real-world proxy for spurious correlation to validate findings on frontier models like GPT-5 and DeepSeek-V3
Theoretically proves that models generalizing well via kernel learning inevitably rely on these correlations, making confidence-based detection fundamentally difficult

Evaluation Highlights

Hallucination detection methods (e.g., perplexity, linear probing) degrade to near-random performance as spurious correlation strength (ρ) increases from 0 to 0.9
Refusal fine-tuning fails to mitigate these errors; models refuse less often and recall fewer facts as spurious correlations strengthen, even at 1B parameter scale
Real-world validation on SimpleQA shows that higher entity co-occurrence (proxy for spurious correlation) consistently increases model confidence in incorrect answers across GPT-5 and DeepSeek-V3

Breakthrough Assessment

8/10

Identifies a fundamental failure mode in current hallucination detection paradigms. The controlled synthetic setup provides clear causal evidence that spurious correlations break confidence-based defenses, a significant insight for safety research.

⚙️ Technical Details

Problem Definition

Setting: Fact-based Question Answering under varying degrees of training data bias

Inputs: Natural language questions about individual attributes (e.g., 'Where was [Name] born?')

Outputs: Predicted attribute value (e.g., city name) or refusal token

Pipeline Flow

Synthetic Data Generation (Profiles + Templates)
Correlation Injection (Control parameter ρ)
Model Training (Pre-training + Fine-tuning)
Inference & Detection (QA + Hallucination Detectors)

System Modules

Data Generator

Generate synthetic profiles with controlled spurious correlations

Model or implementation: Procedural generation scripts

Target Model

Learn facts and answer questions

Model or implementation: GPT-2-like architecture (various sizes) or SmolLM2-1.7B

Detector

Attempt to distinguish correct answers from hallucinations

Model or implementation: Various metrics (Perplexity, Entropy, Probing)
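
As a rough illustration of the perplexity metric named above, the sketch below scores an answer from its per-token log-probabilities and flags it when perplexity exceeds a threshold. The function names and the threshold are hypothetical, and the paper's exact scoring rule is not given in this summary.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability of the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_hallucination(token_logprobs, threshold):
    """Hypothetical detector: flag the answer when perplexity is above the threshold."""
    return perplexity(token_logprobs) > threshold

# A confidently generated answer has low perplexity and is therefore not flagged,
# whether or not it is factually correct.
confident_answer = [math.log(0.99)] * 3
uncertain_answer = [math.log(0.2)] * 3
```

With a threshold of 1.5, `confident_answer` passes the check and `uncertain_answer` is flagged, which is the blind spot the summary describes: a confident shortcut-driven error looks the same as a confident correct answer.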

Novel Architectural Elements

Controlled injection of spurious correlations via parameter ρ during synthetic data construction to stress-test detection mechanisms
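
One way to picture the role of ρ is a simple mixing rule: with probability ρ a profile receives the attribute value tied to its surname, and otherwise the value is drawn uniformly. This rule, the surname table, and all function names below are assumptions made for illustration; the summary does not specify the paper's exact construction.

```python
import random

# Hypothetical surname -> "typical" birthplace table (illustrative only).
SHORTCUT = {"Ivanov": "Russia", "Rossi": "Italy", "Tanaka": "Japan"}
COUNTRIES = sorted(set(SHORTCUT.values()))

def sample_birthplace(surname, rho, rng):
    """With probability rho return the surname's shortcut value, else draw uniformly.

    Assumed mixing rule, not necessarily the paper's definition of rho.
    """
    if rng.random() < rho:
        return SHORTCUT[surname]
    return rng.choice(COUNTRIES)

def make_profiles(n, rho, seed=0):
    """Generate n synthetic profiles with the given correlation strength."""
    rng = random.Random(seed)
    surnames = sorted(SHORTCUT)
    profiles = []
    for i in range(n):
        surname = rng.choice(surnames)
        profiles.append({
            "name": f"Person{i} {surname}",
            "surname": surname,
            "birthplace": sample_birthplace(surname, rho, rng),
        })
    return profiles

def shortcut_agreement(profiles):
    """Fraction of profiles whose birthplace matches the surname's shortcut value."""
    hits = sum(p["birthplace"] == SHORTCUT[p["surname"]] for p in profiles)
    return hits / len(profiles)
```

Under this rule, `shortcut_agreement(make_profiles(2000, rho))` rises from roughly chance level at `rho=0` to exactly 1.0 at `rho=1`, so the surname alone becomes an increasingly reliable, but non-causal, predictor of the attribute.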

Modeling

Base Model: GPT-2-like models (Modded NanoGPT) and SmolLM2-1.7B

Training Method: Supervised Fine-Tuning (SFT) and Continual Pre-training

Adaptation: Full fine-tuning

Trainable Parameters: From 124M to 1.7B parameters

Training Data:

20,000 synthetic individual profiles
Split: 10,000 Pre-training, 5,000 Fine-tuning, 5,000 Testing
Templates convert profiles to natural text
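
The split and the template step listed above can be sketched as follows. The template wording, the shuffling procedure, and the function names are hypothetical; only the 10,000 / 5,000 / 5,000 sizes come from the summary.

```python
import random

# Hypothetical templates; the paper's actual wording is not given in this summary.
TEMPLATES = [
    "{name} was born in {birthplace}.",
    "The birthplace of {name} is {birthplace}.",
]

def render(profile, rng):
    """Turn one profile dict into a natural-language sentence."""
    return rng.choice(TEMPLATES).format(**profile)

def split_profiles(profiles, sizes=(10_000, 5_000, 5_000), seed=0):
    """Shuffle and split into pre-training, fine-tuning, and test partitions."""
    if sum(sizes) != len(profiles):
        raise ValueError("split sizes must sum to the number of profiles")
    shuffled = profiles[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_pre, n_ft, _ = sizes
    return (
        shuffled[:n_pre],
        shuffled[n_pre:n_pre + n_ft],
        shuffled[n_pre + n_ft:],
    )
```

Keeping the three partitions disjoint at the level of individuals means that test questions ask about people the model never saw during fine-tuning.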

Key Hyperparameters:

correlation_coefficient_rho: Varied [0, 1]

Compute: Not reported in the paper

Comparison to Prior Work

vs. Confidence-based methods: This paper shows these fail because spurious correlations generate *high-confidence* errors
vs. Probing methods: Demonstrates that internal representations align with the spurious shortcut rather than truth when correlations are strong
vs. Refusal fine-tuning: Shows that models prefer the spurious answer over refusal even after specific safety training

Limitations

Primary controlled experiments rely on synthetic data (though validated on real models)
Real-world proxy (Jaccard similarity) is an approximation of spurious correlation, not a direct causal measure
Theoretical analysis uses a simplified kernel learning model rather than full Transformer dynamics

Reproducibility

The synthetic dataset generation methodology is described in detail (20k profiles, templates). The text does not state whether code is available. Evaluation uses public models (GPT-OSS, Qwen, DeepSeek, GPT-5 API) and a public benchmark (SimpleQA).

📊 Experiments & Results

Evaluation Setup

Synthetic QA task with controlled correlation strength, validated on real-world SimpleQA benchmark

Benchmarks:

Synthetic Biography QA (Fact Retrieval) [New]
SimpleQA (Open-domain QA)

Metrics:

Detection AUROC/Precision
QA Accuracy (Factual Recall)
Refusal Rate
Statistical methodology: Not explicitly reported in the paper
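
For readers unfamiliar with the metrics listed above, here is a minimal sketch of detection AUROC (computed as the probability that a hallucinated answer receives a higher detector score than a correct one) and of refusal rate. The function names and the refusal string are assumptions, not the paper's evaluation code.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; label 1 = hallucination, 0 = correct answer.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def refusal_rate(outputs, refusal_token="I don't know"):
    """Fraction of model outputs that exactly equal the (assumed) refusal string."""
    if not outputs:
        raise ValueError("need at least one output")
    return sum(o == refusal_token for o in outputs) / len(outputs)
```

An AUROC of 0.5 corresponds to a detector that cannot separate hallucinations from correct answers, which is the "near-random" regime the summary reports at high ρ.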

Experiment Figures

Performance of various hallucination detection methods (Precision) as a function of spurious correlation strength (ρ)

Impact of spurious correlation on Refusal Fine-tuning across model sizes (100M to 1B)

Real-world validation: Model confidence and self-consistency vs. Entity Co-occurrence (proxy for ρ) on SimpleQA

Main Takeaways

Hallucination detection methods (perplexity, entropy, probing) fail systematically as spurious correlation strength increases.
Spurious correlations create a 'confidence trap': models are highly confident in their hallucinations, rendering uncertainty-based defenses ineffective.
Refusal fine-tuning effectiveness degrades under strong spurious correlations; models fail to refuse unknown facts if a shortcut exists.
The phenomenon generalizes to state-of-the-art models (GPT-5, DeepSeek-V3), where higher entity co-occurrence leads to more confident and consistent hallucinations.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Hallucination in LLMs
Familiarity with Spurious Correlations (shortcuts)
Basic knowledge of Hallucination Detection methods (Perplexity, Entropy, Linear Probing)

Key Terms

spurious correlations: Statistical associations between variables that do not imply causation (e.g., surname predicting nationality) but are picked up by models as shortcuts

hallucination: Confident generation of incorrect or non-existent information by an LLM

refusal fine-tuning: Training a model to output a specific refusal token (e.g., 'I don't know') when it does not know the answer or is uncertain

linear probing: A method to inspect internal model representations by training a simple linear classifier on hidden states to predict truthfulness
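
A linear probe can be sketched as plain logistic regression over hidden-state vectors. The toy two-dimensional "hidden states", the learning rate, and the function names below are hypothetical; a real probe would be trained on activations extracted from the model.

```python
import math

def train_probe(hidden_states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for h, y in zip(hidden_states, labels):
            z = sum(wi * hi for wi, hi in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of label 1
            err = p - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def probe_predict(w, b, h):
    """Return 1 (predicted truthful) if the linear score is positive, else 0."""
    return 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0

# Toy data: the first coordinate separates truthful (1) from hallucinated (0) states.
states = [[1.0, 0.2], [0.9, 0.1], [-1.0, 0.3], [-0.8, -0.1]]
truthful = [1, 1, 0, 0]
w, b = train_probe(states, truthful)
```

The probe only works when truthfulness is linearly readable from the hidden states; the summary reports that under strong correlations the representations align with the shortcut instead.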

logit entropy: A measure of uncertainty based on the probability distribution of the next token; high entropy usually implies uncertainty
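
A minimal sketch of this quantity, assuming the detector sees the raw next-token logits: apply a softmax and compute the Shannon entropy in nats. The function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over next tokens."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)
```

A uniform distribution over four tokens gives the maximum entropy, log 4, while a sharply peaked distribution gives a value near zero regardless of whether the favored token is correct.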

self-consistency: A detection method that checks if a model generates the same answer across multiple sampling runs; high consistency usually implies higher confidence
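
One simple way to score self-consistency, assuming the sampled answers are short strings, is the share of samples that agree with the most common answer. The normalization step (trimming and lowercasing) and the function name are assumptions.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

For example, `self_consistency(["Russia", "russia ", " Russia", "Ukraine"])` returns `("russia", 0.75)`. A high score says only that the model repeats itself, not that the repeated answer is true.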

kernel ridge regression: A statistical learning method used in the theoretical analysis to model how neural networks generalize versus memorize data
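
For reference, the textbook form of the kernel ridge regression predictor is given here. The paper's specific kernel, regularization scaling, and notation are not stated in this summary, so the symbols (training inputs x_1, ..., x_n, label vector y, kernel k, regularization strength λ) are generic assumptions.

```latex
\hat{f}(x) = k(x)^{\top} (K + \lambda I)^{-1} y,
\qquad K_{ij} = k(x_i, x_j),
\qquad k(x) = \bigl(k(x, x_1), \dots, k(x, x_n)\bigr)^{\top}
```

When K is invertible and λ approaches 0, the predictor reproduces the training labels exactly (memorization); larger λ smooths predictions toward what similar training points share, which is where correlated features can dominate.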

Jaccard similarity: A metric used here to measure the co-occurrence of entities in texts, serving as a proxy for the strength of spurious correlations in real-world data
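
As a sketch of how such a co-occurrence proxy might be computed, assume each entity is represented by the set of document identifiers that mention it. The representation and the function name are assumptions; the summary does not describe the paper's exact procedure.

```python
def jaccard(docs_a, docs_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of document IDs."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two entities with no mentions
    return len(a & b) / len(union)
```

For instance, `jaccard({1, 2, 3}, {2, 3, 4})` is 0.5: two shared documents out of four distinct ones.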