
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, Yarin Gal
OATML, Department of Computer Science, University of Oxford
arXiv (2024)
Factuality QA Benchmark

📝 Paper Summary

Uncertainty Quantification · Hallucination Detection
Semantic Entropy Probes (SEPs) are simple linear classifiers trained on LLM hidden states to predict the model's semantic uncertainty, enabling cheap hallucination detection without sampling multiple generations at test time.
Core Problem
Detecting hallucinations in LLMs reliably often requires sampling multiple generations to measure semantic uncertainty, which increases computational cost by 5-10x.
Why it matters:
  • High computational costs hinder the practical deployment of reliable uncertainty quantification methods like Semantic Entropy (SE) in real-world applications.
  • Existing probing methods rely on ground-truth accuracy labels, which are expensive to curate and may not generalize well to out-of-distribution tasks.
  • LLMs frequently fabricate facts (hallucinate), making them untrustworthy for high-stakes domains like medicine or law without reliable detection mechanisms.
Concrete Example: Given the prompt 'What is the capital of France?', Semantic Entropy samples 5-10 answers and groups them by meaning. If the samples agree (e.g., every answer is 'Paris'), the entropy is low; if they scatter across different meanings (e.g., 'Paris', 'Rome', 'Berlin'), the entropy is high and the answer is more likely to be a hallucination. Sampling this many generations is slow. SEPs instead predict this uncertainty from the hidden state of a single generation.
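The quantity in the example above can be sketched as an entropy over meaning clusters of the sampled answers. This is a minimal illustration, not the authors' implementation: Semantic Entropy decides whether two answers share a meaning with an entailment model, whereas the hypothetical `toy_same_meaning` helper below only compares normalized strings, and cluster probabilities are estimated from sample counts.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group answers; each answer joins the first cluster it matches."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    clusters = cluster_by_meaning(answers, same_meaning)
    n = len(answers)
    return sum(-(len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    """Assumption: a string-normalization stand-in for an entailment check."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    agree = ["Paris", "paris.", " Paris "]
    scatter = ["Paris", "Rome", "Berlin", "paris"]
    print(semantic_entropy(agree, toy_same_meaning))
    print(semantic_entropy(scatter, toy_same_meaning))
```

Answers that all land in one cluster give zero entropy, while answers spread over several clusters give a larger value. Every extra sample is a full generation, which is where the 5-10x cost comes from.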
Key Novelty
Supervising hidden state probes with Semantic Entropy (SE) rather than Accuracy
  • Trains a linear probe (classifier) on the hidden states of a single generation to predict the Semantic Entropy score (uncertainty) calculated from multiple samples.
  • Eliminates the test-time cost of sampling multiple outputs; the probe acts as a proxy for the expensive sampling process.
  • Leverages the insight that model hidden states intrinsically encode semantic uncertainty, even before the full response is generated.
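The idea in the bullets above can be sketched as a logistic-regression probe fit on hidden states, with binarized Semantic Entropy as the training target. This is a toy sketch under stated assumptions: the hidden states and SE scores are synthetic (`make_toy_data` is hypothetical), the median is used as the binarization threshold purely for illustration, and plain gradient descent stands in for whatever solver one would use in practice.

```python
import math
import random
import statistics
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_data(n: int, seed: int, dim: int = 4) -> Tuple[List[List[float]], List[float]]:
    """Assumption: synthetic 'hidden states' whose first coordinate drives SE."""
    rng = random.Random(seed)
    states, se_scores = [], []
    for _ in range(n):
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        states.append(h)
        se_scores.append(max(0.0, 1.0 + h[0] + 0.1 * rng.gauss(0.0, 1.0)))
    return states, se_scores


def binarize(se_scores: List[float], threshold: float) -> List[int]:
    """Label 1 = high semantic entropy, 0 = low semantic entropy."""
    return [1 if s > threshold else 0 for s in se_scores]


def probe_score(w: List[float], b: float, h: List[float]) -> float:
    """Predicted probability that this hidden state has high SE."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def train_probe(
    states: List[List[float]], labels: List[int], lr: float = 0.5, epochs: int = 300
) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the logistic loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = probe_score(w, b, h) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


if __name__ == "__main__":
    states, se_scores = make_toy_data(200, seed=0)
    labels = binarize(se_scores, statistics.median(se_scores))
    w, b = train_probe(states, labels)
    print(probe_score(w, b, states[0]), labels[0])
```

The expensive SE scores are needed only to build the training labels. At test time, `probe_score` is a single dot product on a hidden state the model has already computed, so no additional generations are sampled.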
Evaluation Highlights
  • SEPs outperform accuracy probes on out-of-distribution generalization, achieving higher AUROC on held-out tasks (e.g., training on TriviaQA, testing on SQuAD/BioASQ).
  • Reduces the computational overhead of semantic uncertainty quantification to almost zero compared to the 5-10x cost of standard Semantic Entropy.
  • SEPs trained on the 'Token Before Generation' (last input token) perform competitively, suggesting uncertainty is encoded before the answer is even produced.
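For readers unfamiliar with the metric behind these comparisons, AUROC can be sketched as the probability that a randomly chosen hallucinated answer receives a higher detector score than a randomly chosen correct one. The function below is a minimal pairwise illustration with made-up scores; the names are hypothetical and it is not the paper's evaluation code.

```python
from typing import List


def auroc(scores: List[float], is_hallucination: List[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, is_hallucination) if y == 1]
    negatives = [s for s, y in zip(scores, is_hallucination) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Illustrative detector scores (e.g., probe outputs) and correctness labels.
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [0, 0, 1, 1]
    print(auroc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the metric does not depend on choosing a decision threshold for the probe.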
Breakthrough Assessment
7/10
Significantly improves the efficiency of uncertainty quantification. While performance doesn't beat the expensive sampling baseline, it offers a crucial speed/cost trade-off and generalizes better than standard accuracy probes.