
Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in NLG

(Oxford) Lorenz Kuhn, Yarin Gal, Sebastian Farquhar
OATML Group, Department of Computer Science, University of Oxford
ICLR (2023)
Factuality QA Benchmark

📝 Paper Summary

Uncertainty Estimation, Hallucination Suppression
Semantic entropy estimates uncertainty in language models by clustering generations that share the same meaning using bidirectional entailment, rather than treating different phrasings of the same answer as distinct outcomes.
Core Problem
Standard predictive entropy measures uncertainty over specific token sequences, failing to account for 'semantic equivalence' where many different sentences mean the same thing.
Why it matters:
  • Models often output high entropy (uncertainty) simply because there are many ways to phrase the correct answer, not because the model doesn't know the answer
  • Reliable uncertainty measures are critical for safety in high-stakes applications like medical QA, allowing systems to abstain when unsure
  • Existing supervised methods require expensive human labels or fine-tuning, while current unsupervised methods ignore meaning entirely
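Concrete Example: If a model assigns 0.5 probability to 'Paris' and 0.5 to 'It is Paris', standard predictive entropy treats these as two distinct outcomes and reports ln 2 ≈ 0.69 nats, the maximum possible for two outcomes. Semantically, however, both strings give the same answer, so the model is 100% certain the answer is Paris. Semantic entropy merges the two strings into one meaning with probability 1.0 and correctly reports zero uncertainty.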
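The arithmetic behind this example can be checked with a short Python sketch. The dictionary names and the label "answer is Paris" are illustrative only, and entropy is measured in nats (natural logarithm), which is an assumption about units rather than something the summary specifies.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)

# Sequence-level view: two distinct strings, each with probability 0.5.
sequence_probs = {"Paris": 0.5, "It is Paris": 0.5}
predictive_entropy = entropy(sequence_probs.values())

# Meaning-level view: both strings express the same answer, so their
# probabilities are summed into a single meaning cluster.
meaning_probs = {"answer is Paris": sum(sequence_probs.values())}
semantic_entropy = entropy(meaning_probs.values())

print(f"predictive entropy: {predictive_entropy:.3f} nats")  # ln 2, about 0.693
print(f"semantic entropy:   {semantic_entropy:.3f} nats")    # 0.000
```

The only difference between the two numbers is the set of outcomes the entropy is taken over: strings in the first case, meanings in the second.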
Key Novelty
Semantic Entropy (SE)
  • Generates multiple answers from the model and clusters them based on meaning using a natural language inference (NLI) model to check bidirectional entailment (do they imply each other?)
  • Sums the probabilities of all sequences within a meaning-cluster to get the probability of the *meaning*, then computes entropy over these semantic clusters instead of raw token sequences
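The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_entails` is a hypothetical string-matching stand-in for the NLI model, the sample strings and probabilities are invented, and the sketch assumes each distinct sampled sequence comes with a probability that is renormalized over the sampled set.

```python
import math
from typing import Callable, Dict, List

def cluster_by_meaning(answers: List[str],
                       entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose
    representative it entails and is entailed by (bidirectional entailment)."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters

def semantic_entropy(seq_probs: Dict[str, float],
                     entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over meaning clusters instead of raw sequences."""
    clusters = cluster_by_meaning(list(seq_probs), entails)
    total = sum(seq_probs.values())  # renormalize over the sampled set
    cluster_probs = [sum(seq_probs[a] for a in c) / total for c in clusters]
    return sum(-p * math.log(p) for p in cluster_probs if p > 0)

def toy_entails(premise: str, hypothesis: str) -> bool:
    """HYPOTHETICAL stand-in for an NLI model: compares answers after
    lowercasing and stripping a leading 'it is ' and trailing punctuation."""
    def core(text: str) -> str:
        text = text.lower().strip(" .")
        if text.startswith("it is "):
            text = text[len("it is "):]
        return text
    return core(premise) == core(hypothesis)

# Invented sample probabilities, for illustration only.
samples = {"Paris": 0.4, "It is Paris.": 0.3, "Rome": 0.3}
clusters = cluster_by_meaning(list(samples), toy_entails)
se = semantic_entropy(samples, toy_entails)
print(clusters)
print(f"semantic entropy: {se:.3f} nats")
```

In practice the `entails` argument would wrap a real NLI classifier; passing it in as a callable keeps the clustering logic independent of whichever model is used.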
Evaluation Highlights
  • Semantic entropy outperforms standard predictive entropy and p(True) baselines on TriviaQA (closed-book) and CoQA (open-book) benchmarks
  • Performance gap widens as model size increases (up to 30B parameters) and as the number of samples increases
  • Achieves ~0.83 AUROC on TriviaQA with OPT-30B, compared with ~0.79 for lexical similarity and ~0.76 for standard predictive entropy
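For readers unfamiliar with the metric, AUROC here can be read as the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one. The sketch below computes it by direct pairwise comparison; the function name and the toy scores are assumptions made for illustration and are not data from the paper.

```python
from typing import Sequence

def auroc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Fraction of (incorrect, correct) pairs in which the incorrect answer
    has the higher uncertainty score; ties count as half a win."""
    wrong = [u for u, ok in zip(uncertainty, is_correct) if not ok]
    right = [u for u, ok in zip(uncertainty, is_correct) if ok]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))

# Toy scores, invented for illustration only.
toy_uncertainty = [0.1, 0.5, 0.7, 0.9, 0.6, 0.3]
toy_is_correct = [True, True, True, False, False, False]
toy_score = auroc(toy_uncertainty, toy_is_correct)
print(f"toy AUROC: {toy_score:.3f}")
```

A score of 0.5 corresponds to an uncertainty measure that is no better than chance at separating correct from incorrect answers, and 1.0 to perfect separation.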
Breakthrough Assessment
8/10
Simple, effective, and unsupervised solution to a fundamental problem in NLG uncertainty (semantic equivalence). Strong empirical results without requiring model modification.