
Language Models (Mostly) Know What They Know

(Anthropic) Saurav Kadavath, Tom Conerly, ... Jared Kaplan
Anthropic
arXiv, July 2022
Factuality QA RL Benchmark

📝 Paper Summary

Hallucination suppression Calibration
Large language models can be well-calibrated on specific formats and can learn to self-evaluate the correctness of their own answers, especially when shown multiple brainstormed samples.
Core Problem
Language models often lack honesty and the ability to accurately evaluate their own confidence, struggling to distinguish between what they know and what they hallucinate.
Why it matters:
  • Models that cannot identify their own knowledge gaps may confidently state falsehoods (hallucinations), making them unreliable for high-stakes tasks
  • Current calibration techniques often require finetuning or adjustments that may not generalize to open-ended generation
  • Understanding whether honesty generalizes from training distributions (like trivia) to other domains (like math or code) is crucial for building trustworthy AI
Concrete Example: Asked 'Who was the first president?', a model that does not know the answer might still output 'Barack Obama' with high confidence. A calibrated model should assign that answer a low probability or say 'I don't know'; standard models often fail to express this uncertainty accurately.
Key Novelty
Self-Evaluation via P(True) and P(IK)
  • Self-Evaluation P(True): Ask the model to generate an answer, then ask it 'Is the proposed answer True or False?' to derive a validity probability
  • Brainstorming for Verification: Showing the model multiple of its own samples before asking it to judge one specific sample improves its ability to identify the correct one
  • P(IK) Training: Train a separate value head to predict P(IK), the 'Probability I Know' the answer to a question, supervised by whether the model's own answers sampled at unit temperature are correct
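The two signals above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes the caller can obtain the model's log-probabilities for the "True" and "False" continuations. The prompt wording, the function names (build_self_eval_prompt, p_true, empirical_p_ik), the renormalization over the two options, and the use of the fraction of correct samples as a P(IK) target are all illustrative assumptions.

```python
import math


def build_self_eval_prompt(question, proposed_answer, brainstormed=()):
    """Format a True/False self-evaluation prompt (wording is hypothetical).

    `brainstormed` optionally lists other sampled answers to show the model
    before it judges `proposed_answer`.
    """
    lines = [f"Question: {question}"]
    if brainstormed:
        lines.append("Here are some brainstormed ideas:")
        lines.extend(brainstormed)
    lines.append(f"Proposed Answer: {proposed_answer}")
    lines.append("Is the proposed answer True or False?")
    lines.append("The proposed answer is:")
    return "\n".join(lines)


def p_true(logprob_true, logprob_false):
    """Renormalize two option log-probabilities into P(True).

    Assumption: probability mass on any other continuation is ignored.
    Subtracting the max keeps the exponentials numerically stable.
    """
    m = max(logprob_true, logprob_false)
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)


def empirical_p_ik(correct_flags):
    """Fraction of sampled answers that were correct.

    Assumption: this fraction is one plausible training target for a P(IK)
    head; the exact supervision used in the paper is not restated here.
    """
    if not correct_flags:
        raise ValueError("need at least one sampled answer")
    return sum(bool(c) for c in correct_flags) / len(correct_flags)
```

For example, if the model assigns probability 0.6 to "True" and 0.2 to "False", p_true(math.log(0.6), math.log(0.2)) is about 0.75, and a question answered correctly in 3 of 4 samples gets empirical_p_ik([True, False, True, True]) == 0.75.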
Evaluation Highlights
  • The 52B model is well calibrated on BIG-Bench multiple-choice questions when the answer options are clearly lettered
  • Answer accuracy improves significantly when restricted to samples the model scores with P(True) > 0.5 (e.g., GSM8k accuracy jumps from ~20% over all samples to ~45% on the filtered subset)
  • P(IK) classifiers trained on TriviaQA generalize to distinguish known/unknown questions on Math and Lambada, though calibration transfers poorly
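The quantities behind these highlights can be illustrated with a short sketch. It assumes expected calibration error (ECE) with equal-width bins as one common way to quantify calibration, and shows accuracy restricted to samples with P(True) above a threshold. The function names and the toy data in the usage note are hypothetical, and none of the numbers come from the paper.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |confidence - accuracy| per bin.

    `probs` are predicted probabilities of being correct, `labels` are 0/1.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


def conditional_accuracy(p_true_scores, labels, threshold=0.5):
    """Accuracy over the samples whose P(True) exceeds `threshold`.

    Returns None when no sample passes the filter.
    """
    kept = [y for p, y in zip(p_true_scores, labels) if p > threshold]
    if not kept:
        return None
    return sum(kept) / len(kept)
```

On toy data with scores [0.9, 0.7, 0.4, 0.2, 0.1] and correctness labels [1, 1, 0, 1, 0], overall accuracy is 0.6 while conditional_accuracy at the default threshold is 1.0, which mirrors the kind of gap the GSM8k highlight describes.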
Breakthrough Assessment
7/10
Establishes strong baselines for self-evaluation and calibration in large models. The 'brainstorming' finding for verification is significant. Limitations in out-of-distribution (OOD) calibration prevent a higher score.