Language Models (Mostly) Know What They Know

📝 Paper Summary

Hallucination suppression, Calibration

Large language models can be well calibrated on suitably formatted multiple-choice questions and can learn to evaluate the correctness of their own answers, especially when shown several of their own brainstormed samples.

Core Problem

Language models often lack honesty and the ability to accurately evaluate their own confidence, struggling to distinguish between what they know and what they hallucinate.

Why it matters:

Models that cannot identify their own knowledge gaps may confidently state falsehoods (hallucinations), making them unreliable for high-stakes tasks
Current calibration techniques often require finetuning or adjustments that may not generalize to open-ended generation
Understanding whether honesty generalizes from training distributions (like trivia) to other domains (like math or code) is crucial for building trustworthy AI

Concrete Example: Asked 'Who was the first president of the United States?', a model that is unsure might answer 'Barack Obama' with high confidence. A calibrated model should assign that answer a low probability or state 'I don't know', but standard models often fail to express this uncertainty accurately.

Key Novelty

Self-Evaluation via P(True) and P(IK)

Self-Evaluation P(True): Ask the model to generate an answer, then ask it 'Is the proposed answer True or False?' and take the probability it assigns to 'True' as its confidence that the answer is correct
Brainstorming for Verification: Showing the model several of its own samples before asking it to judge one specific sample improves its ability to tell correct samples from incorrect ones
P(IK) Training: Training a separate value head to predict 'Probability I Know' based on whether the model generates correct answers at unit temperature
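
The brainstorming variant of P(True) can be sketched as a prompt template. This is a minimal illustration, not the paper's exact prompt: the template wording, the function names `build_p_true_prompt` and `p_true_from_logprobs`, and the choice to renormalise over the two options are all assumptions.

```python
import math


def build_p_true_prompt(question, brainstormed, candidate):
    """Format a True/False self-evaluation prompt (hypothetical wording)."""
    lines = [f"Question: {question}"]
    if brainstormed:
        # Other samples from the same model, shown as context for the judgment.
        lines.append("Here are some brainstormed ideas:")
        lines.extend(brainstormed)
    lines += [
        f"Possible Answer: {candidate}",
        "Is the possible answer:",
        "(A) True",
        "(B) False",
        "The possible answer is:",
    ]
    return "\n".join(lines)


def p_true_from_logprobs(logprob_true, logprob_false):
    """Renormalise the two option log-probabilities into P(True) (an assumption)."""
    p_t, p_f = math.exp(logprob_true), math.exp(logprob_false)
    return p_t / (p_t + p_f)
```

In use, the prompt would be sent to the model and the log-probabilities of the '(A)' and '(B)' continuations passed to `p_true_from_logprobs`; how those log-probabilities are obtained depends on the model interface and is left out here.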

Evaluation Highlights

52B model achieves excellent calibration on BIG Bench multiple choice questions when options are clearly lettered
Accuracy rises substantially when only samples with P(True) > 0.5 are kept (e.g., GSM8k accuracy rises from ~20% over all samples to ~45% over the retained samples)
P(IK) classifiers trained on TriviaQA generalize to distinguish known/unknown questions on Math and Lambada, though calibration transfers poorly

Breakthrough Assessment

7/10

Establishes strong baselines for self-evaluation and calibration in large models. The 'brainstorming' finding for verification is significant. Limitations in OOD calibration prevent a higher score.

⚙️ Technical Details

Problem Definition

Setting: Evaluating the calibration and self-knowledge of Large Language Models on multiple choice and open-ended generation tasks

Inputs: Natural language questions (multiple choice or open-ended)

Outputs: Probabilities assigned to answer options (A/B/C...), P(True) for self-generated samples, or P(IK) scores
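
For the multiple-choice case, the probabilities over answer options can be read off the next-token logits of the option letters. A minimal sketch, assuming the logits for each letter are already available and that probability mass is renormalised over the listed options only (both the input layout and the renormalisation are assumptions):

```python
import math


def option_probabilities(option_logits):
    """Softmax restricted to the answer-letter tokens, e.g. {'A': 1.2, 'B': -0.3}."""
    if not option_logits:
        raise ValueError("need at least one option")
    top = max(option_logits.values())
    # Subtract the max logit for numerical stability before exponentiating.
    exps = {letter: math.exp(v - top) for letter, v in option_logits.items()}
    total = sum(exps.values())
    return {letter: e / total for letter, e in exps.items()}
```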

Pipeline Flow

Prompting/Sampling (Generate answers)
Self-Evaluation (Ask model to judge validity)
Aggregation/Selection (Filter based on confidence)
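
The final filtering step of this pipeline can be sketched as follows, assuming each sample already carries a P(True) score and a correctness label. The record layout and the function name `conditional_accuracy` are hypothetical.

```python
def conditional_accuracy(records, threshold=0.5):
    """records: list of (p_true, is_correct) pairs.

    Returns (base accuracy, accuracy over samples with p_true > threshold,
    fraction of samples kept). The conditional accuracy is None if nothing is kept.
    """
    if not records:
        raise ValueError("need at least one record")
    base = sum(bool(c) for _, c in records) / len(records)
    kept = [bool(c) for p, c in records if p > threshold]
    cond = sum(kept) / len(kept) if kept else None
    return base, cond, len(kept) / len(records)
```

Reporting the kept fraction alongside the conditional accuracy matters: a high conditional accuracy is less useful if the filter discards nearly every sample.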

System Modules

Generator

Generate candidate answers for open-ended questions

Model or implementation: Anthropic LM (52B parameters)

Self-Evaluator (P(True)) (Evaluation)

Estimate probability that a specific generated sample is correct

Model or implementation: Anthropic LM (52B parameters)

P(IK) Head (Evaluation)

Predict if the model knows the answer before generating it

Model or implementation: Value head on top of LM

Novel Architectural Elements

P(IK) Value Head: A specialized head trained on the binary correctness of the model's own T=1 samples to predict 'knowability'
Self-Evaluation with Brainstorming: Conditioning the validity judgment of one sample on the context of several other generated samples

Modeling

Base Model: Anthropic Language Models (800M, 3B, 12B, 52B parameters)

Training Method: Finetuning with a value head for P(IK)

Objective Functions:

Purpose: Train P(IK) head to predict correctness.

Formally: Cross-entropy loss against soft labels derived from 30 T=1 samples per question. For example, a question with 20 correct and 10 incorrect samples yields 20 (Q, IK) datapoints and 10 (Q, IDK) datapoints.
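
One way to see why hard-labelled datapoints act as a soft label: suppose k of the N samples for a question Q are correct, and let p denote the head's predicted P(IK) for Q (notation introduced here for illustration only). The binary cross-entropy summed over the N datapoints is

```latex
\mathcal{L}(Q) = -\sum_{i=1}^{N} \bigl[ y_i \log p + (1 - y_i) \log (1 - p) \bigr]
             = -N \Bigl[ \tfrac{k}{N} \log p + \bigl(1 - \tfrac{k}{N}\bigr) \log (1 - p) \Bigr],
\qquad y_i \in \{0, 1\},\quad \sum_{i=1}^{N} y_i = k .
```

Setting the derivative with respect to p to zero gives p = k/N, so with k = 20 and N = 30 the loss is minimised at p = 2/3.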

Training Data:

Generated 30 samples at T=1 for each question
Created dataset of (Few-Shot Prompt + Question, Ground Truth Label)
Ground Truth P(IK) defined as fraction of correct samples
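
The data construction above can be sketched in a few lines. The function names and the string labels "IK" and "IDK" as Python values are illustrative assumptions; grading of each sample against the reference answer is assumed to have happened already.

```python
def ik_datapoints(prompt, sample_correct):
    """Expand one question's graded samples into hard-labelled training pairs."""
    return [(prompt, "IK" if ok else "IDK") for ok in sample_correct]


def ground_truth_p_ik(sample_correct):
    """Fraction of samples graded correct, used as the ground-truth P(IK)."""
    if not sample_correct:
        raise ValueError("need at least one graded sample")
    return sum(bool(ok) for ok in sample_correct) / len(sample_correct)
```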

Key Hyperparameters:

pretraining_tokens: 850B
sampling_temperature_eval: 1.0
sampling_samples_count: 30

Compute: Not reported in the paper

Comparison to Prior Work

vs. BIG Bench analysis: Extends to self-evaluation of open-ended generation, not just multiple choice
vs. Lin et al. (2022): Focuses on P(True) and P(IK) signals specifically, and investigates the effect of 'brainstorming' context
vs. Self-Consistency [not cited in paper]: Self-Consistency aggregates votes; this method asks the model to explicitly judge validity (P(True))

Limitations

P(IK) training generalizes poorly in terms of calibration to out-of-distribution tasks (e.g., training on TriviaQA and testing on Math)
Self-evaluation is harder than multiple choice; models are often overconfident in their own samples
'None of the above' options confuse the model and degrade calibration/accuracy
Analysis is limited to Anthropic's specific model family (up to 52B)

Reproducibility

Not provided: Code, model weights, and training datasets are not released. The paper uses internal Anthropic models. Prompts and formatting details are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Evaluation of calibration and self-knowledge on multiple benchmarks

Benchmarks:

TriviaQA (Open-domain question answering)
MMLU (Multi-task multiple choice knowledge)
BIG Bench (Diverse few-shot tasks)
GSM8k (Math word problems requiring reasoning)
Lambada (Word prediction/completion)
Codex HumanEval (Python code generation)
TruthfulQA (Adversarial question answering)

Metrics:

Expected Calibration Error (ECE)
RMS Calibration Error
Brier Score
AUROC
Accuracy (Base vs Conditional)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results demonstrating calibration performance on multiple choice tasks.
BIG Bench (Multiple Choice)	ECE (Expected Calibration Error)	0.05	0.025	-0.025
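Self-evaluation results showing the benefit of conditioning on high confidence (P(True)): Baseline is accuracy over all samples, This Paper is accuracy over the samples the model rates as likely true.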
GSM8k	Accuracy	0.20	0.45	+0.25
Codex HumanEval	Accuracy	0.55	0.85	+0.30
Comparison of self-evaluation methods using Brier Score (lower is better).
Lambada	Brier Score	0.27	0.13	-0.14

Main Takeaways

Larger models are better calibrated on multiple choice questions, provided the format uses explicit lettered options (A, B, C)
Replacing an answer option with 'None of the above' significantly degrades both accuracy and calibration
Self-evaluation (P(True)) lets models filter for much more accurate answers by separating correct samples from incorrect ones
Showing the model multiple 'brainstormed' samples (comparison examples) significantly improves self-evaluation Brier scores compared to showing a single sample
P(IK) (Probability I Know) training generalizes well in terms of ranking (AUROC) across tasks, but calibration fails to transfer out-of-distribution (models become under/over-confident)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Language Model sampling (temperature)
Concepts of calibration (ECE) and proper scoring rules (Brier score)
Basic knowledge of few-shot prompting

Key Terms

P(True): The probability a model assigns to the option 'True' when asked if a specific sample answer is correct

P(IK): Probability that 'I Know'—the probability a model assigns to the proposition that it will answer a given question correctly

Calibration: The alignment between a model's predicted probabilities and the actual frequency of correctness (e.g., if a model predicts 70% confidence, it should be right 70% of the time)

ECE: Expected Calibration Error—a metric summarizing calibration by averaging the absolute difference between predicted probabilities and actual accuracy across bins
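
A minimal sketch of the ECE computation described above, assuming equal-width confidence bins. The binning scheme and the default of 10 bins are assumptions; the paper may bin differently.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, bool(c)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```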

AUROC: Area Under the Receiver Operating Characteristic curve, a metric measuring how well a model's scores separate two classes (e.g., correct vs. incorrect answers) independent of any particular decision threshold
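
AUROC can be computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses that pairwise definition (quadratic time, fine for illustration); the function name is an assumption. Because only the ordering of scores matters, any strictly increasing rescaling leaves AUROC unchanged even though it can ruin calibration.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 1]
# A strictly increasing transform: the ranking, and hence AUROC, is preserved.
squeezed = [0.5 + 0.1 * s for s in scores]
same_ranking = auroc(scores, labels) == auroc(squeezed, labels)
```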

RLHF: Reinforcement Learning from Human Feedback—a method to fine-tune language models using human preferences

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

Brier Score: A proper scoring rule equal to the mean squared difference between predicted probabilities and the binary outcomes; lower scores indicate better calibration and discrimination
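
For binary outcomes the Brier score is the mean squared error between the predicted probability and the 0/1 outcome. A minimal sketch (the function name is an assumption):

```python
def brier_score(probs, outcomes):
    """Mean of (predicted probability - outcome)^2, with outcomes in {0, 1}."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - float(bool(o))) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

A predictor that always outputs 0.5 scores 0.25 regardless of the outcomes, which makes 0.25 a useful reference point when reading Brier scores.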

OOD: Out-of-Distribution—evaluating a model on data types or tasks it was not explicitly trained on

Value Head: An additional neural network layer added to a language model to predict a scalar value (like P(IK)) rather than the next token

Unit temperature: Sampling with temperature T=1, meaning tokens are drawn directly from the model's predicted distribution, without sharpening (T<1) or smoothing (T>1)
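
Temperature sampling divides the logits by T before the softmax. A minimal sketch, with hypothetical function names and a plain list of logits standing in for the model's output:

```python
import math
import random


def temperature_probs(logits, temperature=1.0):
    """Softmax of logits / temperature; T=1 leaves the distribution unchanged."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```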

None of the above: A multiple-choice option explicitly tested in the paper, which was found to degrade model performance and calibration