Teaching models to express their uncertainty in words

📝 Paper Summary

Hallucination suppression
Calibration of language models

GPT-3 can be fine-tuned to express calibrated uncertainty about its own answers using natural language (verbalized probability), generalizing reasonably well to new tasks without relying solely on output logits.

Core Problem

Language models often hallucinate false statements with high confidence, and their internal token-level probabilities (logits) represent uncertainty over text strings rather than epistemic uncertainty about the truth of claims.

Why it matters:

Users need to know when to trust model outputs, especially for economic forecasts or scientific problems where ground truth is unknown
Standard logit-based calibration fails when a claim can be paraphrased many ways (uncertainty over tokens vs. uncertainty over claims)
Pre-trained models like GPT-3 are often uncalibrated and overconfident on tasks outside their training distribution

Concrete Example: If a model is asked 'What is 952 - 55?', it might answer '897' (correct) but assign low token probability because there are many ways to write the answer, or answer '900' (incorrect) with high confidence. A calibrated model should output '897' and explicitly state 'Confidence: 61%' or 'Medium', matching the actual probability of correctness.

Key Novelty

Verbalized Probability via Supervised Fine-Tuning

Train the model to output its confidence in natural language (e.g., '90%' or 'High confidence') alongside its answer, treating confidence as a text generation task
Use the model's own empirical accuracy on specific question types as the ground-truth label for training, allowing it to learn its own epistemic uncertainty

Architecture

The inference workflow for verbalized probability calibration

Evaluation Highlights

Verbalized probability achieves 16.4% MAD (Mean Absolute Deviation) on the out-of-distribution Multi-answer evaluation set, significantly better than the answer logit baseline (33.7% MAD)
Verbalized probability reaches 19.0% MAD on the Multiply-divide evaluation set after training only on the simpler Add-subtract set, which is worse than the baseline reported for that set in Key Results (9.4% MAD)
Stochastic 50-shot learning achieves calibration performance close to the fully fine-tuned model (roughly matching the calibration curve visually in Figure 6)

Breakthrough Assessment

8/10

This is the first demonstration of a model expressing calibrated epistemic uncertainty in natural language that generalizes to new distributions. It shifts the calibration paradigm from logits to language.

⚙️ Technical Details

Problem Definition

Setting: Calibrating a language model M such that its assigned probability p_M reflects the true probability of its answer a_M being correct: Pr(a_M is correct | p_M = p) = p.

Inputs: Arithmetic question q (e.g., 'What is 23 divided by 4?')

Outputs: Answer a_M and a confidence statement (e.g., 'Confidence: Medium' or '61%')

Pipeline Flow

Input Question q
Model generates Answer a_M (Greedy Decoding)
Model generates Confidence c_M (Verbalized token or number)
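
The three steps above can be sketched as a small parser over the model's text output. This is only an illustration: the `Answer: ... / Confidence: ...` layout, the `parse_output` name, and the word-to-probability midpoints in `WORD_TO_PROB` are assumptions, not the paper's exact prompt format.

```python
import re
from typing import Optional, Tuple

# Hypothetical mapping from confidence words to bin midpoints
# (assumes five equal-width bins; not taken from the paper).
WORD_TO_PROB = {
    "lowest": 0.1,
    "low": 0.3,
    "medium": 0.5,
    "high": 0.7,
    "highest": 0.9,
}


def parse_output(text: str) -> Tuple[str, Optional[float]]:
    """Split a completion into (answer a_M, confidence c_M in [0, 1]).

    Expects the answer and the confidence on separate lines.
    Returns None for the confidence when it is missing or unparseable.
    """
    answer_match = re.search(r"Answer:\s*(.+)", text)
    conf_match = re.search(r"Confidence:\s*(\S+)", text)
    answer = answer_match.group(1).strip() if answer_match else ""
    if conf_match is None:
        return answer, None
    token = conf_match.group(1).strip().lower()
    if token.endswith("%"):
        # Verbalized number, e.g. "61%".
        try:
            return answer, float(token[:-1]) / 100.0
        except ValueError:
            return answer, None
    # Verbalized word, e.g. "Medium".
    return answer, WORD_TO_PROB.get(token)
```

For example, `parse_output("Answer: 897\nConfidence: 61%")` yields the answer string `"897"` together with the probability `0.61`.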

System Modules

GPT-3 (Davinci)

Generate both the arithmetic answer and the verbalized probability token

Model or implementation: GPT-3 175B (Davinci)

Novel Architectural Elements

Verbalized probability output: Treating calibration as a text generation task where the model outputs its own confidence level as a token sequence

Modeling

Base Model: GPT-3 175B (Davinci)

Training Method: Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Train model to output confidence matching empirical accuracy.

Formally: Minimize cross-entropy on dataset where label is floor(100 * empirical_accuracy) or mapped word bucket.

Adaptation: Full fine-tuning

Trainable Parameters: 175B (assumed full fine-tuning via OpenAI API)

Training Data:

Training set: 'Add-subtract' (10k examples)
Data generation: Sample 100 questions per sub-task, generate zero-shot answers, calculate empirical accuracy (p_hat) for that sub-task
Label assignment: Label = floor(100 * p_hat) for numbers, or mapped to 5 bins (Lowest to Highest) for words
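
The label construction described above can be sketched as follows. The function names are hypothetical, and the five equal-width bins used for the word labels are an assumption; the summary only states that there are five bins running from Lowest to Highest.

```python
import math
from typing import List

# Five confidence words, ordered from least to most confident.
CONFIDENCE_WORDS = ["lowest", "low", "medium", "high", "highest"]


def empirical_accuracy(is_correct: List[bool]) -> float:
    """p_hat: fraction of sampled questions in a sub-task answered correctly."""
    return sum(is_correct) / len(is_correct)


def numeric_label(p_hat: float) -> int:
    """Verbalized-number label: floor(100 * p_hat)."""
    return math.floor(100 * p_hat)


def word_label(p_hat: float) -> str:
    """Verbalized-word label, assuming five equal-width bins over [0, 1]."""
    index = min(int(p_hat * 5), 4)  # p_hat == 1.0 falls into the top bin
    return CONFIDENCE_WORDS[index]
```

Every question in a sub-task receives the same label, because the label depends only on that sub-task's empirical accuracy and not on whether the individual answer was correct.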

Key Hyperparameters:

model_size: 175B parameters
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Uses OpenAI API; specific compute resources not reported

Comparison to Prior Work

vs. Answer Logit: Verbalized probability handles multiple valid answers better (e.g., 'Name a prime < 10') where specific token probability is low but answer correctness is high
vs. Indirect Logit: Verbalized probability generalizes better to the 'Multi-answer' distribution shift (16.4% vs 38.4% MAD)
vs. Logistic Regression: Verbalized probability outperforms heuristics, suggesting reliance on latent internal features
vs. Mukhoti et al. (2020) [not cited in paper]: Focuses on generative verbalization rather than focal loss scaling for calibration
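
The Answer Logit comparison above ('Name a prime < 10') can be made concrete with a toy distribution. The probabilities below are invented for illustration and are not measurements from the paper; the point is only that the probability of one specific answer string can be low while the probability of answering correctly is high.

```python
from typing import Dict, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality check, sufficient for small n."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


def answer_vs_correctness(answer_probs: Dict[str, float]) -> Tuple[float, float]:
    """Return (probability of the greedy answer, total probability of a correct answer)."""
    greedy_answer = max(answer_probs, key=answer_probs.get)
    answer_confidence = answer_probs[greedy_answer]
    prob_correct = sum(
        p for a, p in answer_probs.items() if is_prime(int(a)) and int(a) < 10
    )
    return answer_confidence, prob_correct


# Hypothetical model that spreads its probability evenly over all valid answers.
toy_probs = {"2": 0.25, "3": 0.25, "5": 0.25, "7": 0.25}
answer_confidence, prob_correct = answer_vs_correctness(toy_probs)
```

Here the confidence read off the answer tokens is 0.25, yet every candidate answer is correct, so a calibrated confidence statement would be close to 100%.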

Limitations

Relies on Supervised Fine-Tuning with empirical accuracy labels, which might be suboptimal compared to Reinforcement Learning with proper scoring rules
Model is underconfident on easy out-of-distribution tasks (overfits to the harder training distribution)
Evaluation is limited to arithmetic tasks (CalibratedMath); generalization to other domains (history, biology) is untested
Requires a model capable of high-performance arithmetic (GPT-3 175B) to be a meaningful test; smaller models fail completely

Reproducibility

No code is provided. The 'CalibratedMath' dataset is described in detail (Table 3), but the text gives no explicit download link. The experiments rely on the OpenAI API (GPT-3 Davinci), which is a closed model.

📊 Experiments & Results

Evaluation Setup

Zero-shot answer generation followed by confidence estimation. Models trained on 'Add-subtract' tasks and evaluated on 'Multi-answer' and 'Multiply-divide' tasks to test distribution shift.

Benchmarks:

CalibratedMath (Arithmetic questions with integer answers) [New]

Metrics:

Mean Squared Error (MSE) / Brier Score
Mean Absolute Deviation (MAD)
Statistical methodology: Not explicitly reported in the paper
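
A minimal sketch of the two metrics listed above, in plain Python. The equal-count binning and the default of 10 bins in `mad` are assumptions made for illustration; the summary only says that MAD compares confidence with accuracy within binned groups. Both functions expect non-empty inputs and confidences in [0, 1].

```python
from typing import Sequence


def brier_mse(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared error between confidence and the 0/1 correctness outcome."""
    total = sum((p - float(c)) ** 2 for p, c in zip(confidences, correct))
    return total / len(confidences)


def mad(confidences: Sequence[float], correct: Sequence[bool], num_bins: int = 10) -> float:
    """Mean absolute deviation between mean confidence and accuracy per bin.

    Examples are sorted by confidence and split into bins of (nearly)
    equal size; this binning scheme is an assumption.
    """
    pairs = sorted(zip(confidences, correct), key=lambda pair: pair[0])
    n = len(pairs)
    num_bins = min(num_bins, n)  # guarantees that no bin is empty
    deviations = []
    for b in range(num_bins):
        bin_pairs = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        mean_conf = sum(p for p, _ in bin_pairs) / len(bin_pairs)
        accuracy = sum(1.0 for _, c in bin_pairs if c) / len(bin_pairs)
        deviations.append(abs(mean_conf - accuracy))
    return sum(deviations) / len(deviations)
```

The two metrics can disagree: a model that always states 50% on a task with 50% accuracy has zero MAD but an MSE of 0.25, which is why both are reported.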

Key Results

Calibration generalization under distribution shift (training: Add-subtract; evaluation: Multi-answer). Lower is better; Δ = This Paper − Baseline.

Benchmark	Metric	Baseline	This Paper	Δ
CalibratedMath (Multi-answer)	MAD	33.7	16.4	-17.3
CalibratedMath (Multi-answer)	MSE	37.4	22.0	-15.4
CalibratedMath (Multi-answer)	MAD	38.4	16.4	-22.0

Results on the Multiply-divide evaluation set (harder tasks, single answer). Lower is better; Δ = This Paper − Baseline.

Benchmark	Metric	Baseline	This Paper	Δ
CalibratedMath (Multiply-divide)	MAD	9.4	19.0	+9.6
CalibratedMath (Multi-answer)	MSE	29.7	29.0	-0.7
CalibratedMath (Multiply-divide)	MSE	14.0	12.7	-1.3

Experiment Figures

Calibration curves (Reliability Diagrams) for Training and Evaluation sets comparing Verbalized probability, Indirect logit, and Answer logit

2D projection of GPT-3 embeddings for question-answer pairs in the Multiply-divide set

Main Takeaways

Verbalized probability enables models to express uncertainty about claims (epistemic) rather than tokens, handling tasks with multiple valid answers where logit-based methods fail
The model generalizes calibration to out-of-distribution tasks (different operations, difficulty levels), though it tends to be underconfident when the target task is easier than the training task
Calibration performance is likely driven by pre-trained latent representations of truth/difficulty rather than learning simple surface heuristics or mimicking human behaviors
Few-shot prompting (k=50) achieves calibration performance competitive with full fine-tuning, further evidence that the capability is latent

📚 Prerequisite Knowledge

Prerequisites

Understanding of language model logits and token probabilities
Basic probability calibration concepts (Brier score/MSE, reliability diagrams)
Supervised fine-tuning of Large Language Models (LLMs)

Key Terms

verbalized probability: Uncertainty expressed by the model in natural language words or numbers (e.g., 'High confidence', '90%') rather than read off its token-level output probabilities (logits)

logits: The raw, unnormalized scores output by the last layer of a neural network before applying softmax to get probabilities

epistemic uncertainty: Uncertainty about whether a claim is true, as opposed to token-level uncertainty about which word comes next

CalibratedMath: A suite of arithmetic tasks introduced in this paper to test calibration generalization across different difficulty levels and question formats

MAD: Mean Absolute Deviation—a metric measuring the average absolute difference between the model's predicted confidence and its actual accuracy within binned groups

MSE: Mean Squared Error—a proper scoring rule (equivalent to Brier Score) measuring the squared difference between predicted probability and the binary ground truth (correct/incorrect)

greedy decoding: A generation strategy where the model always selects the highest-probability token at each step

Expected Value decoding: A decoding method used in the few-shot experiments where the confidence is the probability-weighted average of the numeric values of the top-5 candidate tokens, giving a continuous confidence estimate instead of a single decoded number

answer logit: The normalized log-probability assigned by the model to the tokens constituting its answer

indirect logit: The log-probability of the token 'True' when the model is asked to evaluate 'True/false' on its own generated answer

linear probe: A simple linear classifier trained on top of a frozen model's internal representations (embeddings) to predict a target label
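
The Expected Value decoding entry above can be sketched as follows. The input format (a mapping from candidate token to log-probability, as a completion API might return for the top-5 tokens), the function name, and the renormalization over the numeric candidates are assumptions made for illustration.

```python
import math
from typing import Dict


def expected_confidence(top_logprobs: Dict[str, float]) -> float:
    """Probability-weighted mean of the numeric candidate tokens, in percent.

    Non-numeric candidates are skipped, and the remaining probabilities
    are renormalized so that they sum to 1 (an assumption).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for token, logprob in top_logprobs.items():
        text = token.strip().rstrip("%")
        if not text.isdigit():
            continue  # ignore candidates that are not plain integers
        weight = math.exp(logprob)
        weighted_sum += weight * float(text)
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("no numeric candidate tokens found")
    return weighted_sum / total_weight
```

With two equally likely candidates '60' and '80', the estimate is 70, a value that greedy decoding could never produce from those candidates.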