Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in NLG

📝 Paper Summary

Uncertainty Estimation, Hallucination Suppression

Semantic entropy estimates uncertainty in language models by clustering generations that share the same meaning using bidirectional entailment, rather than treating different phrasings of the same answer as distinct outcomes.

Core Problem

Standard predictive entropy measures uncertainty over specific token sequences, failing to account for 'semantic equivalence' where many different sentences mean the same thing.

Why it matters:

Models often output high entropy (uncertainty) simply because there are many ways to phrase the correct answer, not because the model doesn't know the answer
Reliable uncertainty measures are critical for safety in high-stakes applications like medical QA, allowing systems to abstain when unsure
Existing supervised methods require expensive human labels or fine-tuning, while current unsupervised methods ignore meaning entirely

Concrete Example: If a model assigns 0.5 probability to 'Paris' and 0.5 to 'It is Paris', standard entropy calculates high uncertainty (split between two outcomes). However, semantically, the model is 100% certain the answer is Paris. Semantic entropy correctly identifies this as low uncertainty.
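The arithmetic behind this example can be checked directly. The snippet below is only an illustrative sketch: the helper name `entropy` and the cluster label are made up here, and entropies are reported in nats.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

# Sequence-level view: two distinct strings, each with probability 0.5.
sequence_probs = {"Paris": 0.5, "It is Paris": 0.5}
naive = entropy(sequence_probs.values())

# Meaning-level view: both strings share one meaning, so their mass is summed.
meaning_probs = {"the answer is Paris": sum(sequence_probs.values())}
semantic = entropy(meaning_probs.values())

print(f"sequence-level entropy: {naive:.3f} nats")   # equals ln 2
print(f"semantic entropy:       {semantic:.3f} nats")
```

The sequence-level value is ln 2 (about 0.693 nats), while the meaning-level value is exactly zero because all probability mass sits in one cluster.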

Key Novelty

Semantic Entropy (SE)

Generates multiple answers from the model and clusters them based on meaning using a natural language inference (NLI) model to check bidirectional entailment (do they imply each other?)
Sums the probabilities of all sequences within a meaning-cluster to get the probability of the *meaning*, then computes entropy over these semantic clusters instead of raw token sequences
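The second step can be sketched in log space, which is how sequence likelihoods are usually available. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the cluster labels are assumed to be given, and renormalising over the sampled clusters is an assumption made here because a finite sample rarely covers all probability mass.

```python
import math
from collections import defaultdict

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def semantic_entropy(seq_logprobs, cluster_ids):
    """Entropy in nats over meaning clusters.

    seq_logprobs: log p(s | x) for each distinct sampled sequence.
    cluster_ids:  meaning-cluster label for each sequence, in the same order.
    """
    by_cluster = defaultdict(list)
    for lp, cid in zip(seq_logprobs, cluster_ids):
        by_cluster[cid].append(lp)
    # log p(c | x) = log of the summed sequence probabilities inside cluster c
    cluster_logps = [logsumexp(lps) for lps in by_cluster.values()]
    # Assumption: renormalise so the sampled clusters form a distribution.
    log_z = logsumexp(cluster_logps)
    return sum(math.exp(lp - log_z) * (log_z - lp) for lp in cluster_logps)

# Three sequences; the first two share a meaning.
logps = [math.log(0.4), math.log(0.3), math.log(0.3)]
print(semantic_entropy(logps, [0, 0, 1]))  # entropy of the cluster masses (0.7, 0.3)
print(semantic_entropy(logps, [0, 1, 2]))  # no merging: plain sequence entropy
```

Merging sequences can only lower or preserve the entropy, which is why paraphrases of one answer stop inflating the uncertainty score.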

Evaluation Highlights

Semantic entropy outperforms standard predictive entropy and p(True) baselines on TriviaQA (closed-book) and CoQA (open-book) benchmarks
Performance gap widens as model size increases (up to 30B parameters) and as the number of samples increases
Achieves ~0.83 AUROC on TriviaQA with OPT-30B, higher than lexical similarity (~0.79) or standard entropy (~0.76)

Breakthrough Assessment

8/10

Simple, effective, and unsupervised solution to a fundamental problem in NLG uncertainty (semantic equivalence). Strong empirical results without requiring model modification.

⚙️ Technical Details

Problem Definition

Setting: Uncertainty estimation for free-form Natural Language Generation (NLG) tasks, specifically Question Answering

Inputs: Context/Question x

Outputs: Uncertainty score (Predictive Entropy over semantic meanings)

Pipeline Flow

Generation (Sample M sequences from model)
Clustering (Group sequences by meaning via NLI)
Estimation (Sum probabilities per cluster, compute entropy)
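The clustering stage of this pipeline can be sketched as a greedy loop. Everything here is illustrative: `cluster_by_meaning` and `toy_entails` are hypothetical names, the hard-coded answers stand in for the Generation step, and the toy predicate (matching the last word) is only a placeholder for a real NLI model.

```python
def cluster_by_meaning(answers, entails):
    """Group answer indices into clusters of mutually entailing answers.

    entails(a, b) is any predicate returning True when a entails b.
    Each new answer is compared with the first member of every existing cluster.
    """
    clusters = []
    for i, answer in enumerate(answers):
        for cluster in clusters:
            representative = answers[cluster[0]]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no cluster matched: start a new meaning
    return clusters

def toy_entails(a, b):
    """Placeholder for an NLI model: compares the final word only."""
    last = lambda s: s.lower().rstrip(".").split()[-1]
    return last(a) == last(b)

answers = ["Paris", "It is Paris", "Rome", "It is Paris."]
clusters = cluster_by_meaning(answers, toy_entails)
print(clusters)
```

The Estimation stage would then sum sequence probabilities inside each cluster and take the entropy of the resulting cluster masses.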

System Modules

Generator

Produce M candidate answers for the given context

Model or implementation: OPT (2.7B, 6.7B, 13B, 30B)

Semantic Clusterer

Determine if pairs of generated sequences mean the same thing using bidirectional entailment

Model or implementation: Deberta-large-mnli
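One way to wire up such a check with the Hugging Face `transformers` library is sketched below. Treat it as an assumption-laden illustration rather than the authors' code: the checkpoint name `microsoft/deberta-large-mnli`, the helper names, and the choice to take the argmax label are all assumptions, and in practice the question may need to be prepended to each answer so the NLI model sees enough context.

```python
def bidirectional_entailment(a, b, entails):
    """True when a entails b AND b entails a."""
    return entails(a, b) and entails(b, a)

def make_nli_entails(model_name="microsoft/deberta-large-mnli"):
    """Build an entails(premise, hypothesis) predicate backed by an NLI classifier."""
    # Imported lazily so the pure helper above works without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
    # Look the label index up from the config instead of hard-coding it.
    label2id = {name.lower(): idx for name, idx in model.config.label2id.items()}
    entail_id = label2id["entailment"]

    def entails(premise, hypothesis):
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.argmax(dim=-1).item() == entail_id

    return entails

# Usage (downloads the checkpoint on first call):
#   entails = make_nli_entails()
#   bidirectional_entailment("Paris", "It is Paris", entails)
```

Requiring entailment in both directions matters: "It is a city in France" is entailed by "It is Paris" but not the other way round, so the two should not share a cluster.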

Entropy Estimator

Compute the semantic entropy over the clusters

Model or implementation: Mathematical calculation

Novel Architectural Elements

Integration of an external NLI model (Deberta) specifically to cluster the output space of a generative LLM for uncertainty estimation (Semantic clustering loop)

Modeling

Base Model: OPT (ranging from 2.7B to 30B parameters)

Compute: Inference only. Generator: OPT-30B requires significant VRAM (roughly 60 GB for the weights alone in 16-bit precision). Clusterer: Deberta-large is relatively lightweight.

Comparison to Prior Work

vs. p(True): Does not require constructing specific prompts or asking the model to self-evaluate; handles multiple hypotheses naturally
vs. Length-normalised entropy: Aggregates probability mass by meaning, not just sequence form
vs. Lexical similarity: Uses NLI for true semantic equivalence rather than just word overlap
vs. BS-Detector [not cited in paper]: Semantic entropy focuses on clustering entire answers via NLI, whereas BS-Detector often looks at token-level inconsistencies or uncertainty
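The contrast with lexical similarity can be made concrete. The sketch below assumes the baseline is a mean pairwise ROUGE-L F-score over whitespace tokens; the function names are hypothetical and real ROUGE implementations apply extra normalisation.

```python
from itertools import combinations

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F-score on lowercased whitespace tokens."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    lcs = lcs_len(ref, hyp)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(hyp), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_pairwise_similarity(answers):
    """Average ROUGE-L over all unordered pairs; needs at least two answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two answers")
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)

print(rouge_l_f1("Paris", "It is Paris"))
```

Under this measure "Paris" and "It is Paris" score only 0.5 despite meaning the same thing, whereas a bidirectional entailment check would place them in a single cluster.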

Limitations

Relies on the accuracy of the NLI model (Deberta); if NLI fails, clustering is incorrect
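Computationally more expensive than standard entropy due to NLI pair comparisons: O(M^2) in the worst case, although comparing each new generation against a single representative of every existing cluster keeps the count lower when many answers share a meaning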
Does not account for 'unknown' unknowns (epistemic uncertainty) beyond what is captured by the predictive distribution
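The cost of clustering depends on how many distinct meanings appear. The sketch below counts pair checks under an assumed greedy scheme in which each new answer is compared with one representative per existing cluster; `count_pair_checks` is a hypothetical helper, the integer labels stand in for true meanings, and each pair check corresponds to two NLI forward passes (one per direction).

```python
def count_pair_checks(meaning_labels):
    """Number of answer pairs compared by greedy representative-based clustering."""
    representatives, checks = [], 0
    for label in meaning_labels:
        for rep in representatives:
            checks += 1
            if rep == label:      # same meaning: joins this cluster, stop searching
                break
        else:
            representatives.append(label)  # new meaning: becomes a representative
    return checks

M = 10
print(count_pair_checks([0] * M))         # all answers agree
print(count_pair_checks(list(range(M))))  # every answer differs
```

With ten answers that all agree the loop needs M - 1 = 9 checks, while ten mutually distinct answers need M(M - 1)/2 = 45, which is the quadratic worst case.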

Reproducibility

Code: https://github.com/lorenzkuhn/semantic_uncertainty

The code is publicly available, along with a hand-labelled semantic equivalence dataset. The method uses the open-source OPT models and Deberta-large.

📊 Experiments & Results

Evaluation Setup

Correctness prediction from uncertainty: can the uncertainty score distinguish between correct and incorrect answers?

Benchmarks:

TriviaQA (Closed-book Question Answering)
CoQA (Open-book Conversational Question Answering)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper
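For this setup, AUROC can be read as the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one. The following is a minimal pairwise sketch with made-up scores and a hypothetical function name; it is quadratic in the number of examples, so library implementations are preferable for large evaluations.

```python
def auroc(scores, is_positive):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5.

    Here a 'positive' is an incorrect answer and the score is its uncertainty.
    """
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative numbers only, not results from the paper.
uncertainty = [0.1, 0.8, 0.9, 0.2]
incorrect = [False, False, True, True]
print(auroc(uncertainty, incorrect))
```

A value of 0.5 means the uncertainty score carries no information about correctness, and 1.0 means every incorrect answer is scored as more uncertain than every correct one.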

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	AUROC	0.79	0.83	+0.04
CoQA	AUROC	0.73	0.77	+0.04
TriviaQA	AUROC	0.68	0.83	+0.15

Experiment Figures

AUROC performance on TriviaQA across different model sizes (2.7B to 30B)

Impact of sample count (left) and temperature (right) on AUROC

Main Takeaways

Semantic entropy consistently outperforms standard predictive entropy, p(True), and lexical similarity across datasets and model sizes
The method scales well: performance improves as the underlying language model size increases (from 2.7B to 30B)
Sampling temperature is critical; an intermediate temperature (e.g., 0.5) balances diversity and accuracy better than 1.0
Incorrect answers tend to have a higher number of semantically distinct clusters (3.89 vs 1.89 on TriviaQA), validating the core hypothesis

📚 Prerequisite Knowledge

Prerequisites

Information Theory (Entropy)
Natural Language Inference (Entailment)
Monte Carlo Integration

Key Terms

Semantic Equivalence: The property where two different sequences of text (e.g., 'Paris' and 'France's capital is Paris') share the same underlying meaning

Predictive Entropy: The entropy of the model's predictive distribution over outputs, used as a measure of uncertainty; higher entropy means the model is less sure

NLI: Natural Language Inference—a classification task determining whether one sentence (hypothesis) logically follows from another (premise)

Bidirectional Entailment: Two sentences are considered semantically equivalent if sentence A entails sentence B AND sentence B entails sentence A

Rouge-L: A metric measuring the longest common subsequence between two texts, often used for evaluating text generation quality

AUROC: Area Under the Receiver Operating Characteristic curve—a metric for binary classification (here, predicting if an answer is correct) where 0.5 is random and 1.0 is perfect

Monte Carlo Integration: A technique to estimate the value of an integral (here, the entropy) by averaging the results of random samples

OPT: Open Pre-trained Transformer—a series of open-source large language models similar to GPT-3
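The Monte Carlo Integration entry can be illustrated on a toy distribution where the exact entropy is known. This is a sketch under stated assumptions: the three-outcome distribution, the function names, and the sample size are all invented for illustration, and a language model would supply samples and their likelihoods instead of `random.choices`.

```python
import math
import random

def exact_entropy(probs):
    """Shannon entropy in nats, computed from the full distribution."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def mc_entropy(probs, n_samples, rng):
    """Monte Carlo estimate: average of -log p(outcome) over sampled outcomes."""
    draws = rng.choices(range(len(probs)), weights=probs, k=n_samples)
    return sum(-math.log(probs[i]) for i in draws) / n_samples

probs = [0.5, 0.25, 0.25]
rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = mc_entropy(probs, 20_000, rng)
print(f"exact:    {exact_entropy(probs):.4f} nats")
print(f"estimate: {estimate:.4f} nats")
```

The estimate approaches the exact value as the number of samples grows, which is consistent with the observation that more sampled generations improve the uncertainty estimate.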