Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

📝 Paper Summary

Uncertainty Quantification, Hallucination Detection

Semantic Entropy Probes (SEPs) are simple linear classifiers trained on LLM hidden states to predict the model's semantic uncertainty, enabling cheap hallucination detection without expensive multiple sampling.

Core Problem

Detecting hallucinations in LLMs reliably often requires sampling multiple generations to measure semantic uncertainty, which increases computational cost by 5-10x.

Why it matters:

High computational costs hinder the practical deployment of reliable uncertainty quantification methods like Semantic Entropy (SE) in real-world applications.
Existing probing methods rely on ground-truth accuracy labels, which are expensive to curate and may not generalize well to out-of-distribution tasks.
LLMs frequently fabricate facts (hallucinate), making them untrustworthy for high-stakes domains like medicine or law without reliable detection mechanisms.

Concrete Example: Given the prompt 'What is the capital of France?', a model might generate 'Paris'. To judge how certain the model is, Semantic Entropy samples 5-10 further generations and checks whether they agree in meaning (e.g., all 'Paris' versus a mix of 'Paris', 'Rome', 'Berlin'). This sampling is slow. SEPs instead predict the uncertainty from a single hidden state of the first generation.

Key Novelty

Supervising hidden state probes with Semantic Entropy (SE) rather than Accuracy

Trains a linear probe (classifier) on the hidden states of a single generation to predict the Semantic Entropy score (uncertainty) calculated from multiple samples.
Eliminates the test-time cost of sampling multiple outputs; the probe acts as a proxy for the expensive sampling process.
Leverages the insight that model hidden states intrinsically encode semantic uncertainty, even before the full response is generated.

Evaluation Highlights

SEPs outperform accuracy probes on out-of-distribution generalization, achieving higher AUROC on held-out tasks (e.g., training on TriviaQA, testing on SQuAD/BioASQ).
Reduces the computational overhead of semantic uncertainty quantification to almost zero compared to the 5-10x cost of standard Semantic Entropy.
SEPs trained on the 'Token Before Generation' (last input token) perform competitively, suggesting uncertainty is encoded before the answer is even produced.

Breakthrough Assessment

7/10

Significantly improves the efficiency of uncertainty quantification. While performance doesn't beat the expensive sampling baseline, it offers a crucial speed/cost trade-off and generalizes better than standard accuracy probes.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of model uncertainty/hallucination using internal representations

Inputs: Hidden state h from layer l at token position p for a single generation

Outputs: Predicted probability of high semantic entropy (binary label)

Pipeline Flow

Input Query → LLM Forward Pass
Hidden State Extraction
SEP Classification (Inference)

System Modules

LLM Forward Pass

Process input query and generate response (greedy decoding)

Model or implementation: Llama-2 (7B/70B), Mistral 7B, Phi-3 Mini, or Llama-3-70B

Hidden State Extraction

Select specific hidden state vector for probing

Model or implementation: Selector

SEP Classifier

Predict if semantic entropy is high or low

Model or implementation: Logistic regression (linear probe)

Novel Architectural Elements

Integration of a lightweight linear probe (SEP) operating on single-pass hidden states specifically to predict multi-sample semantic uncertainty metrics.

Modeling

Base Model: Llama-2-7B, Llama-2-70B, Mistral-7B, Phi-3-Mini, Llama-3-70B

Training Method: Logistic Regression Training (Probing)

Objective Functions:

Purpose: Optimize the threshold for binarizing continuous Semantic Entropy scores into labels.

Formally: Maximizing information gain (similar to regression trees).
Purpose: Train the linear probe to classify hidden states.

Formally: Logistic Regression objective.
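The two objectives above can be sketched end to end. This is a minimal illustration under stated assumptions, not the paper's implementation: the threshold search uses variance reduction (minimizing within-group squared error), the usual split criterion in regression trees, as a stand-in for the information-gain objective named above; the probe is trained with plain gradient descent; and the hidden states and SE scores are synthetic. The names `best_split_threshold` and `train_logistic_probe` are hypothetical.

```python
import numpy as np


def best_split_threshold(se_scores: np.ndarray) -> float:
    """Pick the threshold that minimizes the summed squared error of the two groups."""
    s = np.sort(se_scores)
    candidates = (s[:-1] + s[1:]) / 2.0  # midpoints between consecutive scores
    best_t, best_cost = float(candidates[0]), float("inf")
    for t in candidates:
        low, high = s[s < t], s[s >= t]
        if len(low) == 0 or len(high) == 0:
            continue  # skip degenerate splits caused by duplicate scores
        cost = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t


def train_logistic_probe(h: np.ndarray, y: np.ndarray, lr: float = 0.1, steps: int = 2000):
    """Fit weights w and bias b of a logistic-regression probe by gradient descent."""
    n, d = h.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))  # predicted P(high SE)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (h.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


# Synthetic stand-ins: 200 "hidden states" of dimension 8, with SE driven by feature 0.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 8))
se = np.where(hidden[:, 0] > 0, 1.5, 0.1) + 0.05 * rng.normal(size=200)

threshold = best_split_threshold(se)
labels = (se >= threshold).astype(float)  # 1 = high semantic entropy
w, b = train_logistic_probe(hidden, labels)
accuracy = ((hidden @ w + b > 0) == labels.astype(bool)).mean()
print(f"threshold={threshold:.2f}, train accuracy={accuracy:.2f}")
```

At test time only the last two lines of the probe matter: one dot product per query, which is why the overhead is negligible compared with sampling.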

Adaptation: Linear Probe (Logistic Regression)

Training Data:

Inputs: Hidden states from QA datasets (TriviaQA, SQuAD, BioASQ, NQ Open)
Labels: Binarized Semantic Entropy (SE) scores derived from 10 sampled generations per input
SE calculation uses DeBERTa-Large or GPT-3.5 for entailment
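The label construction described above can be illustrated with a short sketch. It assumes that two answers share a cluster when each entails the other, and that the entropy is taken over cluster frequencies among the samples; the summary does not say which estimator the paper uses. The `toy_entails` function is a hypothetical stand-in for the NLI model (DeBERTa-Large or GPT-3.5) and only compares normalized strings.

```python
import math
from typing import Callable, List


def cluster_by_meaning(answers: List[str], entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedily group answers; two answers share a cluster if each entails the other."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member
            if entails(rep, ans) and entails(ans, rep):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])  # no existing cluster matched
    return clusters


def semantic_entropy(answers: List[str], entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over the fraction of samples falling in each semantic cluster."""
    clusters = cluster_by_meaning(answers, entails)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for an NLI model: exact match after normalization."""
    return a.strip().lower().rstrip(".") == b.strip().lower().rstrip(".")


samples = ["Paris", "paris.", "Paris", "Rome", "Berlin"]
print(round(semantic_entropy(samples, toy_entails), 4))  # three clusters of sizes 3, 1, 1
```

When every sample lands in the same cluster the entropy is zero, so lexical variation alone ('Paris' vs. 'paris.') does not register as uncertainty.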

Key Hyperparameters:

number_of_samples_for_SE_ground_truth: 10
sampling_temperature: 1
probing_layers: All layers evaluated
token_positions: TBG (Token Before Generation), SLT (Second Last Token)
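The layer and token-position choices listed above amount to indexing into the stack of hidden states. The following sketch assumes the states are already available as an array of shape (num_layers, seq_len, hidden_dim) covering prompt plus response, and that the final token is EOS; `select_probe_state` and the toy array are hypothetical.

```python
import numpy as np


def select_probe_state(hidden_states: np.ndarray, prompt_len: int, layer: int, position: str) -> np.ndarray:
    """Return the hidden vector a probe would read.

    hidden_states: (num_layers, seq_len, hidden_dim) for prompt + response tokens,
    where the final token is assumed to be EOS.
    """
    if position == "TBG":
        idx = prompt_len - 1               # last token of the input query
    elif position == "SLT":
        idx = hidden_states.shape[1] - 2   # last response token, just before EOS
    else:
        raise ValueError(f"unknown position: {position!r}")
    return hidden_states[layer, idx]


# Toy array whose entries encode (layer, token) as layer * 100 + token index.
num_layers, seq_len, dim, prompt_len = 4, 10, 3, 6
toy = np.zeros((num_layers, seq_len, dim))
for l in range(num_layers):
    for t in range(seq_len):
        toy[l, t, :] = l * 100 + t

tbg = select_probe_state(toy, prompt_len, layer=2, position="TBG")
slt = select_probe_state(toy, prompt_len, layer=2, position="SLT")
print(tbg[0], slt[0])
```

A TBG probe only needs the prompt's forward pass, so it can flag likely uncertainty before any response token is generated.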

Compute: Not reported in the paper

Comparison to Prior Work

vs. Semantic Entropy (SE): SEP is 5-10x cheaper (single forward pass vs. multiple samples) but only approximates the SE value rather than computing it.
vs. Accuracy Probes: SEP is trained on unsupervised uncertainty (SE) rather than supervised correctness labels; SEP generalizes better to OOD tasks.
vs. Naive Entropy: SEP captures semantic meaning, avoiding conflation of lexical variety with semantic uncertainty.

Limitations

Cannot match the absolute performance of the expensive sampling-based Semantic Entropy method (the 'teacher').
Relies on the quality of the Semantic Entropy estimation itself, which depends on the NLI model (DeBERTa/GPT-3.5).
Binary classification of uncertainty loses fine-grained information compared to raw SE scores.
Generalization is better than accuracy probes but performance still drops on difficult out-of-distribution tasks.

Reproducibility

Code availability is not provided in the paper text. The method relies on standard QA datasets and open-source models (Llama, Mistral). The process requires generating ground-truth SE labels which involves sampling 10 generations per query.

📊 Experiments & Results

Evaluation Setup

Hallucination detection on short-form and long-form QA tasks using hidden state probes.

Benchmarks:

TriviaQA (Short-form QA)
SQuAD (Short-form QA)
BioASQ (Short-form biomedical QA)
NQ Open (Short-form QA)

Metrics:

AUROC (Area Under Receiver Operating Characteristic)
Statistical methodology: Not explicitly reported in the paper
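For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the scores and labels are invented for illustration, with label 1 assumed to mark a hallucinated answer.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented probe outputs: higher score = more likely hallucinated.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
print(auroc(scores, labels))  # 5 of the 6 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the transfer results reported in this section.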

Key Results

Generalization experiments where probes are trained on TriviaQA and evaluated on other datasets (SQuAD, BioASQ, NQ Open). SEPs generally outperform accuracy probes in OOD settings.

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD (Transfer from TriviaQA)	AUROC	0.58	0.68	+0.10
BioASQ (Transfer from TriviaQA)	AUROC	0.63	0.66	+0.03
NQ Open (Transfer from TriviaQA)	AUROC	0.68	0.74	+0.06

Comparison against the expensive sampling-based Semantic Entropy (the 'teacher' signal) on Llama-2-7B.

Benchmark	Metric	Baseline	This Paper	Δ
SQuAD (Transfer from TriviaQA)	AUROC	0.78	0.68	-0.10

Experiment Figures

Comparison of AUROC scores for different hallucination detection methods (Naive Entropy, Accuracy Probes, SEPs, Semantic Entropy) across multiple datasets, highlighting the OOD generalization gap.

Main Takeaways

Model hidden states encode semantic uncertainty even before generation begins (Token Before Generation probes perform well).
Probes trained to predict Semantic Entropy (SEPs) generalize better to new tasks than probes trained on correctness (accuracy probes).
Middle-to-late layers of the LLM generally contain the most information regarding semantic uncertainty.
SEPs offer a large computational advantage (near-zero marginal cost at test time, versus 5-10x for sampling) while retaining much of the performance of sampling-based methods (e.g., 0.68 vs. 0.78 AUROC for SQuAD transfer on Llama-2-7B).

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) architecture
Linear Probing / Linear Classifiers
Uncertainty Quantification / Entropy
Natural Language Inference (NLI) for semantic equivalence

Key Terms


Semantic Entropy (SE): An uncertainty measure that clusters multiple model generations by meaning (using NLI) and calculates entropy over these semantic clusters

SEP: Semantic Entropy Probe—a linear classifier trained on LLM hidden states to predict the Semantic Entropy value

Linear Probe: A simple linear classifier (e.g., logistic regression) trained on the fixed features (hidden states) of a pre-trained model

NLI: Natural Language Inference—a task determining if one text entails (logically implies) another

DeBERTa: Decoding-enhanced BERT with disentangled attention—a transformer model often used for NLI tasks

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for binary classification problems

TBG: Token Before Generation—the hidden state at the last token of the input query

SLT: Second Last Token—the hidden state at the last token of the model response (before EOS)

Hallucination: Plausible-sounding but factually incorrect or arbitrary generation by an LLM