Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

📝 Paper Summary

Hallucination suppression, Uncertainty Quantification (UQ)

The paper proposes a novel uncertainty quantification method, Claim-Conditioned Probability (CCP), which detects factual errors in LLM outputs by measuring uncertainty about the specific claim while ignoring irrelevant uncertainty about wording or claim order.

Core Problem

LLMs frequently hallucinate convincing but false claims, and existing fact-checking methods are either computationally expensive (requiring external knowledge/models) or imprecise because they conflate uncertainty about facts with uncertainty about wording/style.

Why it matters:

Hallucinations are dangerous because occasional falsehoods are obscured by mostly correct text, making them hard for users to spot
Standard uncertainty metrics (like entropy) are noisy because high uncertainty can result from harmless choices (e.g., synonyms) rather than factual ignorance
Reliance on external databases for verification introduces latency, storage overhead, and issues with incomplete knowledge sources

Concrete Example: When generating a biography, a model might be uncertain whether to say 'studied art' or 'studied painting' (harmless surface form uncertainty), or it might be uncertain whether the person studied 'art' or 'physics' (factual claim uncertainty). Standard entropy treats both as high uncertainty, flagging correct text as unreliable, whereas the proposed method distinguishes them.
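To make the example concrete, the following minimal sketch computes Shannon entropy for two hypothetical next-token distributions after "She studied ...". The probabilities are invented for illustration and do not come from the paper. It shows that entropy alone cannot tell the two situations apart.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical distributions (illustrative numbers, not from the paper).
surface_form = {"art": 0.5, "painting": 0.5}  # same claim, different wording
factual = {"art": 0.5, "physics": 0.5}        # two different claims

h_surface = entropy(surface_form.values())
h_factual = entropy(factual.values())

# Both values are identical: entropy sees only the probabilities,
# not whether the alternatives change the meaning.
print(h_surface, h_factual)
```

Because both distributions have the same shape, any metric computed from the probabilities alone assigns them the same uncertainty; separating them requires some notion of meaning, which is the role the NLI model plays in CCP.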

Key Novelty

Claim-Conditioned Probability (CCP)

Quantifies uncertainty by checking if high-probability alternative tokens change the *meaning* of the sentence, rather than just the wording
Uses a lightweight Natural Language Inference (NLI) model to compare the original generation against versions where specific words are swapped with their top alternatives
Isolates 'Claim Uncertainty' (factual errors) by explicitly removing 'Surface Form Uncertainty' (synonyms) and 'Claim Order Uncertainty' (reordering of true facts)
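The idea above can be sketched for a single token position as follows. This is one reading of the description, not a reproduction of the paper's Eq. 7: the function name `ccp_word`, the label strings, and the choice to drop neutral alternatives from the normalization are assumptions made for illustration.

```python
def ccp_word(alternatives):
    """Word-level score in [0, 1]; higher means more confidence in the claim.

    alternatives: list of (probability, nli_label) pairs for the top candidate
    tokens at one position, including the generated token itself (which is
    labeled 'entail'). Labels are relative to the generated token:
    'entail', 'contradict', or 'neutral'.
    """
    entail = sum(p for p, label in alternatives if label == "entail")
    contra = sum(p for p, label in alternatives if label == "contradict")
    denom = entail + contra  # assumption: neutral alternatives are ignored
    return entail / denom if denom > 0 else 1.0

# Hypothetical alternatives at the position after "She studied ...".
synonym_case = [(0.5, "entail"), (0.4, "entail"), (0.1, "neutral")]
factual_case = [(0.5, "entail"), (0.4, "contradict"), (0.1, "neutral")]

print(ccp_word(synonym_case))  # probability mass agrees on the meaning
print(ccp_word(factual_case))  # mass is split between conflicting meanings
```

In the synonym case all meaning-bearing mass supports the generated token, so the score stays at its maximum; in the factual case a large share of the mass contradicts it, so the score drops.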

Architecture

The complete fact-checking pipeline using token-level uncertainty quantification

Evaluation Highlights

CCP achieves the highest AUC-ROC (0.81 with Llama-2-7b-chat) for detecting hallucinations in biographies, outperforming the P(True) and Neg Perplexity baselines
Consistent performance across 4 languages (English, Chinese, Arabic, Russian) and 7 LLMs, often surpassing Maximum Probability by margins of ~0.05-0.10 AUC
Human evaluation confirms CCP-based fact-checking is competitive with FactScore (which uses external Wikipedia retrieval), achieving similar precision without external access

Breakthrough Assessment

7/10

Strong methodological contribution in distinguishing types of uncertainty for white-box models. While it relies on an NLI model, it removes the need for external knowledge bases, offering a self-contained solution for hallucination detection.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc uncertainty quantification for autoregressive LLM generations to detect factual errors (hallucinations)

Inputs: An input prompt x and a generated response y from a white-box LLM

Outputs: A scalar uncertainty score for each atomic claim in y, indicating the likelihood of hallucination

Pipeline Flow

Text Generation (LLM generates biography)
Claim Extraction (GPT-3.5 splits text into atomic claims)
Token-level UQ (Calculate CCP using NLI on token alternatives)
Aggregation (Combine token scores into claim scores)
Thresholding (Flag claims as unreliable if uncertainty > threshold)
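The last two steps of the flow can be sketched as below. The aggregation rule used here (one minus the product of word-level scores) and the threshold value are assumptions for illustration; the summary does not state how the paper combines token scores. All names and numbers are hypothetical.

```python
from math import prod

def claim_uncertainty(word_ccps):
    """Combine word-level confidence scores into one claim-level uncertainty.

    Assumption: uncertainty = 1 - product of word-level scores, so a single
    low-confidence word is enough to raise the claim's uncertainty.
    """
    return 1.0 - prod(word_ccps)

def flag_claims(claims, threshold):
    """Return the text of every claim whose uncertainty exceeds the threshold."""
    return [text for text, ccps in claims if claim_uncertainty(ccps) > threshold]

# Hypothetical claims with made-up word-level scores.
claims = [
    ("She studied art.", [1.0, 1.0, 0.95]),
    ("She was born in 1990.", [1.0, 1.0, 1.0, 1.0, 0.4]),
]

flagged = flag_claims(claims, threshold=0.5)
print(flagged)
```

In practice the threshold would be tuned on labeled data, since it trades off missed hallucinations against correct claims that are flagged unnecessarily.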

System Modules

Generator

Generate the initial response (biography)

Model or implementation: Various (Llama-2, BLOOM, GLM, etc.)

Claim Extractor

Split generated text into atomic claims

Model or implementation: gpt-3.5-turbo-0613

NLI Evaluator (Uncertainty Estimation)

Determine if token alternatives change the meaning of the claim (Entail/Contradict/Neutral)

Model or implementation: cross-encoder/nli-deberta-v3-large (English), various for other languages
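A minimal sketch of how the English NLI model might be queried is shown below, assuming the `sentence-transformers` package and its `CrossEncoder` class. The helper names, the premise/hypothesis direction, and the example sentences are hypothetical, and the label order should be checked against the model card before use. This is not the paper's implementation.

```python
# Label order as listed on the model card for the cross-encoder NLI DeBERTa-v3
# models; treat this as an assumption and verify it before relying on it.
LABELS = ["contradiction", "entailment", "neutral"]

def label_from_scores(scores):
    """Return the label with the highest score for one sentence pair."""
    best = max(range(len(LABELS)), key=lambda i: scores[i])
    return LABELS[best]

def load_nli(model_name="cross-encoder/nli-deberta-v3-large"):
    """Load the NLI cross-encoder (downloads the weights on first use)."""
    # Imported lazily so label_from_scores works without the dependency.
    from sentence_transformers import CrossEncoder
    return CrossEncoder(model_name)

def classify_swap(model, original, swapped):
    """Label `swapped` as entailed by, contradicting, or neutral to `original`."""
    scores = model.predict([(original, swapped)])[0]
    return label_from_scores(scores)

# Usage (requires network access and the model weights):
# model = load_nli()
# classify_swap(model, "She studied art.", "She studied physics.")
```

Each token alternative requires one such comparison, so batching many pairs into a single `predict` call would likely be needed to keep the overhead manageable.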

Scorer (Uncertainty Estimation)

Calculate Claim-Conditioned Probability (CCP)

Model or implementation: Algorithm (Eq 7 in paper)

Novel Architectural Elements

Integration of NLI-based equivalence checking directly into the token probability calculation to filter semantic uncertainty from lexical uncertainty

Modeling

Base Models (evaluated): Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat, BLOOM-7b1, BLOOMZ-7b1-mt, GLM-10b-chinese, ruGPT-3.5-13B

Comparison to Prior Work

vs. FactScore: CCP requires no external knowledge base or retrieval step
vs. P(True): CCP does not require a second inference pass of the full model (though it uses a smaller NLI model)
vs. Semantic Entropy: CCP operates at the token level using local alternatives rather than sampling full sequence generations, and specifically targets claim-level factuality by removing surface-form and claim-order uncertainty [Semantic Entropy is not cited in the paper as a direct baseline, but it is methodologically distinct]

Limitations

Relies on the quality of the NLI model; poor NLI performance (especially in non-English languages) degrades CCP accuracy
Computational overhead from running NLI on multiple token alternatives for every word in the claim
Requires white-box access to the language model's probability distribution (inapplicable to API-only models like GPT-4)
Function words are assigned a heuristic CCP of 1, which might miss subtle errors in prepositions or determiners

Reproducibility

Code: https://github.com/IINemo/lm-polygraph

Code and data are publicly available at https://github.com/IINemo/lm-polygraph. The method uses language-specific NLI models (DeBERTa-v3-large for English, xlm-roberta-large-xnli for Arabic and Russian) and requires white-box access to LLM logits.

📊 Experiments & Results

Evaluation Setup

Fact-checking biographies generated by LLMs across 4 languages (English, Chinese, Arabic, Russian)

Benchmarks:

FactScore-Bio (Biography Generation and Fact Verification) [New]

Metrics:

AUC-ROC (Area Under Receiver Operating Characteristic)
Pearson Correlation (between uncertainty score and error rate)
Statistical methodology: Not explicitly reported in the paper
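For reference, AUC-ROC can be computed directly from its pairwise definition: the probability that a randomly chosen false claim receives a higher uncertainty score than a randomly chosen true claim. The sketch below uses made-up scores and labels, not data from the paper.

```python
def auc_roc(scores, labels):
    """AUC-ROC via pairwise comparison of positives and negatives.

    scores: uncertainty score per claim (higher = more likely hallucinated).
    labels: 1 if the claim is false (positive class), 0 if it is true.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical uncertainty scores and ground-truth labels.
uncertainty = [0.9, 0.2, 0.6, 0.7]
is_false = [1, 0, 1, 0]

print(auc_roc(uncertainty, is_false))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the AUC-ROC figures reported for CCP and its baselines. The pairwise form is quadratic in the number of claims; library implementations use a sort-based equivalent.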

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FactScore-Bio (English)	AUC-ROC	0.76	0.81	+0.05
FactScore-Bio (English)	AUC-ROC	0.73	0.79	+0.06
FactScore-Bio (Russian)	AUC-ROC	0.68	0.72	+0.04
FactScore-Bio (Chinese)	AUC-ROC	0.65	0.69	+0.04
FactScore-Bio (Arabic)	AUC-ROC	0.66	0.73	+0.07

Experiment Figures

A conceptual example of how CCP is calculated using NLI

Main Takeaways

CCP consistently outperforms baseline uncertainty measures (Max Prob, Entropy, P(True)) across all tested models and languages for detecting hallucinations.
The method is particularly effective because it filters out 'noise' from synonym selection (surface form uncertainty), focusing only on factual uncertainty.
Human evaluation suggests CCP is a viable alternative to external knowledge-based checkers like FactScore, providing similar reliability without the retrieval overhead.
Performance is robust across varying model sizes (7B to 70B) and languages, though dependent on the availability of a decent NLI model for the target language.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive language modeling
Natural Language Inference (NLI)
Information-theoretic uncertainty measures (Entropy, Perplexity)

Key Terms

Uncertainty Quantification (UQ): Methods to estimate how confident a model is in its own predictions, often used to predict correctness

CCP: Claim-Conditioned Probability—the proposed metric that measures the probability of a claim's meaning given the context, marginalizing over surface forms

NLI: Natural Language Inference—a task determining if one sentence entails, contradicts, or is neutral towards another

FactScore: An automatic evaluation metric that breaks text into atomic claims and verifies them against a knowledge base (like Wikipedia)

White-box model: A model where internal parameters and token probability distributions are accessible (unlike API-only black-box models)

Beam search: A decoding algorithm that explores multiple likely paths of token generation to find the most probable sequence

AUC-ROC: Area Under the Receiver Operating Characteristic Curve—a performance metric for classification tasks at various threshold settings

Surface form uncertainty: Uncertainty regarding which specific word (e.g., synonym) to use to express a concept, which does not affect factual correctness

Atomic claim: A simple, indivisible statement of fact extracted from a longer text (e.g., 'He was born in 1990')