A Head to Predict and a Head to Question: Pre-trained Uncertainty Quantification Heads for Hallucination Detection in LLM Outputs

📝 Paper Summary

Uncertainty Quantification (UQ) Hallucination Detection

The paper introduces pre-trained Transformer-based Uncertainty Quantification (UQ) heads that attach to frozen LLMs, using attention maps and token probabilities to detect claim-level hallucinations more effectively than unsupervised methods.

Core Problem

LLMs frequently hallucinate convincing but false information, and users lack tools to detect these errors. Existing UQ methods are either unsupervised (weak performance) or supervised but rely on outdated architectures (linear probes/MLPs) and limited features.

Why it matters:

Hallucinations undermine trust in LLM applications, risking the spread of misleading information to users
Unsupervised methods struggle with the unbounded output space of text generation and with dependencies between tokens
Previous supervised attempts (SAPLMA, Factoscope) often fail to generalize across domains or require complex, inefficient feature engineering

Concrete Example: When an LLM generates a biography, it might hallucinate a specific date or award. The token probability for the year may be only slightly lowered, a weak signal that unsupervised methods often miss. The proposed UQ head, by analyzing attention patterns (e.g., the model attending to its own generations rather than the prompt), can flag the specific claim 'won the award in 1999' as uncertain.

Key Novelty

Transformer-based Uncertainty Quantification (UQ) Heads

Attaches a lightweight Transformer encoder to a frozen LLM to process internal states during generation without retraining the LLM itself
Leverages a specific combination of features: flattened attention maps (capturing how the model attends to prompt vs. generation) and top-token probabilities
Operates at the sub-sentence 'atomic claim' level rather than just flagging entire sequences

Architecture

The architecture of the Uncertainty Quantification (UQ) head attached to the LLM.

Evaluation Highlights

Achieves state-of-the-art performance in claim-level hallucination detection, outperforming Factoscope and LookbackLens across in-domain and out-of-domain prompts
Demonstrates strong cross-lingual generalization to languages not seen during training
Releases a collection of pre-trained heads for Llama, Gemma 2, and Mistral-v0.2 series models

Breakthrough Assessment

8/10

Significant practical contribution by releasing plug-and-play UQ heads that outperform existing methods. The shift to Transformer-based heads with attention features addresses key limitations of linear probes.

⚙️ Technical Details

Problem Definition

Setting: Claim-level uncertainty quantification for text generation

Inputs: Prompt x, generated tokens y, and a specific atomic claim c_i (subset of y)

Outputs: Uncertainty score U(c_i | x, y) in [0, 1] indicating likelihood of hallucination
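
The setting above can be written down as a small interface. This is a sketch only: the names `Claim`, `Generation`, and `flag_claims`, the 0.5 threshold, and the dummy scorer are hypothetical and stand in for a trained UQ head.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Claim:
    text: str              # e.g. "won the award in 1999"
    token_idx: List[int]   # positions of the claim's tokens within y


@dataclass
class Generation:
    prompt: List[str]      # x
    output: List[str]      # y
    claims: List[Claim]    # c_1, ..., c_n


def flag_claims(gen: Generation,
                score: Callable[[Generation, Claim], float],
                threshold: float = 0.5) -> List[Claim]:
    """Return the claims whose uncertainty U(c_i | x, y) exceeds the threshold."""
    flagged = []
    for claim in gen.claims:
        u = score(gen, claim)
        if not 0.0 <= u <= 1.0:
            raise ValueError("uncertainty score must lie in [0, 1]")
        if u > threshold:
            flagged.append(claim)
    return flagged


def dummy_score(gen: Generation, claim: Claim) -> float:
    # Placeholder scorer (assumption); a trained UQ head would go here.
    return 0.8 if "1999" in claim.text else 0.1


gen = Generation(
    prompt=["Who", "won", "?"],
    output=["She", "won", "the", "award", "in", "1999"],
    claims=[Claim("won the award", [1, 2, 3]), Claim("in 1999", [4, 5])],
)
flagged = flag_claims(gen, dummy_score)
```

Each claim is tied to token positions in y, so a score can be attached to a sub-sentence unit rather than to the whole sequence.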

Pipeline Flow

Feature Extraction (Attention Maps + Token Probabilities)
Feature Reduction & Embedding
Transformer Encoding
Classification

System Modules

Feature Extractor

Extracts raw internal signals from the frozen LLM for each generated token

Model or implementation: Deterministic extraction function

Reduction Network

Reduces dimensionality of raw features and adds claim-specific embeddings

Model or implementation: Linear layers + GELU + Dropout

Transformer Encoder

Processes the sequence of token features to capture dependencies

Model or implementation: Multi-layer Transformer Encoder

Classifier

Predicts the probability that the claim is a hallucination

Model or implementation: Two-layer classification MLP
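
The four modules above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the released implementation: the class name `UQHeadSketch`, all layer sizes, the in-claim flag embedding, and the masked mean pooling over claim tokens are assumptions made for the example.

```python
import torch
import torch.nn as nn


class UQHeadSketch(nn.Module):
    """Illustrative UQ head: reduction -> Transformer encoder -> MLP classifier."""

    def __init__(self, feat_dim, d_model=64, n_heads=4, n_layers=2, dropout=0.1):
        super().__init__()
        # Reduction network: linear layer + GELU + dropout over raw per-token features.
        self.reduce = nn.Sequential(
            nn.Linear(feat_dim, d_model), nn.GELU(), nn.Dropout(dropout)
        )
        # Assumed claim-specific embedding: index 1 marks tokens inside the claim.
        self.claim_emb = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, activation="gelu", batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Two-layer classification MLP producing one logit per claim.
        self.classifier = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(d_model, 1),
        )

    def forward(self, feats, claim_mask):
        # feats: (batch, seq_len, feat_dim) float; claim_mask: (batch, seq_len) bool.
        h = self.reduce(feats) + self.claim_emb(claim_mask.long())
        h = self.encoder(h)
        # Assumed pooling: average encoder outputs over the claim's tokens.
        m = claim_mask.unsqueeze(-1).to(h.dtype)
        pooled = (h * m).sum(dim=1) / m.sum(dim=1).clamp(min=1.0)
        # Sigmoid maps the logit to an uncertainty score in [0, 1].
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)


head = UQHeadSketch(feat_dim=32).eval()   # eval() disables dropout
feats = torch.randn(2, 10, 32)            # stand-in for features from a frozen LLM
claim_mask = torch.zeros(2, 10, dtype=torch.bool)
claim_mask[:, 3:6] = True                 # tokens 3..5 form the claim
with torch.no_grad():
    scores = head(feats, claim_mask)
```

Only the head's parameters would be trained in this setup; the LLM that produces `feats` stays frozen.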

Novel Architectural Elements

Use of a Transformer encoder as the backbone for the UQ head (replacing MLP/linear probes)
Specific feature combination: Flattened raw attention maps (previous k tokens) concatenated with top-m log-probabilities, avoiding the information loss that LookbackLens incurs by aggregating attention into ratios
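
To make the feature combination concrete, the sketch below builds a per-token feature vector from toy inputs. The function name `token_features`, the `[layer][head][query][key]` layout of the attention maps, the choice to exclude the token's attention to itself, and the zero-padding for tokens with fewer than k predecessors are all assumptions for illustration; the paper's exact layout may differ.

```python
import math


def token_features(attn, probs, t, k=3, m=2):
    """Feature vector for token t.

    attn[l][h][t][j]: attention weight from token t to token j in layer l, head h.
    probs[t]: output probability distribution at position t (list of floats).
    """
    feats = []
    for layer in attn:
        for head in layer:
            # Attention from token t to each of its k previous tokens, nearest first.
            for offset in range(1, k + 1):
                j = t - offset
                feats.append(head[t][j] if j >= 0 else 0.0)  # zero-pad (assumption)
    # Top-m log-probabilities of the output distribution at position t.
    top = sorted(probs[t], reverse=True)[:m]
    feats.extend(math.log(p) for p in top)
    return feats


# Toy example: 1 layer, 2 heads, 3 tokens (causal attention rows sum to 1).
attn = [[
    [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.2, 0.3, 0.5]],
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 0.7, 0.2]],
]]
probs = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
x = token_features(attn, probs, t=2, k=2, m=2)
# Length is layers * heads * k + m = 1 * 2 * 2 + 2 = 6.
```

The feature dimension grows linearly with k, which is why a small window keeps the head's input compact.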

Modeling

Base Model: UQ Heads trained for Mistral-v0.2, Llama series, and Gemma 2

Training Method: Supervised training of the auxiliary UQ head on labeled hallucination data

Objective Functions:

Purpose: Minimize classification error for hallucination detection.

Formally: Binary Cross-Entropy loss.
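
For reference, binary cross-entropy for a predicted hallucination probability p and a label y in {0, 1} is -(y log p + (1 - y) log(1 - p)), averaged over claims. A small worked sketch follows; the function name, the label convention (1 = hallucinated), and the clipping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def bce(preds, labels, eps=1e-12):
    """Mean binary cross-entropy; label 1 = hallucinated claim, 0 = supported."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(preds)


# A confident correct prediction costs little; a confident wrong one costs a lot.
low = bce([0.9], [1])    # -log(0.9)
high = bce([0.9], [0])   # -log(0.1)
```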

Adaptation: Auxiliary head training (LLM is frozen)

Trainable Parameters: Parameters of the UQ head (Reduction Network, Transformer Encoder, Classifier)

Training Data:

Automatic pipeline for annotation of hallucinations in LLM outputs to create large-scale training data

Key Hyperparameters:

attention_window_k: 2 to 5 (empirically found optimal)
dropout: Used in all components
activation: GELU

Compute: Small memory and computational footprint compared to the base LLM

Comparison to Prior Work

vs. SAPLMA/Factoscope: Uses Transformer architecture instead of MLP/Linear; relies on attention maps rather than just hidden states
vs. LookbackLens: Uses raw flattened attention maps and a Transformer encoder rather than aggregated ratios and a linear classifier, preserving more information
vs. Unsupervised: Supervised training on native LLM responses allows learning complex patterns of uncertainty
vs. CH-Wang et al. (2024): Focuses on atomic claims rather than spans; uses Transformer backbone instead of CNN/GRU

Limitations

Requires annotated hallucination data for training (though an automatic pipeline is used)
The additional UQ head adds some inference cost, though the paper claims the overhead is small
Attention features increase input dimension size, though limited by small k window

Reproducibility

Code: https://github.com/IINemo/llm-uncertainty-head

Code and pre-trained heads are publicly released in the repository above. The paper details the feature extraction method and the architecture, and describes the data-generation pipeline as automatic.

📊 Experiments & Results

Evaluation Setup

Claim-level hallucination detection on LLM outputs

Benchmarks:

In-domain validation (Hallucination detection on same domain as training) [New]
Out-of-domain validation (Hallucination detection on unseen domains/prompts) [New]
Cross-lingual validation (Generalization to unseen languages) [New]

Metrics:

AUROC (Area Under the Receiver Operating Characteristic curve)
Statistical methodology: Not explicitly reported in the paper
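
AUROC can be read as the probability that a randomly chosen hallucinated claim receives a higher uncertainty score than a randomly chosen correct one, with ties counted as half. The sketch below computes it by direct pairwise comparison, which is quadratic in the number of claims but easy to check by hand; the function name and toy scores are illustrative.

```python
def auroc(scores, labels):
    """Pairwise AUROC; labels: 1 = hallucination, 0 = correct claim."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3]   # uncertainty scores from a detector
labels = [1, 0, 1, 0]           # ground-truth hallucination labels
value = auroc(scores, labels)   # 3 of 4 positive/negative pairs ordered correctly
```

A value of 0.5 corresponds to a detector that ranks claims no better than chance, and 1.0 to one that ranks every hallucinated claim above every correct one.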

Key Results

No numeric results table is available in the extracted text. The paper states qualitatively that the UQ heads 'achieve state-of-the-art performance... outperforming other supervised and unsupervised techniques'.

Main Takeaways

Transformer-based UQ heads significantly outperform unsupervised baselines and previous supervised methods (like SAPLMA and Factoscope) in detecting hallucinations.
Attention maps are identified as the most informative feature source for uncertainty, more so than hidden states, which may overfit to the training domain.
The optimal lookback window for attention features is small (k=2 to 5), suggesting local context is sufficient for the Transformer head to extract uncertainty patterns.
The method generalizes well to out-of-domain prompts and even to languages not seen during training.
An automatic pipeline for generating training data from native LLM responses is effective for scaling up supervision.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (attention mechanisms)
Large Language Model internals (hidden states, logits, attention maps)
Binary classification metrics (AUROC)

Key Terms

UQ head: A small auxiliary neural network trained to predict uncertainty/hallucinations based on the internal states of a larger, frozen LLM

Atomic claim: A sub-sentence unit of information (e.g., a single fact like 'born in 1990') extracted from a longer generation

Lookback ratio: A feature used by LookbackLens that measures, per attention head, the share of attention placed on prompt tokens relative to the combined attention on prompt tokens and previously generated tokens

Unembedding matrix: The final linear layer of an LLM that projects hidden states back to the vocabulary size to predict token probabilities

GELU: Gaussian Error Linear Unit—an activation function used in the UQ head's neural network layers
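
As an illustration of the lookback ratio defined in the key terms, the sketch below computes it for one attention head at one decoding step. It assumes the averaged form: mean attention on prompt tokens divided by the sum of that mean and the mean attention on previously generated tokens. The function name and toy weights are made up for the example.

```python
def lookback_ratio(attn_row, n_prompt):
    """Lookback ratio for one head at one decoding step.

    attn_row: attention weights over all earlier positions (prompt first).
    n_prompt: number of prompt tokens at the start of attn_row.
    """
    prompt, generated = attn_row[:n_prompt], attn_row[n_prompt:]
    if not prompt or not generated:
        raise ValueError("need at least one prompt and one generated token")
    a_prompt = sum(prompt) / len(prompt)       # mean attention on the prompt
    a_gen = sum(generated) / len(generated)    # mean attention on generated tokens
    return a_prompt / (a_prompt + a_gen)


# 4 prompt tokens followed by 2 generated tokens.
row = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
ratio = lookback_ratio(row, n_prompt=4)  # roughly 0.1 / (0.1 + 0.3)
```

A low ratio means the head is attending mostly to the model's own earlier output rather than to the prompt. The UQ head described here feeds raw attention weights to its encoder instead of collapsing them into this single number.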