(SAR) Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form LLMs

📝 Paper Summary

Uncertainty Quantification (UQ), Hallucination Detection

SAR improves uncertainty estimation in LLMs by down-weighting linguistic redundancy (irrelevant tokens like 'the', 'of') and up-weighting semantically relevant components at both token and sentence levels.

Core Problem

Standard uncertainty metrics (like Predictive Entropy) treat all tokens equally, even though 'irrelevant' tokens (e.g., articles, prepositions) often dominate the uncertainty calculation despite having little semantic meaning.

Why it matters:

High uncertainty on irrelevant tokens (e.g., 'of' in 'density of an object') can falsely trigger refusal or low confidence scores even when the model knows the core answer
Existing methods underestimate 'generative inequality'—the fact that a few keywords convey the essence of a long sentence while linguistic redundancy dilutes uncertainty measurements
Accurate UQ is critical for high-stakes Human-AI interaction (e.g., medical Q&A) where users must know when to trust free-form model outputs

Concrete Example: For the question 'What is the ratio of mass to volume?', the model generates 'density of an object'. The token 'of' might have high entropy (uncertainty), which inflates the total score and suggests that the model is uncertain, even though the core token 'density' is correct and predicted with confidence.

Key Novelty

Shifting Attention to Relevance (SAR)

Token-level shifting: Calculate how much the meaning changes if a token is removed; use this 'relevance score' to re-weight the entropy contribution of each token
Sentence-level shifting: Reduce the uncertainty estimate for sentences that are semantically similar to other generated samples, assuming consistency implies correctness
Joint optimization: Combine both shifting strategies to focus uncertainty quantification on the semantically loaded parts of the generation
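The token-level shifting step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_similarity` (a content-word Jaccard overlap) stands in for the learned similarity model, `STOPWORDS` is a hypothetical list, and the log-probabilities are made-up values. It is not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical stopword list, used only by the toy similarity below.
STOPWORDS = {"what", "is", "the", "of", "to", "an", "a"}

def toy_similarity(a: List[str], b: List[str]) -> float:
    """Jaccard overlap of content words; a stand-in for a learned similarity model."""
    ca = {w for w in a if w not in STOPWORDS}
    cb = {w for w in b if w not in STOPWORDS}
    if not ca and not cb:
        return 1.0
    return len(ca & cb) / len(ca | cb)

def token_relevance(prompt: List[str], tokens: List[str],
                    similarity: Callable[[List[str], List[str]], float]) -> List[float]:
    """Relevance of token i = 1 - similarity(text with token i, text without it)."""
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(1.0 - similarity(prompt + tokens, prompt + reduced))
    return scores

def relevance_weighted_nll(token_logprobs: List[float], relevance: List[float]) -> float:
    """Sum of negative log-probs, each weighted by its normalized relevance."""
    total = sum(relevance)
    if total == 0:
        # Fall back to uniform weights if no token changes the meaning.
        weights = [1.0 / len(relevance)] * len(relevance)
    else:
        weights = [r / total for r in relevance]
    return sum(-lp * w for lp, w in zip(token_logprobs, weights))

prompt = "what is the ratio of mass to volume".split()
answer = "density of an object".split()
logprobs = [-0.1, -2.0, -0.5, -0.2]  # made-up values; 'of' is the least confident token

relevance = token_relevance(prompt, answer, toy_similarity)
plain_nll = -sum(logprobs)
shifted_nll = relevance_weighted_nll(logprobs, relevance)
print(relevance, plain_nll, shifted_nll)
```

In this toy run 'of' and 'an' receive zero relevance, so the low-confidence token 'of' no longer contributes to the shifted score. The two scores are on different scales (the weights sum to 1), so they should be compared as rankings across answers rather than as raw values.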

Architecture

Comparison between standard Predictive Entropy and SAR on a specific example ('density of an object'). The figure shows how standard UQ accumulates high uncertainty from the token 'of', while SAR suppresses it.

Evaluation Highlights

+11.9 percentage-point AUROC improvement on TriviaQA using Vicuna-13b compared to the Semantic Entropy (SE) baseline
+7.1 percentage-point average AUROC improvement over Semantic Entropy across multiple datasets and models (Vicuna, WizardLM, LLaMA-2-chat)
Achieves superior performance with only 5 generations compared to baselines requiring more samples, demonstrating higher generation efficiency

Breakthrough Assessment

7/10

A simple yet effective heuristic that addresses a fundamental flaw in how token-level entropy is aggregated. The empirical gains are significant, though the method relies on auxiliary models for similarity.

⚙️ Technical Details

Problem Definition

Setting: Uncertainty quantification for free-form auto-regressive Large Language Models

Inputs: Input prompt x and a set of generated sentences S = {s1, ..., sK}

Outputs: A single scalar uncertainty score for the generation

Pipeline Flow

Generation (Generate K sentences)
Relevance Calculation (Token & Sentence level)
Attention Shifting (Re-weighting)
Uncertainty Aggregation

System Modules

Generator

Generate candidate answers

Model or implementation: Target LLM (e.g., Vicuna, LLaMA-2-chat)

Token Relevance Scorer (Relevance Calculation)

Measure semantic importance of each token

Model or implementation: Cross-Encoder (RoBERTa-large)

Sentence Relevance Scorer (Relevance Calculation)

Measure semantic consistency of a sentence with others

Model or implementation: Sentence Similarity Model (DistilRoBERTa)

Uncertainty Aggregator

Compute final uncertainty score using shifted attention

Model or implementation: Mathematical Formula (SAR)
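The aggregation formula can be written out as follows. This is a reconstruction from the module descriptions in this summary, not a verbatim copy of the paper's equations: the symbols g (similarity model), z_i (the i-th token of sentence s), N (number of tokens in s), and the exact placement of the temperature t are notational assumptions.

```latex
% Token relevance: how much the meaning changes when token z_i is removed
R_T(z_i, s, x) = 1 - \left| g\left(x \cup s,\; x \cup s \setminus \{z_i\}\right) \right|

% Normalized relevance and token-shifted uncertainty of one sentence
\tilde{R}_T(z_i, s, x) = \frac{R_T(z_i, s, x)}{\sum_{n=1}^{N} R_T(z_n, s, x)}
\qquad
\mathrm{TokenSAR}(s, x) = \sum_{i=1}^{N} -\log p(z_i \mid s_{<i}, x)\, \tilde{R}_T(z_i, s, x)

% Sentence relevance: similarity-weighted probability mass of the other samples
R_S(s_j, S, x) = \sum_{k \neq j} g(s_j, s_k)\, p(s_k \mid x)

% Sentence-shifted uncertainty over the K generations, with temperature t
\mathrm{SentSAR}(S, x) = \frac{1}{K} \sum_{j=1}^{K} -\log\left( p(s_j \mid x) + \frac{1}{t}\, R_S(s_j, S, x) \right)

% Joint form: plug the token-shifted sentence probability into SentSAR
p'(s_j \mid x) = \exp\left(-\mathrm{TokenSAR}(s_j, x)\right)
```

Under this reading, a small t (such as the 0.001 listed under Key Hyperparameters) makes the consistency term dominate the raw sentence probability.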

Novel Architectural Elements

Relevance-weighted entropy formulation: Modifies standard entropy calculation by multiplying log-probs with normalized semantic relevance scores
Soft-consistency shifting: Uses continuous semantic similarity to adjust sentence probability, unlike hard clustering in Semantic Entropy
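A minimal sketch of soft-consistency shifting, assuming sentence probabilities and a pairwise similarity matrix are already available. The function name `sentence_shifted_uncertainty` and all numbers are illustrative assumptions; the formula follows the description above rather than the authors' code.

```python
import math
from typing import List, Sequence

def sentence_shifted_uncertainty(sentence_probs: Sequence[float],
                                 similarity: Sequence[Sequence[float]],
                                 t: float = 0.001) -> float:
    """Average shifted negative log-probability over K sampled sentences.

    Each sentence's probability is boosted by the probability mass of the
    other samples, weighted by a continuous similarity score and scaled by 1/t.
    """
    k = len(sentence_probs)
    per_sentence: List[float] = []
    for j in range(k):
        relevance = sum(similarity[j][i] * sentence_probs[i]
                        for i in range(k) if i != j)
        per_sentence.append(-math.log(sentence_probs[j] + relevance / t))
    return sum(per_sentence) / k

probs = [0.2, 0.2, 0.2]                  # made-up sentence probabilities
consistent = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]
scattered = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]]
unrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

u_consistent = sentence_shifted_uncertainty(probs, consistent)
u_scattered = sentence_shifted_uncertainty(probs, scattered)
u_unrelated = sentence_shifted_uncertainty(probs, unrelated)
print(u_consistent, u_scattered, u_unrelated)
```

Because the similarity values are continuous, no clustering threshold is needed. The shifted value can be negative when the boosted term exceeds 1, so it is best read as a ranking score rather than a calibrated entropy.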

Modeling

Base Model: Evaluated on OPT (2.7b to 30b), LLaMA (7b to 30b), Vicuna (13b, 33b), WizardLM (13b), LLaMA-2-chat (13b)

Training Method: Inference-time method

Key Hyperparameters:

temperature_t: 0.001 (scaling factor for sentence shifting)
number_of_generations_K: 5 or 10 (depending on model/experiment)
generation_temperature: 0.5

Compute: Requires K forward passes for generation + N semantic similarity checks (using RoBERTa-large/DistilRoBERTa). Inference latency reported as ~2.64s for 2-gen SAR vs 5.28s for 5-gen PE.
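To make the overhead concrete, the sketch below counts auxiliary similarity calls under one plausible reading of the method: one call per generated token for token relevance, and one call per sentence pair for sentence relevance. The function `auxiliary_calls` and the `symmetric` flag are hypothetical, and the summary does not define N, so this count is an assumption rather than the paper's figure.

```python
from typing import List, Tuple

def auxiliary_calls(token_counts: List[int], symmetric: bool = False) -> Tuple[int, int]:
    """Return (token-level calls, sentence-level calls) for K generations.

    token_counts[i] is the number of tokens in the i-th generated sentence.
    If the similarity model is symmetric, each unordered pair is scored once.
    """
    k = len(token_counts)
    token_calls = sum(token_counts)          # one removal comparison per token
    pair_calls = k * (k - 1) // 2 if symmetric else k * (k - 1)
    return token_calls, pair_calls

# Five generations of four tokens each (illustrative numbers).
calls_ordered = auxiliary_calls([4, 4, 4, 4, 4])
calls_symmetric = auxiliary_calls([4, 4, 4, 4, 4], symmetric=True)
print(calls_ordered, calls_symmetric)
```

The sentence-level term grows quadratically in K, which is one reason a method that works well with few generations is attractive.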

Comparison to Prior Work

vs. Predictive Entropy (PE) and length-normalized PE (LN-PE): SAR re-weights tokens by semantic relevance, down-weighting 'filler' words instead of counting every token equally
vs. Semantic Entropy (SE): SAR uses 'soft' similarity scores and token-level granularity, whereas SE uses binary entailment clustering and operates only at the sentence level
vs. SelfCheckGPT [not cited in paper]: SelfCheckGPT checks consistency across samples; SAR similarly uses consistency but integrates it directly into the entropy formula via attention shifting

Limitations

Introduces computational overhead due to sentence similarity calculations (though it uses small backbones)
Requires access to token logits (white-box access), limiting use with some closed APIs
Performance depends on the quality of the auxiliary semantic similarity model (e.g., RoBERTa)

Reproducibility

Code: https://github.com/jinhaoduan/SAR

Code is publicly available on GitHub. The method uses off-the-shelf models from Hugging Face and standard datasets (CoQA, TriviaQA, etc.).

📊 Experiments & Results

Evaluation Setup

Free-form Question Answering on 5 datasets

Benchmarks:

CoQA (Conversational QA)
TriviaQA (Reading Comprehension)
SciQ (Science Q&A)
MedQA (Medical Q&A)
MedMCQA (Medical Q&A)

Metrics:

AUROC (Area Under the Receiver Operating Characteristic curve)
Statistical methodology: Not explicitly reported in the paper
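For reference, AUROC for an uncertainty score can be computed with a short rank-based routine. This sketch assumes that incorrect answers are the positive class (high uncertainty should flag them); the function name `uncertainty_auroc` and the sample values are illustrative.

```python
from typing import Sequence

def uncertainty_auroc(uncertainty: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Probability that a random incorrect answer has higher uncertainty
    than a random correct answer (ties count as one half)."""
    wrong = [u for u, c in zip(uncertainty, is_correct) if not c]
    right = [u for u, c in zip(uncertainty, is_correct) if c]
    if not wrong or not right:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))

labels = [False, False, True, True, True]
perfect = uncertainty_auroc([0.9, 0.8, 0.3, 0.2, 0.6], labels)
chance = uncertainty_auroc([0.9, 0.1, 0.3, 0.2, 0.6], labels)
print(perfect, chance)
```

A value of 1.0 means every incorrect answer is ranked as more uncertain than every correct one; 0.5 is chance level.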

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on Instruction-Tuned LLMs (Vicuna, WizardLM, LLaMA-2-chat) showing SAR consistently outperforming baselines on TriviaQA and SciQ.
TriviaQA	AUROC	0.630	0.749	+0.119
SciQ	AUROC	0.675	0.741	+0.066
TriviaQA	AUROC	0.634	0.744	+0.110
TriviaQA	AUROC	0.622	0.704	+0.082
CoQA	AUROC	0.723	0.748	+0.025
Medical Domain Evaluation (MedQA, MedMCQA) showing robustness in specialized domains.
MedMCQA	AUROC	0.685	0.717	+0.032
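The Δ column reports absolute AUROC differences. The short check below recomputes them from the table values and, for the first row, contrasts the absolute gain with the relative gain, which is a different quantity. The variable names are illustrative.

```python
# (benchmark, baseline AUROC, SAR AUROC) copied from the table above.
rows = [
    ("TriviaQA", 0.630, 0.749),
    ("SciQ", 0.675, 0.741),
    ("TriviaQA", 0.634, 0.744),
    ("TriviaQA", 0.622, 0.704),
    ("CoQA", 0.723, 0.748),
    ("MedMCQA", 0.685, 0.717),
]

# Absolute differences, rounded to the table's three decimals.
deltas = [round(sar - base, 3) for _, base, sar in rows]

# Relative improvement of the first row, in percent of the baseline value.
first_base, first_sar = rows[0][1], rows[0][2]
relative_pct = round((first_sar - first_base) / first_base * 100, 1)

print(deltas)
print(relative_pct)
```

For the first row the absolute gain is 0.119 (11.9 points), while the relative gain over the baseline is about 18.9%.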

Experiment Figures

Correlations between relevance scores and uncertainty proportions

AUROC comparisons of Token-SAR, Sent-SAR, and full SAR against baselines on OPT and LLaMA models

Main Takeaways

Irrelevant tokens and sentences contribute a significant share of the measured uncertainty; down-weighting them improves reliability
Token-level (Token-SAR) and sentence-level (Sent-SAR) shifting are complementary and achieve the best results when combined
SAR is generation-efficient: achieves better results with 2 generations than baselines do with 5
Consistent improvements across varied model architectures (OPT, LLaMA, Vicuna) and domains (General, Scientific, Medical)

📚 Prerequisite Knowledge

Prerequisites

Understanding of auto-regressive language generation
Information theory concepts (Entropy)
Semantic similarity embeddings

Key Terms

Predictive Entropy (PE): A measure of uncertainty estimated from the negative log-probabilities of generated tokens: the token-level values are summed within each sampled sequence and then averaged across samples
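A minimal sketch of this estimate, including the length-normalized variant (LN-PE) mentioned under Comparison to Prior Work. The function name `predictive_entropy` and the log-probabilities are illustrative assumptions.

```python
from typing import List

def predictive_entropy(samples: List[List[float]], length_normalize: bool = False) -> float:
    """Monte Carlo estimate: average sequence-level negative log-likelihood.

    samples[i] holds the token log-probabilities of the i-th generation.
    With length_normalize=True each sequence NLL is divided by its length.
    """
    nlls = []
    for token_logprobs in samples:
        nll = -sum(token_logprobs)
        if length_normalize:
            nll /= len(token_logprobs)
        nlls.append(nll)
    return sum(nlls) / len(nlls)

# Two made-up generations: a 4-token answer and a 2-token answer.
samples = [[-0.1, -2.0, -0.5, -0.2], [-0.3, -0.4]]
pe = predictive_entropy(samples)
ln_pe = predictive_entropy(samples, length_normalize=True)
print(pe, ln_pe)
```

Both variants give every token the same weight within a sequence, which is the behavior SAR is designed to change.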

Semantic Entropy (SE): An uncertainty metric that groups generations by meaning and calculates entropy over semantic clusters rather than raw text
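The clustering idea can be sketched as follows. This is one simple variant, assuming cluster assignments are given and normalizing the probability mass over clusters; the estimator used in the Semantic Entropy paper may differ in detail, and the name `semantic_entropy` is illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence

def semantic_entropy(sentence_probs: Sequence[float],
                     cluster_ids: Sequence[Hashable]) -> float:
    """Entropy over meaning clusters: probabilities of sentences in the
    same cluster are pooled before the entropy is computed."""
    mass: Dict[Hashable, float] = defaultdict(float)
    for p, c in zip(sentence_probs, cluster_ids):
        mass[c] += p
    total = sum(mass.values())
    entropy = 0.0
    for m in mass.values():
        if m > 0:
            q = m / total            # normalized cluster probability
            entropy -= q * math.log(q)
    return entropy

probs = [0.25, 0.25, 0.25, 0.25]     # made-up sentence probabilities
same_meaning = semantic_entropy(probs, ["a", "a", "a", "a"])
all_different = semantic_entropy(probs, ["a", "b", "c", "d"])
print(same_meaning, all_different)
```

Four paraphrases of one answer yield zero entropy, while four distinct meanings yield log 4. The cluster assignment itself is a hard, binary decision, which is the contrast with SAR's soft similarity scores.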

Generative Inequality: The observation that tokens contribute unequally to the semantic meaning of a sentence, yet standard metrics treat them equally

AUROC: Area Under the Receiver Operating Characteristic curve—a metric used here to measure how well the uncertainty score distinguishes between correct and incorrect answers

Relevance Score: A metric quantifying a token's importance by measuring the semantic shift in a sentence when that token is removed

Aleatoric Uncertainty: Uncertainty arising from inherent randomness or noise in the data

Epistemic Uncertainty: Uncertainty arising from a lack of knowledge in the model parameters