Improve Decoding Factuality by Token-wise Cross Layer Entropy of Large Language Models

📝 Paper Summary

Hallucination suppression · Inference-time intervention

END is a decoding method that mitigates hallucinations by measuring how sharply the prediction probability of specific tokens grows across hidden layers, prioritizing those that reflect emerging factual knowledge.

Core Problem

LLMs often generate hallucinations even when they possess correct knowledge, and existing mitigation methods like training or external retrieval are computationally expensive.

Why it matters:

Hallucinations prevent LLM adoption in high-stakes industries requiring accuracy
Existing layer-contrast methods (like DoLa) assume a single 'best' layer for all tokens, but different factual tokens exhibit different growth trends across layers
Training-based solutions require high-quality data and compute that may not be accessible in many scenarios

Concrete Example: When generating the name 'Sun Yat-sen', the probability of the factual token 'Sun' grows sharply in higher layers, while common functional words remain stable. Methods that pick a fixed contrast layer might miss this token-specific spike or falsely amplify non-factual tokens.

Key Novelty

Cross-Layer Entropy Enhanced Decoding (END)

Instead of contrasting just two layers (like DoLa), END tracks the evolution of token probabilities across multiple upper layers for *each* candidate token individually
Calculates 'cross-layer entropy' to quantify the sharpness of this probability growth; a sharp trend indicates the token is factual knowledge being actively retrieved
Adjusts the final output distribution to boost tokens with low cross-layer entropy (high factual confidence) without any extra training
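
The idea in the list above can be written compactly. The following is one plausible formalization, not necessarily the paper's exact formula: the layer set \mathcal{L}, the normalization over layers, and the subtractive form of the adjustment are assumptions made for illustration, while P_N, V_head, and lambda follow the names used elsewhere in this summary.

```latex
% Assumed notation: P_j(x) is the probability of token x read out at layer j,
% and \mathcal{L} is a set of upper layers that ends at the final layer N.

% Cross-layer distribution of one candidate token x (normalized over layers)
\tilde{P}_j(x) = \frac{P_j(x)}{\sum_{i \in \mathcal{L}} P_i(x)}, \qquad j \in \mathcal{L}

% Cross-layer entropy: small when the probability mass sits in few layers
H(x) = -\sum_{j \in \mathcal{L}} \tilde{P}_j(x) \log \tilde{P}_j(x)

% Assumed log-linear adjustment of the final-layer prediction
\hat{s}(x) = \log P_N(x) - \lambda \, H(x), \qquad x \in V_{head}
```

Under this reading, a token whose probability only appears in the last few layers has a peaked cross-layer distribution and a small H(x), so it is penalized less than a token whose probability is roughly constant across layers.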

Architecture

The END decoding framework workflow. It illustrates extracting hidden states, forming cross-layer distributions for candidate tokens, calculating entropy, and re-ranking the final output.

Evaluation Highlights

12-21% relative improvement in Truth*Info scores on TruthfulQA open-ended generation compared to baselines like DoLa and greedy decoding (for example, 39.95 to 48.33 in Key Results, about 21% relative)
Achieves the highest MC1 (28.9) and MC2 (45.5) scores on TruthfulQA multiple-choice, surpassing DoLa and Inference-Time Intervention
Maintains or improves general QA performance (+10.1% relative accuracy on Natural Questions, 23.7 to 26.1) while reducing rejection rates in open-ended generation

Breakthrough Assessment

7/10

Strong empirical results on standard benchmarks and a logical extension of prior layer-contrast work. Being training-free makes it highly practical, though it builds heavily on existing insights about layer-wise knowledge activation.

⚙️ Technical Details

Problem Definition

Setting: Next-token prediction in autoregressive language models, specifically modifying the decoding probability distribution to favor factuality

Inputs: Context sequence of tokens up to time t

Outputs: Adjusted probability distribution over the vocabulary for the next token v_t

Pipeline Flow

Forward Pass (compute logits for all layers)
Candidate Filtering (select the candidate set V_head from the final-layer probability P_N using the threshold alpha)
Cross-Layer Distribution Construction (extract probs for candidates across upper layers)
Entropy Calculation (compute entropy of the layer-wise trend for each candidate)
Logit Adjustment (boost candidates with low entropy/sharp trends)
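
The five steps above can be sketched end to end on toy numbers. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `end_scores`, the filter rule (keep tokens whose final-layer probability is at least alpha times the maximum), the normalization over layers, and the subtractive adjustment are all hypothetical choices made so the example runs.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def end_scores(layer_logits, alpha=0.1, lam=1.0):
    """Toy END-style rescoring (hypothetical helper, not the paper's code).

    layer_logits: array of shape (num_layers, vocab) with the logits read out
    from each upper layer; the last row is the final layer.
    Returns one score per vocabulary token (-inf for filtered tokens).
    """
    probs = softmax(layer_logits, axis=-1)           # 1. per-layer probabilities
    final = probs[-1]
    keep = final >= alpha * final.max()              # 2. ASSUMED filter rule
    cand = probs[:, keep]                            # 3. (num_layers, num_candidates)
    cross = cand / cand.sum(axis=0, keepdims=True)   #    normalize over layers per token
    log_cross = np.log(np.clip(cross, 1e-12, None))
    entropy = -(cross * log_cross).sum(axis=0)       # 4. cross-layer entropy
    scores = np.full(final.shape, -np.inf)
    scores[keep] = np.log(final[keep]) - lam * entropy  # 5. ASSUMED adjustment
    return scores

# Token 0 rises sharply in the last layer; token 1 is high in every layer.
layer_logits = np.array([
    [0.0, 2.0, 0.0, -5.0],
    [0.0, 2.0, 0.0, -5.0],
    [1.9, 2.0, 0.0, -5.0],
])
scores = end_scores(layer_logits, alpha=0.1, lam=1.0)
print("greedy pick:", int(np.argmax(layer_logits[-1])))
print("rescored pick:", int(np.argmax(scores)))
```

In this toy case the final layer alone slightly prefers token 1, while the rescoring prefers token 0 because its probability is concentrated in the last layer.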

System Modules

Candidate Filter

Selects a subset of tokens V_head to process, improving efficiency

Model or implementation: Rules based on final layer probability P_N

Cross-Layer Profiler

Constructs a probability distribution over layers for each candidate token to observe its growth trend

Model or implementation: Mathematical projection

Entropy Adjuster

Modifies the final prediction logits based on the calculated cross-layer entropy

Model or implementation: Log-linear adjustment formula

Novel Architectural Elements

Token-wise cross-layer entropy mechanism: Dynamically quantifying 'factuality' per token by analyzing the sharpness of its probability evolution across layers, rather than using a static contrast layer

Modeling

Base Model: Llama-2-7B-chat

Training Method: Inference-time decoding intervention (no training)

Key Hyperparameters:

filter_threshold_alpha: [0.001, 0.1]
entropy_coefficient_lambda_open_ended: [1, 3]
entropy_coefficient_lambda_mc_qa: [0.25, 0.5]
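
To give a feel for what the lambda ranges above control, the sketch below sweeps lambda over a two-token toy case. The probabilities and entropies are made-up numbers, and the scoring rule (final-layer log-probability minus lambda times cross-layer entropy) is an assumed form of the adjustment, so this only illustrates the direction of the effect.

```python
import math

def adjusted_scores(final_probs, entropies, lam):
    """Assumed log-linear adjustment: log P_N(x) - lam * H(x)."""
    return [math.log(p) - lam * h for p, h in zip(final_probs, entropies)]

# Hypothetical values: token 0 has a sharper cross-layer trend (lower entropy),
# token 1 has the higher final-layer probability.
final_probs = [0.40, 0.60]
entropies = [0.55, 1.05]

winners = []
for lam in (0.25, 0.5, 1.0, 3.0):
    s = adjusted_scores(final_probs, entropies, lam)
    winners.append(max(range(len(s)), key=s.__getitem__))
    print(f"lambda={lam}: scores={[round(v, 3) for v in s]} pick={winners[-1]}")
```

With these numbers the smaller coefficients leave the final-layer ranking unchanged, and the larger ones let the entropy term override it.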

Compute: Inference only; overhead reduced by filtering candidate tokens

Comparison to Prior Work

vs. DoLa: END analyzes the *trend* across multiple layers for *each* token, whereas DoLa picks one contrast layer for the whole step
vs. ITI: END requires no probe training or internal activation editing, just decoding adjustment
vs. CD (Contrastive Decoding) [not cited in paper]: END uses internal layers of the same model rather than an external amateur/expert model pair

Limitations

Computational overhead of calculating entropy for multiple tokens across multiple layers at every step (mitigated by filtering)
Performance degrades on smaller/weaker models (e.g., disruptive behavior on Mistral-7B-v0.1) that lack robust intrinsic predictions
Requires careful tuning of hyperparameters (lambda) which vary significantly between open-ended and multiple-choice tasks

Reproducibility

Code: https://github.com/Arcade-Master/END

The paper details hyperparameter ranges for different tasks (alpha, lambda) and uses standard Llama-2 models and public benchmarks.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on hallucination and QA benchmarks

Benchmarks:

TruthfulQA (Hallucination evaluation; multiple choice and generation)
FACTOR (Factuality in long-context reading comprehension)
Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)

Metrics:

MC1/MC2/MC3 (TruthfulQA)
Truth*Info score (TruthfulQA Generation)
% Reject (TruthfulQA Generation)
Exact Match / F1 (QA tasks)
Accuracy (FACTOR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA (Multiple Choice)	MC1	28.0	28.9	+0.9
TruthfulQA (Multiple Choice)	MC2	44.7	45.5	+0.8
TruthfulQA (Generation)	Truth*Info	39.95	48.33	+8.38
TruthfulQA (Generation)	% Reject	23.26	8.45	-14.81
Natural Questions	Accuracy (Exact Match implied)	23.7	26.1	+2.4
FACTOR (Expert)	Accuracy	55.1	56.4	+1.3
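
The Δ column reports absolute differences, whereas the Evaluation Highlights quote percentages. A short calculation over the table rows shows how the two relate; the helper name `deltas` is just for this illustration.

```python
# (baseline, this paper) pairs copied from the Key Results table
rows = {
    "TruthfulQA MC1": (28.0, 28.9),
    "TruthfulQA MC2": (44.7, 45.5),
    "TruthfulQA Truth*Info": (39.95, 48.33),
    "TruthfulQA % Reject": (23.26, 8.45),
    "Natural Questions": (23.7, 26.1),
    "FACTOR (Expert)": (55.1, 56.4),
}

def deltas(baseline, new):
    """Return (absolute difference, relative change in percent)."""
    absolute = new - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

for name, (baseline, new) in rows.items():
    absolute, relative = deltas(baseline, new)
    print(f"{name:24s} abs {absolute:+6.2f}   rel {relative:+6.1f}%")
```

The Truth*Info row works out to roughly 21% relative and the Natural Questions row to roughly 10.1% relative, which matches the percentages quoted in the highlights.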

Experiment Figures

Analysis of hidden state prediction changes. Left: Step-level KL divergence across layers. Right: Token-level probability growth across layers for specific vocabulary candidates.

Main Takeaways

Significantly improves 'informativeness' in open-ended generation by reducing refusal rates ('I have no comment') while maintaining high truthfulness
Consistent improvements across model scales (7B, 13B, 70B) and families (Llama-2, Qwen, Mistral), though 70B shows smaller relative gains due to stronger base performance
Effectively balances boosting factual tokens without degrading basic QA capabilities, unlike some intervention methods that trade off general performance

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (layers, hidden states, logits)
Language model decoding strategies (greedy, sampling)
Entropy (Shannon entropy) as a measure of distribution sharpness
Concept of 'internalization' of knowledge in LLM layers
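
As a quick refresher on entropy as a sharpness measure, the sketch below compares a flat and a peaked distribution. The two example distributions are arbitrary; the function is the standard Shannon entropy in nats.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

flat = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly: maximal entropy
sharp = [0.01, 0.01, 0.01, 0.97]   # mass concentrated on one entry: low entropy

print("flat :", round(shannon_entropy(flat), 4))
print("sharp:", round(shannon_entropy(sharp), 4))
```

The flat distribution reaches the maximum log(4), and the peaked one is far lower, which is the property END relies on when it reads a low cross-layer entropy as a sharp growth trend.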

Key Terms

cross-layer entropy: A metric proposed in this paper: the entropy of the distribution formed by collecting one token's predicted probabilities across multiple hidden layers. A low value means the probability is concentrated in a few layers, i.e. a sharp growth trend

DoLa: Decoding by Contrasting Layers—a baseline method that contrasts logits from a specific early layer against the final layer to amplify factual knowledge

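MC1/MC2/MC3: Metrics for TruthfulQA multiple-choice tasks; MC1 is the accuracy of ranking the single best answer highest, MC2 is the normalized probability mass assigned to the set of true answers, and MC3 is the fraction of true answers that score above every false answer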
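
The sketch below shows how these three metrics are commonly computed for a single question from per-answer log-probability scores. It follows the convention used in widely shared TruthfulQA evaluation scripts as best understood here, so treat the exact definitions as an assumption; the function names and the example scores are hypothetical.

```python
import math

def mc1(true_scores, false_scores, best_index=0):
    """1.0 if the designated best true answer outscores every false answer."""
    return 1.0 if true_scores[best_index] > max(false_scores) else 0.0

def mc2(true_scores, false_scores):
    """Share of probability mass (exp of log-scores) placed on true answers."""
    p_true = sum(math.exp(s) for s in true_scores)
    p_false = sum(math.exp(s) for s in false_scores)
    return p_true / (p_true + p_false)

def mc3(true_scores, false_scores):
    """Fraction of true answers that outscore every false answer."""
    max_false = max(false_scores)
    return sum(s > max_false for s in true_scores) / len(true_scores)

# Hypothetical log-probabilities for one question
true_scores = [-1.0, -2.0]
false_scores = [-1.5, -3.0]

print("MC1:", mc1(true_scores, false_scores))
print("MC2:", round(mc2(true_scores, false_scores), 3))
print("MC3:", mc3(true_scores, false_scores))
```

Benchmark-level numbers are then averages of these per-question values, usually reported as percentages.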

Truth*Info: A composite metric for open-ended generation combining truthfulness and informativeness scores from a GPT-based judge

hidden state: The vector representation of the input at a specific layer of the Transformer model before it is projected into vocabulary logits

logits: Raw, unnormalized prediction scores generated by the model's output head

KL-divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from a second, reference distribution (not a true distance, since swapping the two distributions changes the value)
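
A small numeric example makes the definition concrete. The two distributions are arbitrary, and the function assumes q is nonzero wherever p is nonzero.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

print("KL(p || q):", round(kl_divergence(p, q), 4))
print("KL(q || p):", round(kl_divergence(q, p), 4))
print("KL(p || p):", kl_divergence(p, p))
```

The two directions give different values, and the divergence of a distribution from itself is zero.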