DoLa: Decoding by contrasting layers improves factuality in LLMs

📝 Paper Summary

Hallucination suppression, Decoding strategies

DoLa improves LLM factuality by contrasting output logits from mature final layers against premature lower layers to amplify factual knowledge, without retrieval or fine-tuning.

Core Problem

Large language models frequently hallucinate by generating content that deviates from real-world facts, often prioritizing linguistic patterns (mass-seeking behavior) over factual accuracy.

Why it matters:

Hallucinations prevent safe deployment in high-stakes applications like clinical or legal settings where trustworthiness is crucial
Existing solutions often require expensive external retrieval, additional fine-tuning, or human labels, which may not always be feasible
Language models tend to learn 'lower-level' linguistic information in early layers and semantic/factual information in later layers, but standard decoding doesn't exploit this distinction

Concrete Example: When asked 'On what date was the Declaration of Independence officially signed?', a standard LLaMA model predicts 'July 4, 1776' (a common but factually incorrect date for the signing). DoLa contrasts layers to suppress this common misconception and correctly predicts 'August 2, 1776'.

Key Novelty

Decoding by Contrasting Layers (DoLa)

exploits the modular evolution of knowledge in transformers, where lower layers encode linguistic patterns and higher layers encode facts
dynamically selects a 'premature' layer based on Jensen-Shannon Divergence and subtracts its log-probabilities from the final layer to cancel out non-factual linguistic noise
amplifies the signal of factual knowledge that emerges only in the later layers of the model

Architecture

Conceptual illustration of DoLa. It shows the transformer layers processing a query about a capital city. The probabilities of the correct fact ('Olympia') rise in higher layers while common tokens ('Seattle') stay constant.

Evaluation Highlights

Improves TruthfulQA scores by 12-17% (absolute points) across LLaMA family models (7B to 65B)
Raises LLaMA-65B performance on TruthfulQA to 54.3% (%Truth*Info), rivaling methods that require supervised fine-tuning like ITI
Enhances reasoning on StrategyQA by up to 4% accuracy, showing benefits for chain-of-thought tasks

Breakthrough Assessment

8/10

Simple, inference-time-only method that yields significant double-digit gains in factuality without training or retrieval. Highly practical.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive next-token prediction focused on maximizing factual accuracy

Inputs: Input sequence context x_{<t}

Outputs: Probability distribution over the next token x_t

Pipeline Flow

Compute logits at all candidate early layers (early exit)
Compute logits at the final N-th layer (mature layer)
Select premature layer M via max Jensen-Shannon Divergence
Contrast log-probabilities (mature minus premature), restricted to tokens that pass the adaptive plausibility constraint (APC)
Apply Softmax to get final distribution
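
In symbols, the five steps above can be written roughly as follows. This is a sketch in notation assumed here for illustration: h_t^{(j)} is the hidden state at layer j, \phi is the shared vocabulary head, \mathcal{J} is the set of candidate premature layers, N is the final layer, and \alpha is the APC threshold.

```latex
\begin{align*}
q_j(x_t \mid x_{<t}) &= \operatorname{softmax}\big(\phi(h_t^{(j)})\big)_{x_t},
  \qquad j \in \mathcal{J} \cup \{N\} \\
M &= \arg\max_{j \in \mathcal{J}}
  \operatorname{JSD}\big(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\big) \\
\mathcal{V}_{\text{head}} &= \big\{ x_t : q_N(x_t \mid x_{<t}) \ge \alpha \max_{w} q_N(w \mid x_{<t}) \big\} \\
F(x_t) &=
  \begin{cases}
    \log \dfrac{q_N(x_t \mid x_{<t})}{q_M(x_t \mid x_{<t})} & \text{if } x_t \in \mathcal{V}_{\text{head}} \\[2ex]
    -\infty & \text{otherwise}
  \end{cases} \\
\hat{p}(x_t \mid x_{<t}) &= \operatorname{softmax}(F)_{x_t}
\end{align*}
```

Tokens outside \mathcal{V}_{\text{head}} receive a score of -\infty, so they end up with zero probability after the final softmax.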

System Modules

Early Exit Projection

Project hidden states from intermediate layers to vocabulary space to get candidate distributions

Model or implementation: Same LLaMA backbone (shared head)
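
A minimal NumPy sketch of the early-exit projection, under stated assumptions: `hidden_states` holds one vector per layer for the current position, `head_weight` stands in for the shared vocabulary head, and any final normalization the real model applies before the head is omitted. The function name and the toy shapes are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def early_exit_log_probs(hidden_states, head_weight, layers):
    """Project selected layers' hidden states to vocabulary log-probabilities.

    hidden_states: array of shape (num_layers + 1, d_model), index 0 = embeddings
    head_weight:   array of shape (vocab_size, d_model), shared by all layers
    layers:        iterable of layer indices to project
    Returns a dict mapping layer index -> log-probabilities of shape (vocab_size,).
    """
    return {j: log_softmax(hidden_states[j] @ head_weight.T) for j in layers}

# Toy example: 8 layers plus embeddings, d_model = 16, vocabulary of 50 tokens.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(9, 16))
head_weight = rng.normal(size=(50, 16))
log_probs = early_exit_log_probs(hidden_states, head_weight, layers=[0, 2, 4, 6, 8])
```

Because the same head is reused for every layer, the only extra cost is one matrix-vector product per candidate layer.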

Dynamic Layer Selector

Identify which early layer is most distinct from the final layer to serve as the contrast target

Model or implementation: JSD Calculation
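
The selection step can be sketched as below. This is an illustrative NumPy version, not the repository's code: `jsd` and `select_premature_layer` are hypothetical names, and the three-token distributions are made up for the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_premature_layer(mature, candidates):
    """Return the candidate layer whose distribution is farthest from `mature`.

    mature:     probability vector from the final layer
    candidates: dict mapping layer index -> probability vector
    """
    return max(candidates, key=lambda j: jsd(mature, candidates[j]))

mature = np.array([0.7, 0.2, 0.1])
candidates = {
    2: np.array([0.1, 0.2, 0.7]),  # very different from the final layer
    4: np.array([0.5, 0.3, 0.2]),  # somewhat different
    6: np.array([0.7, 0.2, 0.1]),  # identical to the final layer
}
chosen = select_premature_layer(mature, candidates)
```

In this toy case layer 2 is selected, since its distribution differs most from the final layer's. The selection is repeated at every decoding step, so different tokens can be contrasted against different layers.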

Contrastive Decoding Head

Compute final probabilities by subtracting the premature layer's log-probabilities from the mature layer's, then renormalizing

Model or implementation: Logit subtraction + Softmax
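
A small sketch of the contrast step with APC filtering, assuming both layers' outputs are already available as log-probabilities. The function name `dola_contrast`, the default `alpha`, and the four-token distributions are illustrative assumptions.

```python
import numpy as np

def dola_contrast(mature_logp, premature_logp, alpha=0.1):
    """Contrast two layers' log-probabilities under a plausibility constraint.

    Tokens whose mature probability is below alpha * (max mature probability)
    are masked out; the rest are scored by log q_N - log q_M and renormalized.
    """
    # APC: q_N(x) >= alpha * max_w q_N(w), evaluated in log space.
    keep = mature_logp >= np.log(alpha) + mature_logp.max()
    scores = np.where(keep, mature_logp - premature_logp, -np.inf)
    # Softmax over the surviving tokens (masked tokens get probability 0).
    scores = scores - scores[keep].max()
    probs = np.exp(scores)
    return probs / probs.sum()

# Token 0 is the frequent-but-wrong answer, token 1 the fact that emerges late.
mature = np.log(np.array([0.45, 0.40, 0.10, 0.05]))
premature = np.log(np.array([0.70, 0.15, 0.10, 0.05]))
p = dola_contrast(mature, premature, alpha=0.2)
```

Here token 0 is the most likely token under the final layer alone, but token 1 becomes the most likely after the contrast because its probability grew the most between the two layers. Token 3 falls below the plausibility threshold and is removed.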

Novel Architectural Elements

Dynamic premature layer selection mechanism based on Jensen-Shannon Divergence
Intra-model contrastive decoding (contrasting layers within one model rather than two separate models)

Modeling

Base Model: LLaMA (7B, 13B, 33B, 65B)

Compute: Inference only. Decoding latency is 1.01x to 1.08x that of baseline decoding.

Comparison to Prior Work

vs. CD: DoLa requires only one model (contrasts layers instead of separate models) and outperforms CD on factuality
vs. ITI: DoLa requires no labeled data or training of probing classifiers
vs. ACD: DoLa targets factuality in large LMs (LLaMA) via dynamic layer selection, whereas ACD targets diversity in small LMs via fixed layers and, as the paper notes, often increases hallucinations

Limitations

Cannot correct misinformation acquired during pre-training (relies on internal knowledge)
Does not ground generation on external retrieved knowledge
Slight latency overhead (up to 8%) compared to greedy decoding
Performance depends on appropriate selection of candidate layer buckets

Reproducibility

Code: https://github.com/voidism/DoLa

Code publicly available. Method relies on standard pre-trained LLaMA models. Hyperparameters (buckets for candidate layers) are provided for each model size.
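
To make the idea of candidate layer buckets concrete, here is a hypothetical helper that splits a model's layers into contiguous buckets and keeps every second layer in each. The function name, the bucket count, and the stride are assumptions for illustration; the values actually used for each model size are the ones provided with the released code.

```python
def make_layer_buckets(num_layers, num_buckets, stride=2):
    """Split layer indices [0, num_layers) into contiguous buckets.

    Each bucket keeps every `stride`-th layer as a candidate premature layer.
    Raises ValueError if the layers cannot be divided evenly.
    """
    if num_buckets <= 0 or num_layers % num_buckets != 0:
        raise ValueError("num_layers must be a positive multiple of num_buckets")
    size = num_layers // num_buckets
    return [
        list(range(b * size, (b + 1) * size, stride))
        for b in range(num_buckets)
    ]

# Illustrative only: a 32-layer model split into two buckets.
buckets = make_layer_buckets(32, 2)
```

One bucket is then chosen per task, for example on a validation set, and the dynamic selector only searches among the layers in that bucket.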

📊 Experiments & Results

Evaluation Setup

Multiple choice and open-ended generation tasks focused on factuality and reasoning

Benchmarks:

TruthfulQA: short-answer factuality (multiple choice and generation)
FACTOR: long-paragraph factuality (multiple choice)
StrategyQA: chain-of-thought reasoning
GSM8K: arithmetic reasoning
Vicuna QA: instruction following (chatbot)

Metrics:

%Truth*Info (TruthfulQA)
Accuracy (StrategyQA, GSM8K, FACTOR)
GPT-4 Ratings (Vicuna QA)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
DoLa improves factuality on TruthfulQA (open-ended generation) in most of the comparisons below against baseline decoding and standard Contrastive Decoding; one listed row shows a decline (-6.1).
TruthfulQA (Open-Ended)	%Truth*Info	30.4	42.1	+11.7
TruthfulQA (Open-Ended)	%Truth*Info	38.8	48.8	+10.0
TruthfulQA (Open-Ended)	%Truth*Info	62.5	56.4	-6.1
TruthfulQA (Open-Ended)	%Truth*Info	31.7	49.1	+17.4
TruthfulQA (Open-Ended)	%Truth*Info	34.8	49.2	+14.4
DoLa improves chain-of-thought reasoning accuracy on StrategyQA compared to baselines.
StrategyQA	Accuracy	66.6	67.6	+1.0
StrategyQA	Accuracy	70.5	72.9	+2.4
On FACTOR (Wiki), DoLa improves multiple choice accuracy.
FACTOR (Wiki)	Accuracy	62.6	66.2	+3.6

Experiment Figures

Heatmap of Jensen-Shannon Divergence (JSD) between the final layer and early layers for each token in a sentence.

GPT-4 evaluation of Vicuna QA (chatbot) responses.

Main Takeaways

DoLa consistently improves truthfulness across multiple scales of LLaMA (7B-65B) without fine-tuning, often matching or beating supervised intervention methods (ITI).
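Dynamic layer selection is crucial; for 'factual' tokens such as names and dates, divergence from the final layer usually stays high into the upper layers, while for function words it drops off after the lower/middle layers, suggesting those predictions are settled early.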
In reasoning tasks (StrategyQA, GSM8K), contrasting with lower layers (early buckets) works better than higher layers, unlike in short-fact tasks.
DoLa incurs minimal latency cost (1-8%) compared to baselines, making it practical for real-world deployment.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (layers, logits, vocabulary head)
Next-token prediction / Autoregressive decoding
Kullback-Leibler (KL) Divergence / Jensen-Shannon (JS) Divergence
Logits and Softmax

Key Terms

hallucinations: Generated content that deviates from real-world facts observed during pretraining

DoLa: Decoding by Contrasting Layers, a strategy that subtracts lower-layer log-probabilities from final-layer log-probabilities to amplify factual signals

logits: Raw, unnormalized scores output by the model before the softmax function converts them into probabilities

Jensen-Shannon Divergence (JSD): A symmetric, bounded measure of how different two probability distributions are; used here to find the candidate layer whose output distribution differs most from the final layer's

early exit: Obtaining a prediction from an intermediate layer of a neural network rather than processing all the way to the final layer

premature layer: An early or middle transformer layer selected for contrast because it contains linguistic patterns but lacks full factual knowledge

mature layer: The final transformer layer, assumed to contain the most complete semantic and factual information

contrastive decoding: A decoding method that finds tokens with high probability in an 'expert' model but low probability in an 'amateur' model; DoLa adapts this to layers within one model

adaptive plausibility constraint (APC): A filtering rule that sets the probability of a token to zero if its likelihood in the expert/mature model falls below a fraction α of the most likely token's probability, preventing implausible outputs

TruthfulQA: A benchmark designed to measure whether language models generate false answers that mimic human misconceptions

Chain-of-Thought (CoT): A prompting strategy where the model generates intermediate reasoning steps before the final answer