DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

📝 Paper Summary

Hallucination suppression, Inference-time intervention

DoLa improves LLM factuality by contrasting output logits from mature later layers against premature earlier layers, amplifying factual knowledge that emerges only in the final stages of processing.

Core Problem

LLMs frequently hallucinate content inconsistent with pretraining facts because the maximum likelihood objective encourages mass-seeking behavior, which leads the model to rely on superficial linguistic patterns from earlier layers rather than on factual knowledge.

Why it matters:

Deployment in high-stakes fields (clinical, legal) is bottlenecked by the generation of untrustworthy text
Existing solutions often require expensive retrieval (RAG) or additional fine-tuning (RLHF), which adds complexity and training cost
Linguistic patterns (syntax) often dominate generation probability even when the factual content is incorrect

Concrete Example: For the prompt 'The capital of Washington is...', the token 'Seattle' maintains high probability across all layers due to syntactic plausibility. The correct answer 'Olympia' only sees a probability spike in the final layers. Standard decoding might therefore pick 'Seattle', whereas contrasting the layers reveals 'Olympia' as the factual choice.

Key Novelty

Decoding by Contrasting Layers (DoLa)

Leverages the observation that lower transformer layers encode linguistic/syntactic information, while factual knowledge tends to localize in higher layers
Dynamically selects a 'premature' layer based on Jensen-Shannon Divergence and subtracts its log-probabilities from the final 'mature' layer to cancel out non-factual linguistic noise
Requires no external retrieval, no model fine-tuning, and only adds a small latency overhead during decoding
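
The selection and contrast steps described above can be written compactly. The notation below is an assumption introduced for this summary: N is the final (mature) layer, \mathcal{J} is the set of candidate premature layers, h_t^{(j)} is the hidden state of layer j at step t, and \phi is the model's vocabulary head. The log-ratio is applied only to tokens that pass the plausibility filter.

```latex
\begin{aligned}
q_j(x_t \mid x_{<t}) &= \mathrm{softmax}\big(\phi(h_t^{(j)})\big)_{x_t}, \qquad j \in \{1, \dots, N\} \\
M &= \arg\max_{j \in \mathcal{J}} \; \mathrm{JSD}\big(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\big) \\
\mathcal{F}(x_t) &= \log \frac{q_N(x_t \mid x_{<t})}{q_M(x_t \mid x_{<t})} \\
\hat{p}(x_t \mid x_{<t}) &= \mathrm{softmax}\big(\mathcal{F}\big)_{x_t}
\end{aligned}
```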

Architecture

Overview of the DoLa decoding process during inference.

Evaluation Highlights

Improves TruthfulQA scores by 12-17 absolute percentage points across LLaMA family models (7B to 65B), matching methods that use supervised fine-tuning (ITI)
Outperforms Contrastive Decoding (CD) on multiple choice and open-ended generation, improving 'Truth*Info' scores significantly without the high refusal rate seen in CD
Demonstrates consistent gains on reasoning tasks (StrategyQA, GSM8K), boosting accuracy by 1-4% while baselines like CD often degrade performance

Breakthrough Assessment

8/10

Simple, effective, and parameter-free method that significantly reduces hallucinations. It offers a practical inference-only solution to a major LLM problem without requiring retraining or external modules.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive next-token prediction focused on maximizing factuality without external resources

Inputs: Context token sequence x_{<t}

Outputs: Probability distribution over the next token x_t

Pipeline Flow

Forward Pass (compute hidden states for all layers)
Premature Layer Selection (calculate JSD between candidate early layers and final layer)
Logit Contrast (subtract premature-layer log-probabilities from mature-layer log-probabilities)
Adaptive Plausibility Constraint (filter low-probability tokens)
Next Token Selection (sampling or greedy decoding)
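
The five steps above can be sketched end to end on a toy vocabulary. This is a minimal illustration, not the authors' implementation: the function name `dola_step`, the three-token vocabulary, and all logit values are made up, and the forward pass is assumed to have already produced one logit vector per layer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(r):
        return sum(a * math.log(a / m) for a, m in zip(r, mid) if a > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def dola_step(layer_logits, candidate_layers, alpha=0.1):
    """One greedy step; layer_logits[-1] is treated as the mature layer."""
    # Step 1 (forward pass) is assumed done: one logit vector per layer.
    probs = [softmax(row) for row in layer_logits]
    mature = probs[-1]
    # Step 2: pick the candidate layer that diverges most from the mature layer.
    premature_idx = max(candidate_layers, key=lambda j: jsd(mature, probs[j]))
    premature = probs[premature_idx]
    # Steps 3 and 4: log-probability contrast, masking implausible tokens.
    threshold = alpha * max(mature)
    scores = [
        math.log(m / p) if m >= threshold else float("-inf")
        for m, p in zip(mature, premature)
    ]
    # Step 5: greedy selection over the contrasted scores.
    token_id = max(range(len(scores)), key=scores.__getitem__)
    return token_id, premature_idx, scores

# Hypothetical values: two early layers followed by the final layer.
vocab = ["Seattle", "Olympia", "the"]
layer_logits = [[2.0, 0.0, 1.0], [2.2, 0.5, 0.5], [2.0, 1.8, -2.0]]
token_id, premature_idx, scores = dola_step(layer_logits, candidate_layers=[0, 1])
baseline_id = max(range(len(vocab)), key=layer_logits[-1].__getitem__)
print(vocab[baseline_id], "->", vocab[token_id])
```

With these made-up numbers, greedy decoding on the final layer alone picks 'Seattle', while the contrasted scores pick 'Olympia', and 'the' is masked by the plausibility filter.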

System Modules

Backbone LLM

Generate hidden states for all layers

Model or implementation: LLaMA (7B, 13B, 33B, 65B)

Dynamic Layer Selector (Decoding Strategy)

Identify which early layer differs most from the final layer to serve as the contrast baseline

Model or implementation: JSD Calculation
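
A small sketch of the selector follows, showing that the chosen layer can differ from token to token. The helper name `select_premature_layer`, the layer indices, and both sets of distributions are hypothetical and chosen only to illustrate the argmax-over-JSD rule.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (natural log)."""
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(r):
        return sum(a * math.log(a / m) for a, m in zip(r, mid) if a > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def select_premature_layer(mature, candidates):
    """Return the candidate layer index whose distribution is farthest from mature.

    candidates maps a layer index to that layer's next-token distribution.
    """
    return max(candidates, key=lambda j: jsd(mature, candidates[j]))

# Token whose prediction settles early: only the lowest layer still disagrees.
easy_mature = [0.90, 0.05, 0.05]
easy_layers = {0: [0.40, 0.30, 0.30], 8: [0.85, 0.10, 0.05], 16: [0.90, 0.05, 0.05]}

# Token whose prediction keeps changing: a late layer still disagrees strongly.
hard_mature = [0.45, 0.50, 0.05]
hard_layers = {0: [0.50, 0.30, 0.20], 8: [0.70, 0.20, 0.10], 16: [0.90, 0.05, 0.05]}

easy_choice = select_premature_layer(easy_mature, easy_layers)
hard_choice = select_premature_layer(hard_mature, hard_layers)
print(easy_choice, hard_choice)
```

In this toy setup the first token selects layer 0 and the second selects layer 16, which is the sense in which the premature layer is dynamic rather than fixed.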

Contrastive Head (Decoding Strategy)

Compute final probability distribution by contrasting layers

Model or implementation: Log-prob subtraction
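
The sketch below illustrates why the log-probability subtraction is paired with a plausibility filter. The function name `contrast_scores`, the vocabulary, and the probabilities are assumptions for illustration; only the threshold value alpha = 0.1 comes from the summary.

```python
import math

def contrast_scores(mature, premature, alpha=None):
    """Per-token log(q_mature / q_premature).

    If alpha is given, tokens with mature probability below
    alpha * max(mature) are masked to -inf.
    """
    threshold = alpha * max(mature) if alpha is not None else 0.0
    return [
        math.log(m / p) if m >= threshold else float("-inf")
        for m, p in zip(mature, premature)
    ]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

vocab = ["Olympia", "Seattle", "zebra"]
mature = [0.50, 0.498, 0.002]        # hypothetical final-layer distribution
premature = [0.10, 0.8999, 0.0001]   # hypothetical early-layer distribution

unfiltered = contrast_scores(mature, premature)
filtered = contrast_scores(mature, premature, alpha=0.1)
print(vocab[argmax(unfiltered)], vocab[argmax(filtered)])
```

Without the filter, 'zebra' wins because its tiny mature probability is still 20 times its premature probability. With the filter, it is masked and 'Olympia' is selected.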

Novel Architectural Elements

Dynamic premature layer selection mechanism based on Jensen-Shannon Divergence within a single model pass
Layer-contrastive projection: projecting intermediate hidden states to the vocabulary space to contrast with the final layer
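
As a rough sketch of the projection idea, the same vocabulary head that is normally applied to the final hidden state can be applied to an intermediate hidden state as well. The 2-dimensional hidden states, the 3-token head matrix, and the layer indices below are toy values, and any normalization a real model applies before its head is omitted here.

```python
def project_to_vocab(hidden, lm_head):
    """Multiply the (vocab_size x hidden_dim) head matrix by one hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

# Hypothetical shared vocabulary head: 3 tokens, hidden size 2.
lm_head = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Hypothetical hidden states for one position at an early and the final layer.
hidden_by_layer = {8: [2.0, 0.5], 32: [1.0, 3.0]}

# One logit vector per layer, all produced by the same head.
logits_by_layer = {
    layer: project_to_vocab(h, lm_head) for layer, h in hidden_by_layer.items()
}
print(logits_by_layer)
```

Because every layer shares the head, the extra cost per candidate layer is one matrix-vector product rather than a second forward pass.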

Modeling

Base Model: LLaMA (7B, 13B, 33B, 65B)

Training Method: Not applicable (inference-only method)

Adaptation: None

Trainable Parameters: 0 (Frozen model)

Key Hyperparameters:

adaptive_plausibility_constraint_alpha: 0.1
repetition_penalty_theta: 1.2
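
For orientation, a repetition penalty of this kind is commonly applied by scaling down the scores of tokens that were already generated. The sketch below follows the widely used variant that divides positive scores by theta and multiplies negative scores by theta. That rule, the function name `apply_repetition_penalty`, and the example values are assumptions; the summary only states theta = 1.2 and does not say where in the pipeline the penalty is applied.

```python
def apply_repetition_penalty(scores, generated_ids, theta=1.2):
    """Return a copy of scores with already-generated tokens made less likely."""
    penalized = list(scores)
    for token_id in set(generated_ids):
        if penalized[token_id] > 0:
            penalized[token_id] = penalized[token_id] / theta
        else:
            penalized[token_id] = penalized[token_id] * theta
    return penalized

scores = [2.4, -1.0, 0.6]   # hypothetical per-token scores
generated_ids = [0, 1, 0]   # tokens 0 and 1 were already emitted
penalized = apply_repetition_penalty(scores, generated_ids)
print(penalized)
```

Token 2 is untouched because it has not been generated yet, while both repeated tokens move toward lower scores regardless of sign.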

Compute: Single forward pass per token (standard inference), plus small overhead for projecting intermediate layers to vocabulary

Comparison to Prior Work

vs. Contrastive Decoding (CD): DoLa requires only one model (self-contrast) rather than two separate models, avoiding the need for a separate amateur model and reducing memory footprint.
vs. Inference Time Intervention (ITI): DoLa does not require any training data or classifier training; it is purely inference-time.
vs. Early Exit [not cited in paper]: Early exit seeks to speed up inference by stopping early; DoLa uses early layers to *correct* the final layer, not replace it.

Limitations

Relies on the assumption that factual knowledge is localized in higher layers, which may vary by model architecture or training paradigm.
Requires projecting intermediate hidden states to the vocabulary space, which might not always yield semantically meaningful distributions for all layers.
Introduces a small latency overhead due to computing extra projections and divergences at each step.

Reproducibility

Code: https://github.com/voidism/DoLa

The method uses standard pretrained LLaMA models. Candidate layer buckets for dynamic selection must be chosen on a validation set (e.g., 2-4 runs).

📊 Experiments & Results

Evaluation Setup

Zero-shot factuality and reasoning evaluation across multiple benchmarks

Benchmarks:

TruthfulQA (Multiple Choice and Open-ended Generation)
FACTOR (Multiple Choice, News/Wiki)
StrategyQA (Chain-of-Thought Reasoning)
GSM8K (Arithmetic Reasoning)
Vicuna QA (Open-ended Chatbot Evaluation, GPT-4 rated)

Metrics:

MC1/MC2/MC3 (Multiple Choice Accuracy)
Truthfulness %
Informativeness %
%Truth*Info (Combined Metric)
Accuracy (for reasoning tasks)
Statistical methodology: Not explicitly reported in the paper

Key Results

DoLa significantly improves TruthfulQA scores across all LLaMA model sizes, often matching or beating supervised intervention methods like ITI.
Benchmark	Metric	Baseline	This Paper	Δ
TruthfulQA (Multiple Choice)	MC3 Score	26.9	32.6	+5.7
TruthfulQA (Multiple Choice)	MC3 Score	29.7	32.6	+2.9
TruthfulQA (Generation)	%Truth*Info	36.2	53.4	+17.2
TruthfulQA (Generation)	%Truth*Info	34.0	53.4	+19.4

Performance on reasoning benchmarks shows DoLa preserves or improves reasoning capabilities, unlike Contrastive Decoding which often degrades them.
Benchmark	Metric	Baseline	This Paper	Δ
StrategyQA (CoT)	Accuracy	65.5	67.5	+2.0
StrategyQA (CoT)	Accuracy	62.4	67.5	+5.1
GSM8K (CoT)	Accuracy	26.9	29.1	+2.2

Experiment Figures

Motivating example: Evolution of probability for 'Seattle' vs 'Olympia' across layers for the prompt 'The capital of Washington is...'

JSD values across layers for different token types during generation.

Main Takeaways

Contrasting layers within a single model effectively surfaces factual knowledge while suppressing linguistic hallucinations.
For short-answer tasks (TruthfulQA), contrasting against higher layers generally works better, while lower layers are more useful contrasts for reasoning tasks (StrategyQA, GSM8K) and longer-answer tasks (FACTOR).
Dynamic layer selection is crucial: the 'premature' layer is not static and changes based on token difficulty.
DoLa avoids the 'refusal' problem seen in Contrastive Decoding (where models answer 'I have no comment'), maintaining high informativeness scores.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (layers, logits, vocabulary heads)
Language model decoding strategies (greedy, sampling)
Kullback-Leibler (KL) and Jensen-Shannon Divergence (JSD)

Key Terms

DoLa: Decoding by Contrasting Layers, the proposed method that subtracts an earlier layer's log-probabilities from the final layer's log-probabilities

Premature Layer: An intermediate transformer layer selected dynamically to represent lower-level linguistic information

Mature Layer: The final transformer layer containing fully processed, semantic, and factual information

Contrastive Decoding (CD): A baseline method that contrasts logits between a small 'amateur' model and a large 'expert' model

JSD: Jensen-Shannon Divergence, a symmetric measure of dissimilarity between two probability distributions, used here to select the premature layer

APC: Adaptive Plausibility Constraint—a filtering technique that masks tokens with low probability in the expert/mature model to prevent implausible tokens from being boosted

ITI: Inference Time Intervention—a baseline method that shifts model activations during inference using a classifier trained on truthful data

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
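
For reference, the JSD and APC entries above correspond to the following standard definitions. The symbols are assumptions introduced for this summary: P and Q are two next-token distributions, q_N is the mature-layer distribution, \mathcal{X} is the vocabulary, and \alpha is the plausibility threshold (0.1 in the listed hyperparameters).

```latex
\begin{aligned}
\mathrm{JSD}(P \,\|\, Q) &= \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M),
\qquad M = \tfrac{1}{2}(P + Q) \\
\mathrm{KL}(P \,\|\, M) &= \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{M(x)} \\
\mathcal{V}_{\text{head}}(x_{<t}) &= \Big\{ x \in \mathcal{X} : q_N(x \mid x_{<t}) \ge \alpha \max_{w \in \mathcal{X}} q_N(w \mid x_{<t}) \Big\}
\end{aligned}
```

Tokens outside \mathcal{V}_{\text{head}} receive a score of -\infty, so they cannot be selected no matter how large their log-ratio is.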