PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality

📝 Paper Summary

Hallucination suppression, Contrastive Decoding

PruneCD improves large language model factuality by using a layer-pruned version of the model itself as an amateur for contrastive decoding, yielding more informative contrasts than early exit methods.

Core Problem

Existing contrastive decoding methods like DoLa use 'early exit' logits as a negative reference, but these logits are often flat, low-magnitude, and uninformative, failing to provide a meaningful contrast for correcting hallucinations.

Why it matters:

Hallucinations in LLMs generate fluent but factually incorrect outputs, especially for underrepresented knowledge
Current single-model contrastive methods (like DoLa) often select the earliest possible exit layer, resulting in weak signals that cannot effectively suppress incorrect tokens
Deploying separate amateur models for contrastive decoding is computationally expensive; single-model solutions are preferred but need to be effective

Concrete Example: When asked 'Who was the next British Prime Minister after Arthur Balfour?', the expert model might hallucinate 'Herbert Henry Asquith'. DoLa's early exit logits are flat (near-uniform distribution) and offer no strong signal to correct this. PruneCD's pruned model retains structure but is less confident, allowing the contrast to successfully amplify the correct answer 'Campbell-Bannerman'.

Key Novelty

PruneCD (Contrastive Decoding via Layer Pruning)

Constructs the 'amateur' model by skipping specific intermediate layers (pruning) rather than exiting early, preserving the final sharpening layers while degrading factual content
Identifies optimal pruning layers via an efficient ablation search that maximizes the drop in truthfulness on a validation set
Implements contrastive decoding in a single batched forward pass (batch index 0 = full model, batch index 1 = pruned model) to minimize latency
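
The difference between early exit and layer pruning can be illustrated with a toy stack of layers. This is a minimal sketch, not the authors' code: `run_layers`, `make_layer`, and the additive toy layers are hypothetical stand-ins for real decoder blocks, and skipping a block is modeled as passing the hidden state through unchanged (which is what the residual path would do).

```python
from typing import Callable, FrozenSet, List, Optional

Layer = Callable[[List[float]], List[float]]


def run_layers(hidden: List[float], layers: List[Layer],
               skip: FrozenSet[int] = frozenset(),
               stop_after: Optional[int] = None) -> List[float]:
    """Apply layers in order, skipping the indices in `skip`.

    If `stop_after` is given, only the first `stop_after` layers run (early exit).
    """
    h = list(hidden)
    for i, layer in enumerate(layers):
        if stop_after is not None and i >= stop_after:
            break  # early exit: every later layer is dropped
        if i in skip:
            continue  # pruning: the hidden state passes through unchanged
        h = layer(h)
    return h


def make_layer(i: int) -> Layer:
    # Toy layer (assumption): adds i + 1 to every hidden unit.
    return lambda h: [x + (i + 1) for x in h]


layers = [make_layer(i) for i in range(6)]
h0 = [0.0, 1.0]

expert = run_layers(h0, layers)                          # all six layers
early_exit = run_layers(h0, layers, stop_after=2)        # early-exit style: layers 0-1 only
pruned = run_layers(h0, layers, skip=frozenset({2, 3}))  # pruning style: layers 4-5 still run
```

In the pruned variant the final layers still process the hidden state, which matches the summary's point that pruning preserves the late "sharpening" layers that early exit discards.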

Architecture

The overall pipeline of PruneCD, illustrating the offline layer search and the runtime batched inference mechanism.

Evaluation Highlights

+13.67 percentage points in Truthfulness (TruthfulQA Gen) over DoLa on Llama-3.1-8B-Instruct (92.78% vs. 79.11%)
Achieves superior performance on TruthfulQA, TriviaQA, and Natural Questions across multiple model sizes (1B, 3B, 8B) compared to DoLa, Activation Decoding, and END
Maintains inference speed close to greedy decoding (33.7 vs. 35.8 tokens/s, roughly 6% slower) thanks to the batched implementation

Breakthrough Assessment

8/10

Significantly outperforms strong baselines like DoLa and END with a simpler, more intuitive mechanism (pruning vs early exit) while remaining computationally efficient. A practical drop-in replacement for existing decoding strategies.

⚙️ Technical Details

Problem Definition

Setting: Next-token generation where the goal is to maximize the probability of factual tokens while suppressing hallucinated tokens using contrastive decoding.

Inputs: Input token sequence x_<t

Outputs: Next token x_t selected via contrastive distribution

Pipeline Flow

Factual Layer Search (Offline: Identify layers responsible for facts)
Batched Inference (Runtime: Compute expert and amateur logits simultaneously)
Contrastive Decoding (Runtime: Subtract amateur logits from expert logits)

System Modules

Factual Layer Search

Identify the set of layers S* that, when pruned, cause the largest drop in factual accuracy

Model or implementation: Llama-3 models (various sizes)
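
One plausible reading of this search is a greedy loop that, at each step, adds the layer whose removal lowers the validation score the most. The sketch below assumes that reading; `greedy_layer_search`, `toy_score`, and the `IMPORTANCE` table are hypothetical, and in practice `score_fn` would evaluate the pruned model on a validation set (the summary mentions the TruthfulQA MC1 score).

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_layer_search(candidates: Iterable[int], k: int,
                        score_fn: Callable[[FrozenSet[int]], float]) -> List[int]:
    """Greedily pick up to k layers whose pruning minimizes `score_fn`."""
    pool = list(candidates)
    selected: List[int] = []
    for _ in range(k):
        remaining = [layer for layer in pool if layer not in selected]
        if not remaining:
            break
        # Choose the layer that, added to the current set, hurts the score most.
        best = min(remaining,
                   key=lambda layer: score_fn(frozenset(selected + [layer])))
        selected.append(best)
    return sorted(selected)


# Hypothetical per-layer contribution to a validation score (not from the paper).
IMPORTANCE = {0: 0.00, 1: 0.01, 2: 0.20, 3: 0.15, 4: 0.02, 5: 0.00}


def toy_score(pruned: FrozenSet[int]) -> float:
    # Toy scorer: a base score of 0.60 minus the contribution of each pruned layer.
    return 0.60 - sum(IMPORTANCE[layer] for layer in pruned)


best_layers = greedy_layer_search(range(6), k=2, score_fn=toy_score)
```

With this toy scorer the search selects layers 2 and 3, the two with the largest assumed contribution. The candidate pool could be pre-filtered (the summary mentions SLEB-based filtering by perplexity impact) before calling the search.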

Expert/Amateur Forward Pass (Inference)

Generate logits for both the full model (expert) and pruned model (amateur) in parallel

Model or implementation: Shared Transformer Backbone
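
A minimal sketch of how one batched pass could serve both roles, assuming the amateur row simply keeps its previous hidden state at pruned layers. `batched_forward` and the toy layers are hypothetical; a real implementation would operate on tensors and may slice the batch rather than mask it.

```python
from typing import Callable, List, Set

Layer = Callable[[List[float]], List[float]]


def batched_forward(batch: List[List[float]], layers: List[Layer],
                    skip: Set[int], amateur_rows: Set[int]) -> List[List[float]]:
    """Run every row through the shared layers; amateur rows bypass `skip` layers."""
    batch = [list(h) for h in batch]
    for i, layer in enumerate(layers):
        updated = [layer(h) for h in batch]  # stands in for one batched layer call
        if i in skip:
            # Amateur rows keep their old hidden state; expert rows take the update.
            batch = [h if row in amateur_rows else u
                     for row, (h, u) in enumerate(zip(batch, updated))]
        else:
            batch = updated
    return batch


# Toy layers (assumption): layer i adds i + 1 to every hidden unit.
layers = [lambda h, i=i: [x + (i + 1) for x in h] for i in range(6)]
prompt = [0.0, 1.0]

# Row 0 is the expert, row 1 the amateur; both start from the same prompt state.
expert_out, amateur_out = batched_forward(
    [prompt, prompt], layers, skip={2, 3}, amateur_rows={1}
)
```

Because both rows share the same weights and the same layer loop, the extra cost is one additional batch row rather than a second model.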

Contrastive Decoder (Inference)

Adjust next-token probabilities by penalizing the amateur's confidence

Model or implementation: N/A (Mathematical operation)
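
The contrastive step can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formula: the score is taken to be expert log-probability minus `lam` times amateur log-probability, and the `alpha` plausibility cutoff follows the original contrastive decoding formulation rather than anything confirmed for PruneCD. The three-token vocabulary and all logit values are made up to mirror the Prime Minister example.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(expert_logits: List[float], amateur_logits: List[float],
                       lam: float = 1.0, alpha: float = 0.1) -> List[float]:
    """Score tokens by expert log-prob minus lam * amateur log-prob.

    Tokens whose expert probability is below alpha * (max expert probability)
    are masked out so that implausible tokens cannot win the contrast.
    """
    lp_e = log_softmax(expert_logits)
    lp_a = log_softmax(amateur_logits)
    cutoff = max(lp_e) + math.log(alpha)
    return [e - lam * a if e >= cutoff else float("-inf")
            for e, a in zip(lp_e, lp_a)]


def argmax(xs: List[float]) -> int:
    return max(range(len(xs)), key=xs.__getitem__)


vocab = ["Asquith", "Campbell-Bannerman", "the"]
expert_logits = [2.0, 1.8, 0.5]   # illustrative: expert slightly prefers the wrong name
amateur_logits = [2.0, 0.5, 0.4]  # illustrative: amateur strongly prefers the wrong name

greedy_token = vocab[argmax(expert_logits)]
cd_token = vocab[argmax(contrastive_scores(expert_logits, amateur_logits))]
```

With these toy numbers greedy decoding picks "Asquith", while the contrast penalizes the token the amateur is most confident about and selects "Campbell-Bannerman".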

Novel Architectural Elements

Construction of amateur model via layer skipping (pruning) specifically for contrastive decoding, rather than using a separate small model or early exit
Batched inference pipeline where one batch index executes the full model and another executes the pruned topology simultaneously

Modeling

Base Model: Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Llama-3.2-1B-Instruct

Training Method: Inference-time intervention only (no training)

Compute: Inference on an A100 80GB GPU. Throughput is comparable to greedy decoding (e.g., ~33 tokens/s).

Comparison to Prior Work

vs. DoLa: Uses pruned model (skipping intermediate layers but keeping final layers) instead of early exit (truncating final layers); results in more informative amateur logits
vs. ActD/END: Achieves contrastive effects via modified model topology rather than activation/entropy statistics; consistently outperforms on factuality benchmarks
vs. MCD: PruneCD optimizes layer selection specifically for general factuality (TruthfulQA degradation) rather than language domains; outperforms MCD on standard QA tasks

Limitations

Batched inference may increase memory footprint due to additional internal activations, though not observed as significant in small-batch settings
Factual layer search is static and greedy; gradient-based or dynamic search strategies could potentially be more effective
Search granularity is limited to the decoder layer level; finer-grained pruning (attention/feed-forward blocks) is not explored

Reproducibility

Code: https://github.com/hoeng4/PruneCD

The paper details hyperparameters (CD temperature lambda, number of pruned layers k) for each model size and specifies the layer search strategy (MC1 score drop). Evaluation uses standard benchmarks (TruthfulQA, TriviaQA, etc.) and publicly available judge models (fine-tuned Llama-2-7B for TruthfulQA evaluation).

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot question answering and open-ended generation

Benchmarks:

TruthfulQA (Factuality; multiple-choice and generation)
TriviaQA (Knowledge Retrieval)
Natural Questions (NQ) (Knowledge Retrieval)
StrategyQA (StrQA) (Reasoning)
GSM8K (Math Reasoning)
VicunaQA (Instruction Following)

Metrics:

Truthfulness (%Truth)
Informativeness (%Info)
Exact Match (EM)
F1 Score
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results on Llama-3.1-8B-Instruct show PruneCD outperforming baselines on open-ended generation (TruthfulQA) and standard QA tasks.
TruthfulQA Gen	%Truth	79.11	92.78	+13.67
TruthfulQA Gen	%Info	66.46	85.19	+18.73
TriviaQA	EM	67.29	67.49	+0.20
Natural Questions	F1	33.63	35.42	+1.79
Results on smaller models (Llama-3.2-3B) demonstrate PruneCD's effectiveness is scalable and robust.
TruthfulQA Gen	%Truth	72.03	91.39	+19.36
StrategyQA	%Acc	68.47	69.87	+1.40
Fixed hyperparameter analysis shows PruneCD generalizes better than baselines without task-specific tuning.
GSM8K	%Acc	73.4	75.6	+2.2
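
The Δ column is an absolute difference in percentage points, not a relative percentage. A quick arithmetic check over the rows above (values copied from the table; the helper names `delta_points` and `relative_gain_pct` are illustrative):

```python
# (benchmark and metric, baseline, this paper, reported delta)
rows = [
    ("TruthfulQA Gen %Truth", 79.11, 92.78, 13.67),
    ("TruthfulQA Gen %Info", 66.46, 85.19, 18.73),
    ("TriviaQA EM", 67.29, 67.49, 0.20),
    ("Natural Questions F1", 33.63, 35.42, 1.79),
    ("TruthfulQA Gen %Truth (3B)", 72.03, 91.39, 19.36),
    ("StrategyQA %Acc", 68.47, 69.87, 1.40),
    ("GSM8K %Acc", 73.4, 75.6, 2.2),
]


def delta_points(baseline: float, ours: float) -> float:
    """Absolute difference in percentage points."""
    return round(ours - baseline, 2)


def relative_gain_pct(baseline: float, ours: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return round(100 * (ours - baseline) / baseline, 2)


mismatches = [name for name, base, ours, reported in rows
              if delta_points(base, ours) != reported]
truth_relative = relative_gain_pct(79.11, 92.78)
```

Every reported Δ matches the absolute difference. Expressed relative to the baseline, the 8B %Truth gain would be about 17.28%, which is why the 13.67 figure should be read as points.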

Experiment Figures

Comparison of logit distributions between the original model, early exit (DoLa), and layer-pruned model.

Conceptual comparison of logits and decoding outcomes for a specific question about the British Prime Minister.

Main Takeaways

PruneCD consistently outperforms DoLa and other baselines across diverse benchmarks (factuality, reasoning, instruction following) and models (Llama at 1B, 3B, and 8B, plus Mistral-7B).
Unlike DoLa, which produces flat and uninformative amateur logits (high entropy, low overlap), PruneCD produces logits that retain meaningful structure while being sufficiently distinct from the expert.
The method is robust to hyperparameter settings; fixed-parameter experiments show it still beats baselines that require per-task tuning.
Qualitative analysis shows PruneCD often corrects common misconceptions (e.g., origin of fortune cookies) where baselines fail or refuse to answer.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Decoding (CD)
Transformer architecture (layers, logits)
Logit analysis (entropy, top-k overlap)
Jensen-Shannon Divergence (used in baselines)

Key Terms

early exit: A technique where model predictions are generated from intermediate layers rather than the final layer

contrastive decoding: A decoding strategy that subtracts the log-probabilities of an 'amateur' (weaker) model from an 'expert' (stronger) model to penalize common errors

logits: The raw, unnormalized scores output by the final layer of a neural network before the softmax function converts them to probabilities

DoLa: Decoding by Contrasting Layers—a baseline method that uses early exit layers as the amateur model for contrastive decoding

Jensen-Shannon divergence: A symmetric, bounded measure of how much two probability distributions differ, computed as the average KL divergence of each distribution from their mixture

MC score: Multiple Choice score—used here to measure accuracy on QA benchmarks like TruthfulQA-MC

SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, a pruning method used here to filter candidate layers based on perplexity impact

batched inference: Processing multiple inputs (or model configurations) simultaneously in one GPU operation to save time

informativeness: A metric defined in the paper measuring the overlap between the top-k tokens of the amateur and expert models; this logit-level measure is distinct from the TruthfulQA %Info score listed under Metrics

flatness: A property of probability distributions measured by entropy; high flatness means the distribution is near-uniform/uncertain
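
The logit-analysis terms above can be made concrete with a few lines of code. This is a minimal sketch: the exact definitions used in the paper may differ (for example in how top-k overlap is normalized), and all logit values are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p: List[float]) -> float:
    """Shannon entropy in nats; higher means flatter."""
    return -sum(x * math.log(x) for x in p if x > 0)


def top_k_overlap(a_logits: List[float], e_logits: List[float], k: int) -> float:
    """Fraction of the expert's top-k tokens that are also in the amateur's top-k."""
    def top(xs: List[float]) -> set:
        return set(sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k])
    return len(top(a_logits) & top(e_logits)) / k


def js_divergence(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence in nats (0 for identical inputs, at most log 2)."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: List[float], y: List[float]) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


expert = [4.0, 3.0, 0.0, -1.0]       # confident, structured
flat = [0.0, 0.01, 0.03, 0.02]       # near-uniform, as described for early exit
structured = [3.0, 0.0, 2.5, -1.0]   # less sharp but still informative

flat_entropy = entropy(softmax(flat))
expert_entropy = entropy(softmax(expert))
flat_overlap = top_k_overlap(flat, expert, k=2)
structured_overlap = top_k_overlap(structured, expert, k=2)
```

Here the near-uniform amateur has entropy close to the maximum of log 4 and shares none of the expert's top-2 tokens, while the structured amateur shares one of them.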