Entropy-Based Decoding for Retrieval-Augmented Large Language Models

📝 Paper Summary

Modularized RAG pipeline, Hallucination suppression

A training-free decoding method that weights retrieved documents by their entropy (uncertainty) and contrasts this against the model's internal parametric knowledge to prioritize factual external information.

Core Problem

RAG systems are easily distracted: irrelevant retrieved documents confuse the model, relevant documents buried mid-context are overlooked ('lost in the middle'), and internal parametric knowledge often overrides correct external evidence during generation.

Why it matters:

LLMs frequently hallucinate or produce outdated information despite having access to correct retrieved documents
Current methods to fix this require expensive fine-tuning or training, which is impractical in resource-constrained environments
Naive RAG concatenation fails when the correct document is buried in the middle of retrieved distractors

Concrete Example: In a 'lost in the middle' scenario where the correct answer document is surrounded by distracting documents, a standard LLM often fails to generate the correct answer. The paper shows naive RAG performance drops significantly unless the oracle document is at the very start or end.

Key Novelty

Entropy-Based Decoding (CLeHe)

Uses 'Low-entropy Ensemble' (LeEns) to weight retrieved documents: if a document makes the LLM less uncertain (lower entropy) about the next token, it gets a higher vote
Applies 'Contrastive Decoding' by subtracting the logits of a high-entropy internal layer (representing ambiguous parametric knowledge) from the ensemble logits to suppress hallucinations

Architecture

The CLeHe pipeline: Parallel processing of retrieved documents, entropy-based weighting, and contrastive decoding against internal knowledge.

Evaluation Highlights

Surpasses Naive RAG by significant margins across Llama-2 (7B/13B), Mistral-7B, and Llama-3-8B on NQ, TriviaQA, WebQ, and PopQA
Remains robust in synthetic 'lost in the middle' tests, where naive concatenation degrades sharply whenever the oracle document is not at the edges
Outperforms RePlug (a retriever-score-based ensemble baseline) in almost all settings, suggesting that model uncertainty is a better weighting signal than retrieval scores

Breakthrough Assessment

7/10

Strong practical contribution for training-free RAG improvement. Effectively addresses both retrieval noise and parametric memory conflict without requiring any model updates, though the scope is limited to decoding time.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where a query x and retrieved documents D are used to generate response y

Inputs: Query x, set of retrieved documents D

Outputs: Generated response y (token by token)

Pipeline Flow

Parallel Document Processing (LeEns)
Layer Selection & Contrast (CLeHe)

System Modules

Document-Parallel Decoder (Parallel Document Processing (LeEns))

Process the query + each retrieved document independently to get next-token distributions

Model or implementation: Target LLM (e.g., Llama-2-7B)

Entropy-Based Ensembler (Parallel Document Processing (LeEns))

Aggregate logits from all documents using entropy-based weights

Model or implementation: Algorithm (Weighted Average)
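
The entropy-based weighting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the document weights are a softmax over negative entropies scaled by a temperature tau, and that the ensemble averages log-probabilities. The function names and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entropy(logprobs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def low_entropy_ensemble(per_doc_logits, tau=0.25):
    """Combine one next-token logit vector per document.

    Assumption: weight_j = softmax(-H_j / tau), so documents that leave the
    model less uncertain receive a larger share of the vote.
    """
    per_doc_logprobs = [log_softmax(l) for l in per_doc_logits]
    entropies = [entropy(lp) for lp in per_doc_logprobs]
    weights = [math.exp(w) for w in log_softmax([-h / tau for h in entropies])]
    vocab_size = len(per_doc_logits[0])
    # Weighted average of log-probabilities, then renormalize.
    mixed = [
        sum(w * lp[v] for w, lp in zip(weights, per_doc_logprobs))
        for v in range(vocab_size)
    ]
    return weights, log_softmax(mixed)

# Toy example: document 0 makes the model confident, document 1 does not.
weights, ensemble_logprobs = low_entropy_ensemble(
    [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
)
```

In this toy case the confident document dominates the ensemble, which is the behaviour the summary attributes to LeEns. A smaller tau sharpens the weighting toward the single lowest-entropy document.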

Internal Knowledge Estimator (Layer Selection & Contrast (CLeHe))

Identify the model's internal uncertainty without external context

Model or implementation: Target LLM layers

Contrastive Adjuster (Layer Selection & Contrast (CLeHe))

Refine final distribution by contrasting ensemble against internal knowledge

Model or implementation: Algorithm (Subtraction)
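
A minimal sketch of the layer selection and contrast step follows. It rests on several assumptions that this summary does not confirm: per-layer next-token logits are taken as given (in practice they would come from applying the output head to intermediate hidden states for the query without documents), and the contrast uses the common form (1 + beta) * log p_ensemble - beta * log p_internal. All names and toy values are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def entropy(logprobs):
    """Shannon entropy (in nats) of a distribution given as log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in logprobs)

def contrast_with_highest_entropy_layer(ensemble_logprobs, layer_logits, beta=0.25):
    """Pick the candidate layer with the highest entropy and contrast against it.

    Assumed scoring rule: (1 + beta) * ensemble - beta * internal, renormalized.
    """
    layer_logprobs = [log_softmax(l) for l in layer_logits]
    chosen = max(
        range(len(layer_logprobs)),
        key=lambda i: entropy(layer_logprobs[i]),
    )
    internal = layer_logprobs[chosen]
    scores = [
        (1.0 + beta) * e - beta * a
        for e, a in zip(ensemble_logprobs, internal)
    ]
    return chosen, log_softmax(scores)

# Toy example: layer 1 is the more uncertain one and mildly prefers token 1.
ensemble = log_softmax([2.0, 1.0, 0.0])
candidate_layers = [[4.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
chosen_layer, final_logprobs = contrast_with_highest_entropy_layer(
    ensemble, candidate_layers
)
```

Because the selection runs at every decoding step, the contrasted layer can change from token to token, which is what distinguishes this from contrasting against a fixed layer.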

Novel Architectural Elements

Dynamic layer selection for contrastive decoding: selects the specific layer with highest entropy at each step to represent 'ambiguous' internal knowledge
Entropy-weighted product-of-experts ensemble for integrating multiple retrieved documents during decoding

Modeling

Base Model: Llama-2-7B, Llama-2-13B, Mistral-7B-v0.1, Llama-3-8B

Compute: Inference only. Decoding time increases by a factor of less than 1.18 compared to Naive RAG.

Comparison to Prior Work

vs. Naive RAG: Uses parallel decoding and ensemble weighting instead of concatenation, avoiding 'lost in the middle'
vs. RePlug: Weights by output entropy (dynamic/internal confidence) rather than retriever score (static/external)
vs. CAD: Contrasts against the dynamic *highest entropy* layer rather than the fixed last layer to avoid overconfidence issues
vs. DoLa: Focuses on contrasting external vs. internal knowledge, whereas DoLa contrasts layers within the same context-free generation

Limitations

Inference cost is slightly higher than Naive RAG due to parallel processing of documents (though parallelizable)
Requires access to internal model layers and logits, making it inapplicable to black-box APIs
Performance gains from the contrastive component (CLeHe) over the ensemble component (LeEns) are minimal for stronger models like Llama-3-8B

Reproducibility

Code: https://github.com/zexuanqiu/entropy-based-decoding

Code is publicly available. Hyperparameters provided: tau (0.1 or 0.25) and beta (0.25 or 5.0) vary by model. Candidate layers for contrast specified (e.g., 17-32 for 7B models). Validation set from WebQ used for hyperparameter tuning.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia passages retrieved by DPR

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
WebQ (Open-domain QA)
PopQA (Long-tail entity QA)

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
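
For readers unfamiliar with the metric, Exact Match can be sketched as below. The summary does not state which normalization the paper uses, so this assumes the SQuAD-style convention common in open-domain QA (lowercase, strip punctuation and articles, collapse whitespace) and counts a prediction as correct if it equals any gold answer. Function names are hypothetical.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

def em_score(predictions, references):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Some RAG evaluations instead check whether a gold answer appears anywhere in the generated text; that variant is more lenient and is not what this sketch computes.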

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on 4 datasets showing CLeHe superiority over Naive RAG and RePlug baselines (Top-5 retrieval).
Natural Questions (NQ)	EM	26.54	32.05	+5.51
TriviaQA	EM	52.75	58.11	+5.36
PopQA	EM	30.07	33.72	+3.65
Natural Questions (NQ)	EM	40.33	42.99	+2.66
Ablation of layer selection strategy for contrastive decoding (Llama-2-7B on NQ).
Natural Questions (NQ)	EM	29.97	32.05	+2.08

Experiment Figures

Performance of Naive vs LeEns decoding as the position of the Oracle document changes among 20 retrieved documents ('Lost in the Middle' test).

Comparison of retrieval scores vs entropy scores for Oracle and Distractor documents.

Main Takeaways

LeEns (Entropy-based Ensemble) consistently outperforms Naive RAG and RePlug, showing that model uncertainty is a better weight than retrieval score or uniform averaging.
The 'Lost in the Middle' phenomenon is effectively mitigated; parallel processing ensures performance is independent of oracle document position.
Contrastive decoding (CLeHe) adds further gains primarily on smaller/weaker models (Llama-2), while stronger models (Llama-3, Mistral) benefit mostly from the ensemble step (LeEns) alone.
Selecting the highest entropy layer for contrast works better than using the last layer, preventing overconfidence issues during contrastive decoding.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Language Model Decoding (Logits, Softmax)
Entropy (Information Theory)
Contrastive Decoding

Key Terms

CLeHe: Contrasting Low-entropy distribution with High-entropy distribution—the paper's proposed method combining ensemble weighting and contrastive decoding

LeEns: Low-entropy Ensemble—the first component of the method that weights documents based on the entropy of the LLM's generation distribution given that document

Parametric Knowledge: Knowledge stored in the model's pre-trained weights (often outdated or hallucinated) as opposed to external retrieved context

Contrastive Decoding: A technique to adjust generation probabilities by maximizing the difference between a desired distribution (expert) and an undesired one (amateur/noise)

Lost in the middle: A phenomenon where LLMs fail to use relevant information if it appears in the middle of a long context window surrounded by irrelevant text

Pointwise Mutual Information (PMI): A measure used here to quantify the information gain provided by the external documents relative to the model's internal prior
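
As a reference point, the standard conditional PMI between the next token and the retrieved documents can be written as below. The symbols follow the Problem Definition (query x, documents D, response tokens y); how the paper instantiates the numerator and denominator is not spelled out in this summary, so reading them as the document-conditioned ensemble and the internal, document-free distribution is an assumption.

```latex
\mathrm{pmi}\left(y_t ; D \mid x, y_{<t}\right)
  = \log \frac{p\left(y_t \mid x, D, y_{<t}\right)}{p\left(y_t \mid x, y_{<t}\right)}
  = \log p\left(y_t \mid x, D, y_{<t}\right) - \log p\left(y_t \mid x, y_{<t}\right)
```

The second form makes the link to contrastive decoding visible: subtracting the internal log-probability from the document-conditioned one rewards tokens that the documents make more likely than the model's prior does.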

Logit: The raw, unnormalized output score from the neural network before applying the softmax function

Product-of-experts: An ensemble method where probabilities from different sources are combined by multiplying them (or averaging their logits) rather than averaging probabilities
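
The difference between the two ways of combining distributions can be illustrated with a small sketch. The function names and the toy distributions are hypothetical, and the product-of-experts variant here assumes strictly positive probabilities so that the logarithm is defined.

```python
import math

def mixture(dists, weights):
    """Weighted arithmetic mean of probabilities (mixture of experts)."""
    vocab_size = len(dists[0])
    return [
        sum(w * d[v] for w, d in zip(weights, dists))
        for v in range(vocab_size)
    ]

def product_of_experts(dists, weights):
    """Weighted geometric mean of probabilities, renormalized.

    Equivalent to averaging log-probabilities and applying a softmax.
    Assumes every probability is strictly positive.
    """
    vocab_size = len(dists[0])
    log_scores = [
        sum(w * math.log(d[v]) for w, d in zip(weights, dists))
        for v in range(vocab_size)
    ]
    m = max(log_scores)
    unnormalized = [math.exp(s - m) for s in log_scores]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

expert_a = [0.90, 0.05, 0.05]
expert_b = [0.40, 0.59, 0.01]
mix = mixture([expert_a, expert_b], [0.5, 0.5])
poe = product_of_experts([expert_a, expert_b], [0.5, 0.5])
```

With these toy inputs the product-of-experts result puts more mass on the token both experts find plausible than the mixture does, because a token needs support from every expert to score well under multiplication.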

Naive RAG: A baseline approach where retrieved documents are simply concatenated with the query into a single prompt