Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection

📝 Paper Summary

Modularized RAG pipeline

Information Gain Pruning (IGP) replaces relevance-based reranking with a generator-aligned utility signal, measured as the reduction in uncertainty (entropy) of the generator's next-token distribution, and uses that signal to filter out weak or harmful evidence before truncation.

Core Problem

In RAG systems with limited context budgets, standard relevance metrics (e.g., NDCG) often correlate weakly or negatively with end-to-end generation quality because highly relevant documents can introduce redundancy, ambiguity, or conflicts that destabilize the generator.

Why it matters:

Improving retrieval relevance does not reliably improve answer quality (Relevance-Utility Mismatch).
Injecting multiple pieces of evidence often introduces noise that consumes budget without adding marginal utility.
Existing rerankers focus on relevance rather than the generator's actual need for information to resolve uncertainty.

Key Novelty

Generator-Aligned Information Gain Pruning (IGP)

Define 'Information Gain' (IG) as the reduction in the generator's normalized uncertainty (entropy over Top-K tokens) when conditioned on a candidate passage versus no context.
Rerank passages by IG and apply a pruning threshold to discard negative-utility or weak-utility passages before the final Top-M truncation.
The method is label-free, training-free, and parameter-free, relying only on black-box access to step-wise logits.

Architecture

Comparison of the standard relevance reranking pipeline with IGP. IGP adds an 'Evidence Utility Assessment' step that separates 'Helpful' from 'Harmful' docs based on uncertainty reduction.

Evaluation Highlights

Relevance-Utility Mismatch: On NQ, higher NDCG@5 often correlates with lower F1 (Spearman = -0.54), indicating that relevance and utility can diverge.
Multi-evidence (TopM=5) Win-Win: On average across 5 datasets (BM25 retriever), IGP(0.05) improves F1 from ~0.288 (BM25) to ~0.322 while reducing input tokens from ~836 to ~202 (approx. 76% reduction).
Tight Budget (TopM=1): IGP(0.05) improves average F1 from ~0.221 to ~0.312 compared to BM25 baseline.
Scale Efficiency: Qwen-1.5B with IGP outperforms Qwen-7B without IGP on NQ (F1 ~0.2 vs ~0.17).
Robustness: Improvements persist across different retrievers (BM25 vs Contriever) and generator families (Qwen2.5 vs Llama-3.x).

Breakthrough Assessment

7/10

The paper identifies a critical misalignment in standard RAG (relevance vs. utility) and provides a highly practical, training-free solution that significantly improves the cost-quality Pareto frontier. While the core idea of entropy reduction isn't theoretically new, its application as a pre-generation pruning mechanism for RAG is impactful and deployment-friendly.

⚙️ Technical Details

Pipeline Flow

Retrieve N candidate passages (e.g., using BM25 or Contriever).
For each candidate d, compute Information Gain: IG(d) = uncertainty with no context - uncertainty conditioned on d.
Rerank candidates by IG.
Prune candidates where IG < Threshold (Tp).
Truncate remaining list to budget M (Top-M).
Generate final answer using selected evidence.
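The rerank, prune, and truncate steps above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the function name `igp_select`, the dictionary of precomputed uncertainties, and the example values are hypothetical, and the default threshold of 0.05 assumes that the number in "IGP(0.05)" denotes the pruning threshold Tp.

```python
def igp_select(candidates, nu_without_context, nu_with_passage,
               threshold=0.05, top_m=5):
    """Rerank by information gain, prune weak passages, truncate to top_m.

    candidates:          list of passage identifiers (hypothetical input format)
    nu_without_context:  generator uncertainty with no evidence
    nu_with_passage:     dict mapping passage -> uncertainty when conditioned on it
    threshold:           pruning threshold Tp (assumed to match IGP(0.05))
    """
    # Information gain: how much the passage lowers the generator's uncertainty.
    scored = [(p, nu_without_context - nu_with_passage[p]) for p in candidates]

    # Rerank: highest information gain first.
    scored.sort(key=lambda item: item[1], reverse=True)

    # Prune: drop passages whose gain falls below the threshold,
    # which also removes passages with negative gain.
    kept = [(p, ig) for p, ig in scored if ig >= threshold]

    # Truncate: keep at most top_m passages for the context budget.
    return kept[:top_m]


# Toy values for illustration only; they are not results from the paper.
toy_uncertainty = {"a": 0.2, "b": 0.58, "c": 0.7, "d": 0.4}
selected = igp_select(["a", "b", "c", "d"], 0.6, toy_uncertainty)
selected_passages = [p for p, _ in selected]
```

Because pruning happens before truncation, the selected list can be shorter than `top_m`, which is where the token savings would come from in this sketch.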

System Modules

Uncertainty Estimator

Calculate Normalized Uncertainty (NU) using Top-K token entropy averaged over a greedy rollout.

Model or implementation: Black-box LLM (Qwen2.5/Llama-3.x)
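One way such an estimator might be written is sketched below. It assumes that "normalized" means dividing the Top-K entropy by log K so the score lies in [0, 1], that probabilities are renormalized over the Top-K tokens, and that step-wise logits from a greedy rollout are already available as plain lists of floats. The function names and the default `k=10` are illustrative choices, not details taken from the paper.

```python
import math


def topk_normalized_entropy(logits, k=10):
    """Entropy over the k largest logits, scaled to [0, 1] by log(k).

    Assumption: probabilities are renormalized over the top-k tokens only.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        return 0.0  # a single option carries no uncertainty

    # Numerically stable softmax restricted to the top-k logits.
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]

    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(top))


def normalized_uncertainty(step_logits, k=10):
    """Average the per-step score over the decoding steps of one rollout."""
    if not step_logits:
        return 0.0
    scores = [topk_normalized_entropy(logits, k) for logits in step_logits]
    return sum(scores) / len(scores)


# Toy logits for illustration: one flat step and one sharply peaked step.
flat_step = [0.0, 0.0, 0.0, 0.0]
peaked_step = [100.0, 0.0, 0.0, 0.0]
rollout_score = normalized_uncertainty([flat_step, peaked_step], k=4)
```

Under these assumptions a flat distribution scores near 1 and a sharply peaked one scores near 0, so a passage that makes the generator more decisive lowers the averaged score.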

IGP Module

Rerank and Prune

Model or implementation: Algorithm 1

📊 Experiments & Results

Evaluation Setup

Open-domain QA on 5 benchmarks using a Retrieve-Rerank-Truncate pipeline.

Benchmarks:

Natural Questions (NQ) (QA)
TriviaQA (QA)
PopQA (QA)
SQuAD (QA)
AmbigQA (QA)

Metrics:

Token-level F1
Average Input Tokens (TK)
Normalized Token Efficiency (NTE)
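For reference, token-level F1 is commonly computed as the harmonic mean of token precision and recall between the predicted and reference answers. The sketch below assumes lowercase whitespace tokenization; the paper's exact answer normalization (for example punctuation or article stripping) is not stated in this summary, so the function `token_f1` is an illustrative approximation.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Toy example: one of two predicted tokens matches a one-token reference.
example_score = token_f1("Paris France", "Paris")
```

In the toy example precision is 0.5 and recall is 1.0, which shows how extra tokens in a verbose answer reduce the score even when the correct answer is present.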

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Average of 5 Datasets	F1 (TopM=5)	0.288	0.322	+0.034 (~11.8%)
Average of 5 Datasets	F1 (TopM=1)	0.221	0.312	+0.091 (~41%)
Average of 5 Datasets	F1 (TopM=5)	0.286	0.322	+0.036
NQ (TopM=5)	Spearman Correlation (NDCG vs F1)	N/A	-0.54	N/A
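The relative changes quoted alongside these figures follow directly from the absolute values. The short check below recomputes them; the helper name `relative_change` is illustrative, and the token counts are the approximate averages (~836 and ~202) reported in the evaluation highlights.

```python
def relative_change(baseline, new):
    """Fractional change from baseline to new (positive means an increase)."""
    return (new - baseline) / baseline


# Average F1 over 5 datasets, TopM=5: 0.288 -> 0.322
f1_gain_top5 = round(relative_change(0.288, 0.322) * 100, 1)

# Average F1 over 5 datasets, TopM=1: 0.221 -> 0.312
f1_gain_top1 = round(relative_change(0.221, 0.312) * 100, 1)

# Average input tokens: ~836 -> ~202 (a negative value means a reduction)
token_change = round(relative_change(836, 202) * 100, 1)
```

These work out to roughly +11.8% and +41.2% for F1 and about -75.8% for input tokens, consistent with the rounded figures in the summary.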

Experiment Figures

Scatter plots showing weak or negative correlation between NDCG (Retrieval metric) and F1 (Generation metric) on NQ, motivating the need for utility-based selection.
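A rank correlation like the one behind these plots can be computed as a Pearson correlation over ranks. The sketch below is a plain-Python illustration with tie handling by average rank; the function names and the toy score lists are hypothetical and do not reproduce the paper's data.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole group of tied values.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Raises ValueError if either input is constant (correlation undefined).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy per-query scores for illustration only: retrieval quality rises
# while answer quality falls, giving a perfectly negative correlation.
toy_ndcg = [0.2, 0.4, 0.6, 0.8]
toy_f1 = [0.5, 0.4, 0.3, 0.1]
toy_rho = spearman(toy_ndcg, toy_f1)
```

A negative value, as reported for NQ, means queries with better-ranked retrievals tended to receive worse answers.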

Pareto frontier of Quality (F1) vs Cost (Input Tokens). IGP shifts the frontier to the upper-left (better quality, lower cost) compared to baselines like BM25, CE, BGE.

Scaling behavior with and without IGP. Qwen-1.5B with IGP matches or exceeds Qwen-7B without IGP, suggesting that better context selection can partly compensate for smaller model size.

Main Takeaways

Relevance is a poor proxy for utility in budgeted RAG, especially when multiple passages are retrieved.
Information Gain (uncertainty reduction) is a robust signal for evidence selection that aligns with generator needs.
Pruning (admission control) is more effective than reranking alone because it prevents relevant but noisy or redundant evidence from consuming context budget.
With higher-quality, less noisy context, a smaller model can outperform a larger one: on NQ, Qwen-1.5B with IGP beats Qwen-7B without it.