InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering

📝 Paper Summary

Modularized RAG pipeline

InfoGain-RAG introduces a metric called Document Information Gain (DIG) to quantify how much a retrieved document increases an LLM's confidence in the correct answer, training a reranker to filter and sort documents accordingly.

Core Problem

Current RAG frameworks struggle to identify whether retrieved documents actually contribute to correct answer generation, often retrieving irrelevant or misleading content based solely on semantic similarity.

Why it matters:

Semantic similarity does not equate to utility; high-similarity documents can still be unhelpful or distracting for the generation task
Existing rerankers (like BGE) focus on fine-grained semantic matching rather than the downstream impact on generation quality
Self-reflection methods that do assess utility require multiple expensive LLM calls, making them computationally impractical for real-time applications

Concrete Example: A document might be semantically similar to a query about a movie plot but contain outdated or slightly incorrect details. A standard reranker prioritizes it due to keyword overlap, confusing the generator. InfoGain-RAG would score it negatively if it lowers the model's confidence in the true answer compared to using no document.

Key Novelty

Document Information Gain (DIG) Metric & Multi-Task Reranker

Defines a metric (DIG) that calculates the difference in an LLM's confidence for the ground-truth answer when a specific document is present versus absent
Trains a lightweight reranker using a multi-task objective: classifying whether a document provides positive/negative gain and ranking documents by their gain magnitude
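
The DIG definition above can be sketched as follows. This is a minimal illustration, assuming the LLM's confidence in the ground-truth answer is summarized as the length-normalized (geometric-mean) probability of its answer tokens. The function names and the aggregation choice are assumptions for illustration; the paper's exact confidence estimator, including its token-importance weighting, may differ.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the ground-truth answer tokens.

    Assumption: confidence is a length-normalized product of token
    probabilities. This is an illustrative choice, not the paper's
    confirmed formula.
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def document_information_gain(
    logprobs_with_doc: Sequence[float],
    logprobs_without_doc: Sequence[float],
) -> float:
    """DIG = confidence(answer | query, doc) - confidence(answer | query)."""
    return answer_confidence(logprobs_with_doc) - answer_confidence(logprobs_without_doc)


# Hypothetical per-token log-probabilities of the same ground-truth answer.
query_only = [math.log(0.3), math.log(0.2)]

helpful = document_information_gain(
    [math.log(0.9), math.log(0.8)],  # document raises confidence
    query_only,
)
misleading = document_information_gain(
    [math.log(0.1), math.log(0.1)],  # document lowers confidence
    query_only,
)
```

Under these toy numbers the helpful document gets a positive DIG and the misleading one a negative DIG, which is the sign convention the reranker is trained to predict.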

Architecture

The complete pipeline of InfoGain-RAG, including the data collection process (calculating DIG) and the inference phase with the trained reranker.

Evaluation Highlights

+17.9% Exact Match improvement on NaturalQA using LLaMA-3.1-405B compared to naive RAG
Outperforms the state-of-the-art proprietary reranker GTE-7B by 3.4% on NaturalQA despite being 20x smaller (355M parameters)
Achieves 83.4% accuracy on Fact Verification (FM2) with Qwen2.5-72B, compared to 73.6% for naive RAG

Breakthrough Assessment

8/10

Significant performance jumps over both naive RAG and massive state-of-the-art rerankers (GTE-7B) using a much smaller, efficiently trained model. The metric directly aligns retrieval with generation success.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation for Open-Domain QA and Fact Verification

Inputs: Query x

Outputs: Answer y

Pipeline Flow

Retriever (fetches top-k documents)
InfoGain Reranker (predicts DIG scores, reorders, and filters documents)
Generator (LLM produces answer using filtered context)
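
The three stages above can be sketched as a single function. This is a hedged outline only: the callables, the `top_k` default, and the `dig_threshold` default are hypothetical placeholders, not values or interfaces taken from the paper.

```python
from typing import Callable, List


def infogain_rag_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    predict_dig: Callable[[str, str], float],
    generate: Callable[[str, List[str]], str],
    top_k: int = 10,             # hypothetical default
    dig_threshold: float = 0.0,  # hypothetical default
) -> str:
    """Retrieve, score by predicted DIG, filter, reorder, then generate."""
    candidates = retrieve(query, top_k)
    scored = [(predict_dig(query, doc), doc) for doc in candidates]
    # Drop documents whose predicted gain falls below the threshold.
    kept = [pair for pair in scored if pair[0] >= dig_threshold]
    # Highest predicted gain first.
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return generate(query, [doc for _, doc in kept])


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_retrieve(query: str, k: int) -> List[str]:
    corpus = [
        "France borders Spain.",
        "Lyon is the capital of France.",
        "Paris is the capital of France.",
    ]
    return corpus[:k]


def toy_predict_dig(query: str, doc: str) -> float:
    scores = {
        "France borders Spain.": 0.1,
        "Lyon is the capital of France.": -0.4,
        "Paris is the capital of France.": 0.8,
    }
    return scores[doc]


def toy_generate(query: str, context: List[str]) -> str:
    # A real generator would prompt an LLM; here we just echo the context.
    return " | ".join(context)


answer = infogain_rag_answer(
    "What is the capital of France?", toy_retrieve, toy_predict_dig, toy_generate
)
```

Keeping the reranker behind a plain scoring callable reflects the plug-and-play design: the retriever and generator can be swapped without retraining either.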

System Modules

Retriever

Retrieve initial candidate documents

Model or implementation: Contriever (dense retriever)

InfoGain Reranker

Predict DIG scores for documents and filter out those below a threshold

Model or implementation: RoBERTa-large (355M parameters)

Generator

Generate the final answer using the optimized context

Model or implementation: Various LLMs (e.g., Qwen, LLaMA, GPT-4)

Novel Architectural Elements

Integration of a specialized reranker trained specifically on 'Document Information Gain' (prediction of generation confidence delta) rather than semantic similarity labels

Modeling

Base Model: RoBERTa-large (for the reranker)

Training Method: Multi-task learning combining Cross-Entropy and Margin Ranking Loss

Objective Functions:

Purpose: Distinguish helpful documents from harmful/irrelevant ones via binary classification.

Formally: Cross-Entropy loss minimizing -log(p(y|x,d)), where labels are derived from DIG thresholds.

Purpose: Optimize the relative ordering of documents based on their DIG values.

Formally: Margin-based ranking loss (derived from Circle Loss) enforcing that positive document pairs score higher than negative pairs by a margin.
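
A rough sketch of how the two objectives might be combined, written in plain Python for readability. The ranking term uses the Circle-Loss-style form softplus(LogSumExp(...)) over all positive/negative pairs. The `margin` and `scale` values are hypothetical, and treating `beta` as the weight on the cross-entropy term is an assumption; the summary only reports that the weight is 0.75.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logsumexp(values: Sequence[float]) -> float:
    """Stable log(sum(exp(v))) via the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def binary_cross_entropy(logit: float, label: int) -> float:
    """-log p(label | query, doc), with p(1) = sigmoid(logit)."""
    # -log(sigmoid(z)) = softplus(-z); -log(1 - sigmoid(z)) = softplus(z)
    return softplus(-logit) if label == 1 else softplus(logit)


def margin_ranking_loss(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    margin: float = 0.1,  # hypothetical value
    scale: float = 1.0,   # hypothetical value
) -> float:
    """log(1 + sum over pairs of exp(scale * (neg - pos + margin)))."""
    pair_terms = [
        scale * (n - p + margin) for p in pos_scores for n in neg_scores
    ]
    return softplus(logsumexp(pair_terms))


def multitask_loss(ce: float, rank: float, beta: float = 0.75) -> float:
    """Assumed combination: beta * CE + (1 - beta) * ranking loss."""
    return beta * ce + (1.0 - beta) * rank


good_order = margin_ranking_loss(pos_scores=[2.0], neg_scores=[0.0])
bad_order = margin_ranking_loss(pos_scores=[0.0], neg_scores=[2.0])
```

The ranking term shrinks toward zero as every positive document outscores every negative one by more than the margin, while the cross-entropy term keeps the absolute scores calibrated enough to support threshold-based filtering.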

Training Data:

110K queries sampled from TriviaQA
DIG scores calculated using Qwen2.5-7B as the reference LLM
Triplets categorized into Positive (DIG > 0.5), Negative (DIG < -0.2), and Negligible (-0.05 to 0.05)
Final training set: 88K samples (balanced for CE loss, grouped for margin loss)
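
The thresholding step can be illustrated with a small labeling helper. The cutoffs come from the list above; treating the negligible band as inclusive and discarding scores that fall between bands are assumptions made for this sketch.

```python
from typing import Optional

POSITIVE_MIN = 0.5      # DIG > 0.5 is labeled positive
NEGATIVE_MAX = -0.2     # DIG < -0.2 is labeled negative
NEGLIGIBLE_BAND = 0.05  # -0.05 to 0.05 is labeled negligible


def dig_category(dig: float) -> Optional[str]:
    """Map a DIG score to a training category, or None if ambiguous."""
    if dig > POSITIVE_MIN:
        return "positive"
    if dig < NEGATIVE_MAX:
        return "negative"
    if -NEGLIGIBLE_BAND <= dig <= NEGLIGIBLE_BAND:
        return "negligible"
    # Assumption: scores between the bands are left out of training.
    return None


labels = [dig_category(d) for d in (0.7, -0.3, 0.0, 0.3)]
```

Leaving gaps between the bands keeps weakly informative documents from contributing noisy labels to either class.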

Key Hyperparameters:

learning_rate: 5e-6
batch_size: Not reported in the paper
loss_weight_beta: 0.75
DIG_threshold_b1: 0.5
DIG_threshold_b2: -0.2
token_importance_k: 3
token_importance_weight: 0.8

Compute: Trained on a single A800 GPU

Comparison to Prior Work

vs. BGE/GTE: Optimizes for 'generation confidence gain' rather than semantic similarity or relevance labels
vs. Self-RAG: Uses a lightweight external reranker instead of expensive self-reflection calls during inference
vs. RePlug: Trains a reranker/filter rather than fine-tuning the retriever itself, allowing plug-and-play use with multiple retrievers

Limitations

Relies on a specific reference LLM (Qwen2.5-7B) to generate training labels (DIG scores), though experiments show transferability
Requires ground truth answers to calculate DIG for training data
Thresholds for document filtering (DIG scores) are hyperparameters that may need tuning

Reproducibility

Datasets (TriviaQA, NaturalQA, PopQA, FM2) and base models (RoBERTa, Qwen, LLaMA) are public. Code URL is not provided in the paper text. Exact training batch size and duration are not reported.

📊 Experiments & Results

Evaluation Setup

Open-domain QA and Fact Verification using retrieved documents from Wikipedia

Benchmarks:

TriviaQA (Open-domain QA)
NaturalQA (Open-domain QA)
PopQA (Long-tail QA)
FM2 (Fact Verification)

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
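
For reference, Exact Match in open-domain QA is commonly computed after light answer normalization. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles); whether the paper applies exactly this normalization is an assumption.

```python
import re
import string
from typing import Iterable, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    target = normalize_answer(prediction)
    return any(target == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: Sequence[str], references: Sequence[Sequence[str]]) -> float:
    """Percentage of predictions that exactly match a gold answer."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = em_score(["The Eiffel Tower.", "Rome"], [["Eiffel Tower"], ["Milan"]])
```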

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
InfoGain-RAG consistently outperforms naive RAG, open-source BGE rerankers, and proprietary GTE rerankers across various LLM backbones on NaturalQA.
NaturalQA	Exact Match	40.3%	58.1%	+17.8%
NaturalQA	Exact Match	53.9%	58.1%	+4.2%
NaturalQA	Exact Match	35.8%	53.3%	+17.5%
Generalization to Fact Verification tasks shows significant improvements.
FM2	Exact Match	73.6%	83.4%	+9.8%
InfoGain-RAG outperforms self-reflective and retriever-optimization methods.
NaturalQA	Exact Match	49.5%	51.9%	+2.4%
NaturalQA	Exact Match	42.3%	54.3%	+12.0%

Experiment Figures

Performance comparison on multi-retriever settings (Contriever + BM25 + DPR) across four datasets.

Ablation study on the weight (beta) balancing Cross-Entropy loss and Margin Ranking loss.

Main Takeaways

Effectiveness across models: The reranker trained on DIG scores from one model (Qwen2.5-7B) generalizes effectively to improve generation for completely different models (LLaMA, GPT-4, Claude).
Efficiency vs. Performance: A 355M-parameter model outperforms a 7B-parameter state-of-the-art reranker, suggesting that the 'Information Gain' training signal is better aligned with generation quality than generic semantic-similarity training.
Multi-retriever robustness: InfoGain-RAG handles documents from multiple sources (Contriever, BM25, DPR) effectively, selecting the best content regardless of origin.
Filtering is crucial: The ability to filter out documents with negative DIG (misleading info) is a key driver of performance, not just reordering.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG) pipelines
Familiarity with cross-entropy loss and contrastive/ranking losses
Basic knowledge of LLM token probabilities and logits

Key Terms

DIG: Document Information Gain—a metric quantifying the change in an LLM's confidence for the ground truth answer when a document is added to the context

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

EM: Exact Match—whether the predicted answer is exactly identical to the ground-truth answer

GTE: General Text Embeddings—a state-of-the-art family of dense retrieval/reranking models

BGE: BAAI General Embedding—a popular open-source dense retrieval/reranking model

LogSumExp: Logarithm of the Sum of Exponentials—a mathematical function used to approximate the maximum of a set of numbers, often used in loss functions

Circle Loss: A loss function for metric learning that re-weights penalties based on the similarity scores of positive and negative pairs

Softplus: A smooth approximation of the ReLU function, used here to smooth the margin loss
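
The standard definitions behind the last few terms, with the bounds that make them smooth stand-ins for max and ReLU. The final identity shows one common way they appear together in Circle-Loss-style objectives; reading each z_i as a pairwise margin violation is an interpretive assumption rather than the paper's stated notation.

```latex
\operatorname{LSE}(z_1,\dots,z_n) = \log \sum_{i=1}^{n} e^{z_i},
\qquad
\max_i z_i \;\le\; \operatorname{LSE}(z_1,\dots,z_n) \;\le\; \max_i z_i + \log n

\operatorname{softplus}(x) = \log\!\left(1 + e^{x}\right) \;\ge\; \max(0, x)

\operatorname{softplus}\!\big(\operatorname{LSE}(z_1,\dots,z_n)\big)
  = \log\!\Big(1 + \sum_{i=1}^{n} e^{z_i}\Big)
```

The last expression is a differentiable surrogate for max(0, max_i z_i), so the gradient is dominated by the worst-violating pairs while still flowing to the rest.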