Learning to Filter Context for Retrieval-Augmented Generation

📝 Paper Summary

Modularized RAG pipeline

FILCO improves retrieval-augmented generation by training a model to filter retrieved passages down to minimal supporting sentences using lexical and information-theoretic measures.

Core Problem

Imperfect retrieval systems often return irrelevant or distracting content, causing generation models to hallucinate or rely on spurious correlations even when correct answers are present.

Why it matters:

Retrieval precision is often low (e.g., <5.0 unigram precision on NQ), overwhelming models with noise.
Models over-utilize negative passages or get distracted by irrelevant sentences within positive passages.
Feeding full retrieved passages increases computational cost and prompt length compared to filtering.

Concrete Example: Given the question 'When did the first train run in England?', the retriever returns a passage covering both the first railway in Belgium (1835) and England (1560s). Without filtering, the generator may be distracted by the 1835 date or by other wagonway details. FILCO removes the irrelevant 1835 sentence and keeps only the 1560s sentence, which helps the model answer correctly.

Key Novelty

Context Filtering via STRINC, LEXICAL, and CXMI measures (FILCO)

Train a dedicated Context Filter model to identify and select only useful sentences from retrieved passages before they reach the generator.
Create training data for the filter using three distinct oracle strategies: String Inclusion (exact match), Lexical Overlap (n-gram similarity), and Conditional Cross-Mutual Information (probability gain).
Filter context at a fine-grained sentence level rather than the coarse passage level used in prior work.
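
As a rough illustration of sentence-level filtering with the String Inclusion strategy, the sketch below keeps only the sentences that contain a reference answer string. The function name `strinc_filter`, the case-insensitive matching, and the two toy sentences are assumptions made for illustration; they are not taken from the paper or its code.

```python
from typing import List


def strinc_filter(sentences: List[str], answers: List[str]) -> List[str]:
    """Keep sentences that contain any reference answer as a substring.

    Case-insensitive matching is an assumption of this sketch.
    """
    lowered_answers = [a.lower() for a in answers]
    return [
        s for s in sentences
        if any(a in s.lower() for a in lowered_answers)
    ]


# Toy sentences loosely modelled on the train example (illustrative only).
sentences = [
    "The first railway in Belgium opened in 1835.",
    "Wagonways in England were in use by the 1560s.",
]
kept = strinc_filter(sentences, answers=["1560s"])
print(kept)
```

Because this oracle needs the reference answer, it can only label training data; at inference time the trained filter model has to predict the selection from the query and passages alone.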

Architecture

The FILCO pipeline demonstrating the two-step process: filtering context and then generating the answer.

Evaluation Highlights

+8.6 EM improvement on NaturalQuestions using Llama-2-7B compared to full-context baselines.
+6.2 Accuracy improvement on FEVER (Fact Verification) using Flan-T5-XL by removing distracting non-evidential content.
Reduces prompt length by 44-64% across tasks while maintaining or improving generation performance.

Breakthrough Assessment

7/10

Strong empirical results on filtering efficacy and token reduction. The comparison of three filtering strategies (STRINC, LEXICAL, and CXMI) provides valuable insights for different task types.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation where a generator M_gen produces output o given query q and retrieved passages P.

Inputs: Input query q and a set of retrieved passages P = {p_i}.

Outputs: Filtered context t_pred (subset of sentences from P) and final generated output o.

Pipeline Flow

Retrieval (DPR returns top-K passages)
Context Filter (M_ctx predicts useful sentences)
Generator (M_gen produces answer using filtered context)
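
The three stages above can be sketched as a simple composition of callables. This is a minimal sketch under stated assumptions: `run_pipeline`, the type aliases, and the toy stand-in components are hypothetical and only mimic the interfaces; a real system would plug in DPR, the trained filter, and the trained generator.

```python
from typing import Callable, List

# Hypothetical interfaces for the three stages.
Retriever = Callable[[str, int], List[str]]      # (query, top_k) -> passages
ContextFilter = Callable[[str, List[str]], str]  # (query, passages) -> filtered context
Generator = Callable[[str, str], str]            # (query, context) -> output


def run_pipeline(query: str, retrieve: Retriever, filter_context: ContextFilter,
                 generate: Generator, top_k: int = 5) -> str:
    passages = retrieve(query, top_k)          # 1. retrieval
    context = filter_context(query, passages)  # 2. context filtering
    return generate(query, context)            # 3. generation


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_retrieve(query: str, top_k: int) -> List[str]:
    corpus = ["Belgium opened a railway in 1835. England had wagonways by the 1560s."]
    return corpus[:top_k]


def toy_filter(query: str, passages: List[str]) -> str:
    sentences = [s.strip() for p in passages for s in p.split(". ") if s.strip()]
    return " ".join(s for s in sentences if "England" in s)


def toy_generate(query: str, context: str) -> str:
    return context  # a real generator would condition on query and context


answer = run_pipeline("When did the first train run in England?",
                      toy_retrieve, toy_filter, toy_generate)
print(answer)
```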

System Modules

Retriever (Retrieval & Selection)

Retrieve initial set of passages from Wikipedia

Model or implementation: Adversarial Dense Passage Retriever (DPR)

Context Filter (M_ctx) (Retrieval & Selection)

Select specific sentences from P that support the answer

Model or implementation: Flan-T5-XL (3B) or Llama-2-7B

Generator (M_gen)

Generate final answer/response

Model or implementation: Flan-T5-XL (3B) or Llama-2-7B

Novel Architectural Elements

Training a separate sequence-to-sequence model solely for sentence-level context filtering based on information-theoretic signals (CXMI) rather than just semantic similarity.
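
One plausible reading of the CXMI signal is a ratio of the generator's probability of the reference output with and without a candidate sentence in context. The sketch below assumes that ratio form and assumes the sequence log-probabilities have already been obtained by scoring the reference output with the generator; the function names and the example numbers are hypothetical.

```python
import math
from typing import List


def cxmi_score(logp_with_context: float, logp_without_context: float) -> float:
    """Ratio p(o | t, q) / p(o | q), computed from sequence log-probabilities."""
    return math.exp(logp_with_context - logp_without_context)


def cxmi_filter(sentences: List[str], logps_with: List[float],
                logp_without: float, threshold: float = 1.0) -> List[str]:
    """Keep sentences whose presence raises the output probability above the threshold."""
    return [
        s for s, lp in zip(sentences, logps_with)
        if cxmi_score(lp, logp_without) > threshold
    ]


# Illustrative numbers only: the first sentence quadruples the output
# probability, the second halves it.
selected = cxmi_filter(
    ["helpful sentence", "distracting sentence"],
    logps_with=[math.log(0.4), math.log(0.05)],
    logp_without=math.log(0.1),
)
print(selected)
```

Unlike string or unigram matching, this signal depends on the generator itself, so the selected sentences can differ from one generator to another.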

Modeling

Base Model: Flan-T5-XL (3B) and Llama-2-7B

Training Method: Supervised Fine-Tuning (SFT) for both Filter and Generator

Objective Functions:

Purpose: Train filter to replicate oracle sentence selection.

Formally: M_ctx(t_silver | q + P)

Purpose: Train generator to produce answer given filtered context.

Formally: M_gen(o | t_silver + q)
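
Read as supervised fine-tuning objectives, the two expressions above would typically be trained by minimizing a negative log-likelihood. The form below is an assumption rather than a formula quoted from the paper: it treats M(· | ·) as the model's conditional probability of the target sequence and writes the concatenation "+" as ⊕.

```latex
\mathcal{L}_{\mathrm{ctx}} = -\log M_{\mathrm{ctx}}\left(t_{\mathrm{silver}} \mid q \oplus P\right),
\qquad
\mathcal{L}_{\mathrm{gen}} = -\log M_{\mathrm{gen}}\left(o \mid t_{\mathrm{silver}} \oplus q\right)
```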

Adaptation: LoRA (for Llama-2); Standard Fine-tuning (for Flan-T5)

Training Data:

Oracle filtering for training M_ctx constructed using three strategies: STRINC (exact string match), LEXICAL (unigram F1 > 0.5), CXMI (probability gain > 1.0).
Spans split using spaCy sentence tokenizer.
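
The LEXICAL oracle can be sketched as a unigram F1 score between each candidate sentence and a reference text, keeping sentences above the 0.5 threshold. This is a minimal sketch: whitespace tokenization, lowercasing, and keeping every sentence above the threshold (rather than, say, only the best one) are assumptions, and the sentences are taken as already split.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between two texts, using lowercase whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lexical_filter(sentences: List[str], reference: str,
                   threshold: float = 0.5) -> List[str]:
    """Keep sentences whose unigram F1 against the reference exceeds the threshold."""
    return [s for s in sentences if unigram_f1(s, reference) > threshold]
```

For example, `lexical_filter(["a b", "a c d"], "a c")` keeps only `"a c d"`: the first sentence scores exactly 0.5, which does not exceed the threshold.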

Key Hyperparameters:

learning_rate: 5e-5
batch_size: 32
epochs: 3
max_input_length: 1024
lexical_threshold: 0.5
cxmi_threshold: 1.0

Compute: Reduced input length by 44-64% at inference.

Comparison to Prior Work

vs. Evidentiality-guided: FILCO filters at sentence granularity vs. passage granularity.
vs. RAG/FiD: FILCO explicitly filters context before generation to remove noise vs. utilizing attention mechanisms over full context.
vs. Decontextualization [not cited in paper]: FILCO selects existing sentences rather than rewriting/decontextualizing them.

Limitations

Experiments limited to Wikipedia-based datasets (Open Domain).
Relies on automatic metrics (EM, F1) which may not fully capture generation quality for long-form tasks.
Requires training two separate models (Filter and Generator), adding complexity compared to end-to-end approaches.

Reproducibility

Code: https://github.com/zorazrw/filco

Uses standard datasets (KILT versions of NQ, TQA, HotpotQA, etc.) and standard models (Flan-T5, Llama-2).

📊 Experiments & Results

Evaluation Setup

Open-domain QA, Fact Verification, and Dialog Generation using Wikipedia passages.

Benchmarks:

NaturalQuestions (NQ) (Open-Domain QA)
TriviaQA (TQA) (Open-Domain QA)
HotpotQA (Multi-hop QA)
ELI5 (Long-Form QA)
FEVER (Fact Verification)
Wizard of Wikipedia (WoW) (Knowledge-Grounded Dialog)

Metrics:

Exact Match (EM)
F1 Score
Accuracy

Statistical methodology: Not explicitly reported in the paper
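
For reference, the Exact Match metric listed above is commonly computed after light answer normalization. The sketch below assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the summary does not state which normalization the paper uses, so treat the details as assumptions.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)
```

Under these assumptions, `exact_match("The 1560s.", ["1560s"])` is true, while `exact_match("1835", ["1560s"])` is false.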

Key Results

Performance comparison on single-passage retrieval setting using Llama-2-7B. FILCO generally outperforms full context (FULL) and passage-level filtering (PSG).

Benchmark	Metric	Baseline	This Paper	Δ
NaturalQuestions (NQ)	EM	34.7	43.3	+8.6
FEVER	Accuracy	82.3	86.6	+4.3
HotpotQA	F1	58.2	59.5	+1.3
TriviaQA (TQA)	EM	60.5	60.7	+0.2

Performance comparison on multiple-passage (Top-5) setting using Flan-T5-XL. FILCO shows robust gains over standard baselines.

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	Accuracy	88.1	91.4	+3.3
NaturalQuestions (NQ)	EM	48.3	61.8	+13.5
Wizard of Wikipedia (WoW)	F1	64.8	66.0	+1.2

Experiment Figures

Bar charts comparing Full, Passage-level filtering (Psg), FILCO, and Silver context generation performance across all datasets.

Bar chart showing average number of input tokens for Full, Psg, and FILCO methods.

Main Takeaways

FILCO consistently outperforms full-context and passage-filtering baselines across extractive QA, abstractive QA, and dialog tasks.
Different filtering strategies work best for different tasks: STRINC is best for extractive QA, LEXICAL for dialog, and CXMI for complex/abstractive tasks.
Sentence-level filtering reduces input token count by 44-64%, improving efficiency.
Filtering improves performance even when retrieved passages are negative, likely by removing misleading noise.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Conditional Cross-Mutual Information (CXMI)
Encoder-Decoder and Decoder-only Transformer architectures

Key Terms

FILCO: FILter COntext—the proposed method to select useful sentences from retrieved passages.

CXMI: Conditional Cross-Mutual Information—a measure of how much more likely the generator is to produce the correct output when context is provided vs. when it is not.

STRINC: String Inclusion—a binary measure of whether a text span lexically contains the exact output string.

LEXICAL: A measure based on unigram overlap (F1 score) between the candidate text span and the target output (or query).

RAG: Retrieval-Augmented Generation—providing external documents to a language model to assist in answering questions.

EM: Exact Match—evaluation metric checking if the generated answer is identical to the ground truth.

DPR: Dense Passage Retriever—a retrieval system using dense vector representations to find relevant documents.

spurious memorization: When a model learns to rely on irrelevant patterns or accidental correlations in the data rather than true causal relationships.