BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering

📝 Paper Summary

Modularized RAG pipeline Retrieval

BlendFilter improves RAG performance by blending external and internal knowledge to augment queries, followed by a filtering step where the LLM itself selects only relevant retrieved documents before answering.

Core Problem

RAG systems struggle with complex questions that lack explicit keywords for retrieval, and retrieved documents often contain irrelevant noise that confuses the LLM.

Why it matters:

Simple queries for complex tasks often miss key information, leading to retrieval failure
Standard retrieval often fetches noisy, irrelevant documents that distract the model or induce hallucinations
Existing query augmentation methods typically rely on a single source (internal or external), limiting coverage

Concrete Example: For a multi-hop question about implicit sub-problems, a standard retriever might miss documents because the original query lacks specific keywords. Even if documents are found, top-K retrieval may include irrelevant text that misleads the final answer generation.

Key Novelty

Query Generation Blending + LLM-as-a-Filter

Generates three query variants: the original query, one augmented with external knowledge (via Chain-of-Thought), and one augmented with internal LLM knowledge
Uses the LLM itself to act as a semantic filter, reading retrieved documents from all three query streams and discarding irrelevant ones before final answer generation

Architecture

Overview of the BlendFilter framework, illustrating the query blending and knowledge filtering processes.

Evaluation Highlights

Outperforms state-of-the-art baselines on 2WikiMultihopQA by up to 6.81 Exact Match (EM) points (32.05 to 38.86) using Llama-2-7b-chat
Achieves a 4.67-point EM gain on HotpotQA vs. Self-RAG (41.42 to 46.09) using Llama-2-13b-chat
Consistently improves performance across three different backbone models (Llama-2-7b, Llama-2-13b, GPT-3.5-turbo-Instruct)

Breakthrough Assessment

7/10

Strong empirical results and a logical combination of augmentation sources. The idea of using the LLM itself as a filter is effective but computationally expensive compared to lightweight classifiers.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering where an LLM answers query q using a knowledge base K

Inputs: Natural language query q

Outputs: Generated answer a

Pipeline Flow

Query Generation Blending: Generate augmented queries using external and internal knowledge
Retrieval: Retrieve documents for original and augmented queries
Knowledge Filtering: LLM filters irrelevant documents from each retrieval set
Answer Generation: LLM generates final answer using filtered knowledge
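The four stages above can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the `llm` and `retrieve` callables, the prompt wording, the per-document yes/no filtering, and the default `k=5` are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical stand-ins: any prompt -> text function and any (query, k) -> docs function.
LLM = Callable[[str], str]
Retriever = Callable[[str, int], List[str]]


def blend_queries(q: str, llm: LLM, retrieve: Retriever, k: int) -> List[str]:
    """Return [original, external-augmented, internal-augmented] queries."""
    # External: first-pass retrieval, then a draft answer grounded in those documents.
    first_pass = retrieve(q, k)
    ext_draft = llm("Context:\n" + "\n".join(first_pass)
                    + f"\nQuestion: {q}\nThink step by step.")
    # Internal: draft answer from the model's own parametric knowledge only.
    int_draft = llm(f"Question: {q}\nAnswer from memory, step by step.")
    return [q, f"{q} {ext_draft}", f"{q} {int_draft}"]


def filter_docs(q: str, docs: List[str], llm: LLM) -> List[str]:
    """Keep only documents the LLM judges relevant (assumed yes/no prompt)."""
    kept = []
    for d in docs:
        verdict = llm(f"Question: {q}\nDocument: {d}\n"
                      "Is this document relevant? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            kept.append(d)
    return kept


def blendfilter_answer(q: str, llm: LLM, retrieve: Retriever, k: int = 5) -> str:
    """Blend queries, retrieve per query, filter per stream, answer on the union."""
    knowledge: List[str] = []
    for query in blend_queries(q, llm, retrieve, k):
        docs = retrieve(query, k)
        for d in filter_docs(q, docs, llm):
            if d not in knowledge:  # union without duplicates, order preserved
                knowledge.append(d)
    return llm("Context:\n" + "\n".join(knowledge) + f"\nQuestion: {q}\nAnswer:")
```

Passing the model and retriever in as plain callables keeps the sketch independent of any particular LLM client or index, so the same control flow can be exercised with stubs.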

System Modules

External Knowledge Augmentor (Query Generation Blending)

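Generate a query augmented with external knowledge via two retrieval passes: retrieve with the original query, let the LLM draft a Chain-of-Thought answer from those documents, then append the draft to the query for the second pass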

Model or implementation: LLM (e.g., Llama-2-7b-chat)

Internal Knowledge Augmentor (Query Generation Blending)

Generate query augmented with internal knowledge

Model or implementation: LLM (e.g., Llama-2-7b-chat)

Retriever

Retrieve documents for all 3 query versions (original, external-aug, internal-aug)

Model or implementation: Contriever

Knowledge Filter

Select relevant documents from retrieval results

Model or implementation: LLM (e.g., Llama-2-7b-chat)

Answer Generator

Generate final answer using union of filtered knowledge

Model or implementation: LLM (e.g., Llama-2-7b-chat)

Novel Architectural Elements

Parallel multi-source query augmentation (blending internal and external knowledge)
LLM-based filtering stage applied independently to each retrieval stream before union
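One way to picture "filter each stream, then take the union" is the sketch below. It assumes a set-level variant in which the LLM sees a numbered list of documents and replies with the numbers worth keeping; the prompt text, the reply format, and the function names are hypothetical and not taken from the paper.

```python
import re
from typing import Callable, Dict, List


def filter_stream(question: str, docs: List[str], llm: Callable[[str], str]) -> List[str]:
    """One LLM call per stream: number the documents and ask for the relevant ones."""
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, start=1))
    prompt = (f"Question: {question}\nDocuments:\n{numbered}\n"
              "List the numbers of the documents that help answer the question.")
    reply = llm(prompt)
    picked = {int(n) for n in re.findall(r"\d+", reply)}
    # Indices outside 1..len(docs) are silently ignored.
    return [d for i, d in enumerate(docs, start=1) if i in picked]


def filtered_union(question: str, streams: Dict[str, List[str]],
                   llm: Callable[[str], str]) -> List[str]:
    """Filter every stream independently, then merge with order-preserving dedup."""
    merged: Dict[str, None] = {}
    for docs in streams.values():
        for d in filter_stream(question, docs, llm):
            merged.setdefault(d, None)
    return list(merged)
```

Filtering before the union means each call judges a short, single-query document list, at the cost of one extra LLM call per stream.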

Modeling

Base Model: Llama-2-7b-chat, Llama-2-13b-chat, GPT-3.5-turbo-Instruct

Comparison to Prior Work

vs. Standard RAG: BlendFilter augments queries and filters noise
vs. Self-RAG: BlendFilter does not require fine-tuning or special tokens; uses standard LLM prompting for filtering
vs. IRCOT: BlendFilter uses a blending approach for query generation rather than iterative interleaving steps
vs. RECOMP [not cited in paper]: RECOMP trains compressors/selectors, whereas BlendFilter prompts the frozen LLM to filter

Limitations

Inference cost is high due to multiple LLM calls (internal generation, external generation, filtering per stream)
Depends on the intrinsic capability of the LLM; smaller models might struggle with filtering or augmentation quality
Performance gains on single-hop QA (like PopQA) are marginal compared to complex multi-hop tasks

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia dump (Dec 2018)

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
PopQA (Entity-centric QA)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
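For reference, EM and token-level F1 are commonly computed as in the sketch below. The normalization (lowercasing, stripping punctuation and English articles) follows the widespread SQuAD-style convention; whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "Eiffel Tower in Paris" against the gold answer "Eiffel Tower" scores 0 on EM but has precision 0.5 and recall 1.0, so F1 is 2/3.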

Key Results

BlendFilter outperforms baselines significantly on multi-hop datasets (2Wiki, HotpotQA) using Llama-2-7b-chat.
Benchmark	Metric	Baseline	This Paper	Δ
2WikiMultihopQA	Exact Match (EM)	32.05	38.86	+6.81
HotpotQA	Exact Match (EM)	37.16	40.91	+3.75

Results using Llama-2-13b-chat show consistent improvements, confirming scalability.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Exact Match (EM)	41.42	46.09	+4.67
2WikiMultihopQA	Exact Match (EM)	36.25	41.67	+5.42

Ablation studies demonstrate the contribution of each component (Internal Augmentation, External Augmentation, Filtering).
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Exact Match (EM)	39.11	40.91	+1.80
HotpotQA	Exact Match (EM)	39.95	40.91	+0.96
HotpotQA	Exact Match (EM)	36.56	40.91	+4.35
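The Δ column reports absolute EM differences, not relative improvements. The short check below recomputes both from the main-result rows; the row labels and the `gains` helper are introduced here only for illustration.

```python
# (label, baseline EM, BlendFilter EM), copied from the main-result rows above.
ROWS = [
    ("2WikiMultihopQA / Llama-2-7b-chat", 32.05, 38.86),
    ("HotpotQA / Llama-2-7b-chat", 37.16, 40.91),
    ("HotpotQA / Llama-2-13b-chat", 41.42, 46.09),
    ("2WikiMultihopQA / Llama-2-13b-chat", 36.25, 41.67),
]


def gains(baseline: float, ours: float) -> tuple:
    """Return (absolute gain in EM points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for label, base, ours in ROWS:
    abs_gain, rel_gain = gains(base, ours)
    print(f"{label}: +{abs_gain} EM points ({rel_gain}% relative)")
```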

Experiment Figures

Impact of different K (number of retrieved documents) on performance (EM score) for HotpotQA and 2WikiMultihopQA.

Main Takeaways

External knowledge augmentation provides the largest performance gain, effectively handling complex query decomposition
Knowledge filtering is essential; removing it drops performance, validating that noise in retrieved documents harms generation
BlendFilter works across model sizes (7B, 13B) and types (Llama, GPT-3.5), showing robustness
The method is particularly effective for multi-hop questions (HotpotQA, 2Wiki) compared to entity-centric simple questions (PopQA)

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) concepts
Chain-of-Thought (CoT) prompting
Dense retrieval methods (e.g., Contriever)

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the generated answer on them

CoT: Chain-of-Thought, a prompting technique where the model generates intermediate reasoning steps before the final answer

External Knowledge Augmentation: Using a first pass of retrieval to generate a preliminary answer/rationale, which is then appended to the query for a second, better retrieval pass

Internal Knowledge Augmentation: Asking the LLM to generate a preliminary answer based solely on its pre-trained memory, then appending that to the query for retrieval

Exact Match (EM): A metric measuring the percentage of predictions that match the ground truth answer exactly

F1 score: The harmonic mean of token-level precision and recall, measuring word overlap between the prediction and the ground truth answer

Contriever: A dense retrieval model used to encode queries and documents into vectors for similarity search

Self-RAG: A baseline method that uses self-reflection tokens to critique and filter retrieved content
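To make the dense-retrieval entry above concrete, the sketch below ranks documents by the dot product between a query vector and document vectors and returns the top K. The vectors are toy values and `top_k` is a hypothetical helper; a real system would obtain embeddings from an encoder such as Contriever and use an approximate nearest-neighbor index.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec: Sequence[float],
          doc_vecs: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k highest-scoring documents."""
    scored = [(i, dot(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings, invented for this example.
docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
hits = top_k([1.0, 0.0], docs, k=2)
```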