Accelerating inference of retrieval-augmented generation via sparse context selection

📝 Paper Summary

Modularized RAG pipeline, Efficient Inference

Sparse RAG accelerates retrieval-augmented generation by encoding retrieved documents in parallel and dynamically filtering out irrelevant contexts during decoding using an integrated relevance assessment mechanism.

Core Problem

Standard RAG concatenates all retrieved documents into the input, causing latency to grow linearly with the number of documents (and quadratically for attention) while often including irrelevant noise.

Why it matters:

Including many documents (e.g., 20+) causes dramatic latency increases, making real-time RAG applications on resource-constrained devices (like mobile phones) impractical
Existing solutions like Fusion-in-Decoder (FiD) are incompatible with decoder-only LLMs, while Parallel Context Windows (PCW) speeds up pre-filling but still suffers from slow decoding due to full cache usage
Reliance on external classifiers for filtering adds complexity and latency due to extra model calls

Concrete Example: When a standard RAG system retrieves 20 documents, it must attend to all of them during every token generation step. If only 2 are relevant, the model wastes significant compute attending to 18 irrelevant documents, slowing down generation and potentially hallucinating based on noise.
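The arithmetic behind this example can be sketched as follows. The per-document and query token counts are assumptions chosen for illustration, not figures from the paper, and `attended_tokens` is a hypothetical helper. The sketch only counts the cached positions each new token attends to. It is not a latency model, since attention over the cache is only one part of the per-token cost.

```python
def attended_tokens(num_docs: int, tokens_per_doc: int, query_tokens: int) -> int:
    """Number of cached positions each newly generated token attends to."""
    return num_docs * tokens_per_doc + query_tokens


TOKENS_PER_DOC = 200  # assumed average document length, not from the paper
QUERY_TOKENS = 20     # assumed query length, not from the paper

dense = attended_tokens(20, TOKENS_PER_DOC, QUERY_TOKENS)   # all 20 retrieved documents
sparse = attended_tokens(2, TOKENS_PER_DOC, QUERY_TOKENS)   # only the 2 relevant ones
reduction = dense / sparse

print(dense, sparse, round(reduction, 2))
```

Under these assumptions the cache read per decoding step shrinks from 4020 to 420 positions.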

Key Novelty

Dual-Task Integrated Sparse Retrieval

Encodes retrieved documents in parallel (like PCW) so that documents do not attend to one another, removing the long-range cross-document attention cost during the pre-fill stage
Integrates a 'Per Context Assessment' (PCA) task where the LLM scores the relevance of each document to the query within the same forward pass
Selectively loads only the Key-Value (KV) caches of highly relevant documents for the decoding stage, significantly reducing the active context size

Architecture

Overview of the Sparse RAG inference process, comparing dense retrieval attention with the proposed sparse mechanism.

Evaluation Highlights

Achieves 2x to 3x faster decoding speed compared to standard Dense RAG and PCW-RAG on mobile devices (Samsung S21 Ultra)
Maintains or improves generation quality (e.g., +1.89% F1 on PopQA, +2.67% F1 on AmbigQA vs baselines) by effectively filtering noise
Outperforms Corrective-RAG (CRAG) using an efficient 'in-place' classifier rather than an external T5 model

Breakthrough Assessment

7/10

A strong practical engineering contribution that solves the specific bottleneck of RAG latency on decoder-only models without sacrificing quality. It effectively combines parallel encoding with dynamic sparsity.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering and summarization using retrieval-augmented LLMs

Inputs: Query Q and a set of retrieved contexts {C_1, ..., C_N}

Outputs: Generated response A

Pipeline Flow

Parallel Pre-fill: Encode Query + Each Context independently
Relevance Scoring: Model predicts 'Good'/'Bad' for each context
Cache Filtering: Keep only the KV caches of contexts whose relevance score exceeds a threshold
Sparse Decoding: Generate answer attending only to Query + Selected Contexts
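The four steps above can be sketched as a small selection routine. This is a minimal illustration, not the paper's implementation: `EncodedContext`, `sparse_rag_select` and `toy_encode_and_score` are hypothetical names, the KV cache is a placeholder, and the scorer is a word-overlap stand-in for the LLM's own relevance assessment.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EncodedContext:
    text: str
    kv_cache: Optional[object]  # placeholder for the per-context KV cache
    score: float                # relevance score from the assessment step


def sparse_rag_select(
    query: str,
    contexts: List[str],
    encode_and_score: Callable[[str, str], EncodedContext],
    threshold: float,
) -> List[EncodedContext]:
    """Encode each context independently, then keep those scoring above the threshold."""
    encoded = [encode_and_score(query, c) for c in contexts]  # parallel pre-fill + scoring
    return [e for e in encoded if e.score > threshold]        # cache filtering


def toy_encode_and_score(query: str, context: str) -> EncodedContext:
    """Stand-in scorer: fraction of query words that appear in the context."""
    q_words = set(query.lower().split())
    c_words = set(context.lower().split())
    score = len(q_words & c_words) / len(q_words) if q_words else 0.0
    return EncodedContext(text=context, kv_cache=None, score=score)


query = "who wrote hamlet"
contexts = [
    "Shakespeare wrote Hamlet around 1600",
    "Paris is the capital of France",
    "Hamlet is a tragedy",
]
selected = sparse_rag_select(query, contexts, toy_encode_and_score, threshold=0.5)
print([e.text for e in selected])
```

In a real system the decoder would then generate the answer while attending only to the query and the caches in `selected`.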

System Modules

Parallel Encoder / Scorer (Encoding & Selection)

Encodes contexts in parallel and scores relevance

Model or implementation: Gemini (finetuned with LoRA)

Cache Filter (Encoding & Selection)

Filters out irrelevant context caches

Model or implementation: Threshold Logic

Sparse Decoder

Generates final answer using only selected caches

Model or implementation: Gemini (finetuned with LoRA)

Novel Architectural Elements

Integrated PCA (Per Context Assessment) logic within the standard decoder pre-fill pass
Dynamic KV-cache dropping mechanism between pre-fill and decode stages based on internal confidence scores

Modeling

Base Model: Gemini (Nano/XXS sizes utilized for on-device experiments)

Training Method: Multi-task Fine-tuning (Assessment + Generation)

Objective Functions:

Purpose: Train model to classify context relevance.

Formally: Standard next-token prediction loss on 'Rate' tokens ('Good'/'Bad') given {Question}{Context}{Control_Assessment}

Purpose: Train model to generate answers.

Formally: Standard next-token prediction loss on 'Answer' tokens given {Question}{Selected_Contexts}{Control_Generation}
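One way to write the two objectives is as token-level negative log-likelihoods. The notation is introduced here for illustration and is not taken from the paper: r_i is the rating token for context C_i, S is the set of selected contexts, and a_t is the t-th answer token. How the two losses are weighted or mixed during training is not stated in this summary.

```latex
\mathcal{L}_{\text{assess}}
  = -\sum_{i=1}^{N} \log p_\theta\!\left(r_i \mid Q,\, C_i,\, \texttt{Control\_Assessment}\right),
  \qquad r_i \in \{\text{Good}, \text{Bad}\}

\mathcal{L}_{\text{gen}}
  = -\sum_{t=1}^{T} \log p_\theta\!\left(a_t \mid Q,\, \mathcal{S},\, \texttt{Control\_Generation},\, a_{<t}\right),
  \qquad \mathcal{S} \subseteq \{C_1, \ldots, C_N\}
```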

Adaptation: LoRA (Rank=4, applied to self-attention)
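To give a sense of how small a rank-4 adapter is, the sketch below counts the trainable parameters LoRA adds to a single projection matrix. The hidden size is a hypothetical value, since the summary does not report the model's dimensions, and `lora_param_count` is an illustrative helper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


D_MODEL = 2048  # hypothetical hidden size, not from the paper
RANK = 4        # LoRA rank reported above

per_matrix = lora_param_count(D_MODEL, D_MODEL, RANK)
full_matrix = D_MODEL * D_MODEL
fraction = per_matrix / full_matrix

print(per_matrix, full_matrix, f"{fraction:.4%}")
```

With these assumed dimensions, each adapted matrix gains 16,384 trainable parameters, under half a percent of the frozen matrix it modifies.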

Training Data:

PopQA (14k pairs), QMSum (250 train / 77 test)
Missing relevance labels generated via 'Gemini + PaLM2' critique loop

Key Hyperparameters:

learning_rate: 0.003
batch_size: 64
optimizer: Adafactor
dropout: 0.05

Compute: Training: 64 TPU v3 (PopQA) / 128 TPU v3 (QMSum). Inference evaluation: Samsung S21 Ultra CPU.

Comparison to Prior Work

vs. PCW: Sparse RAG filters caches *before* decoding, reducing memory/compute, whereas PCW keeps all caches.
vs. CRAG: Sparse RAG uses an 'internal' classifier (same model pass), reducing overhead compared to CRAG's external model calls.
vs. RECOMP [not cited in paper]: RECOMP compresses retrieved contexts into short textual summaries before they are fed to the model, whereas Sparse RAG selects discrete KV caches.

Limitations

Requires fine-tuning (LoRA) to enable the assessment capability; not a plug-and-play inference strategy for off-the-shelf models.
Performance depends on the quality of the 'silver' labels used to train the internal assessor.
Threshold sensitivity: strict filtering might drop relevant context if the internal score is calibrated poorly.

Reproducibility

Code is not provided. The base models are proprietary Gemini models; the datasets (PopQA, QMSum) are standard. LoRA details are given, and the prompt templates for assessment are in the Appendix.

📊 Experiments & Results

Evaluation Setup

Open-domain QA (PopQA) and Meeting Summarization (QMSum) on mobile CPU

Benchmarks:

PopQA (Entity-centric QA)
QMSum (Query-based meeting summarization)

Metrics:

Exact Match (EM)
F1 score
RougeLSum
Decoding Speed (tokens/second)
Encoding Speed (tokens/second)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Sparse RAG achieves higher generation quality (F1/EM) compared to standard Dense RAG and PCW baselines on PopQA.
PopQA	F1	46.3	49.1	+2.8
PopQA	Decoding Speed (t/s)	1.78	4.15	+2.37
Results on QMSum (Long-form generation) show Sparse RAG maintains quality while drastically improving speed.
QMSum	RougeLSum	30.4	30.9	+0.5
QMSum	Decoding Speed (t/s)	1.71	5.71	+4.00
Comparison against Corrective-RAG (CRAG) shows the internal classifier matches or beats external classifier approaches.
PopQA	F1	48.2	49.1	+0.9
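The Δ column reports absolute differences. The relative decoding speedups can be derived from the two speed rows of the table, as in this short calculation, which uses only the tokens/second values listed above.

```python
# (baseline tokens/s, Sparse RAG tokens/s) taken from the decoding-speed rows above
decoding_speed = {
    "PopQA": (1.78, 4.15),
    "QMSum": (1.71, 5.71),
}

speedups = {name: sparse / base for name, (base, sparse) in decoding_speed.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

This gives roughly 2.3x on PopQA and 3.3x on QMSum.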

Experiment Figures

Impact of context number and output length on decoding speed.

Main Takeaways

Sparse RAG enables >2x decoding speedups on mobile devices by effectively reducing the number of attended documents during generation.
The method improves generation quality by filtering out noise; keeping only ~8/20 docs for PopQA and ~4.5/20 for QMSum yielded better F1/Rouge scores.
Internal relevance assessment (using the same LLM) outperforms external classifiers (like T5 in CRAG) in both efficiency and accuracy when trained with high-quality distilled labels.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms, KV caching)
Retrieval-Augmented Generation (RAG) standard flow
Decoder-only LLM inference bottlenecks (pre-fill vs. decoding)

Key Terms

KV cache: Key-Value cache—storing intermediate attention representations to avoid recomputing them for every token during generation

PCW: Parallel Context Windows—a method where contexts are encoded independently (no cross-attention between them) to speed up processing
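A minimal sketch of the attention pattern this definition implies: tokens inside a context see only earlier tokens of the same context, while the trailing tokens (query or generated answer) see every context. The function name and the layout, with contexts first and a shared tail last, are assumptions for illustration rather than the exact layout used by PCW or Sparse RAG.

```python
from typing import List


def parallel_context_mask(context_lens: List[int], tail_len: int) -> List[List[bool]]:
    """mask[q][k] is True when position q may attend to position k."""
    segment: List[int] = []
    for seg_id, length in enumerate(context_lens):
        segment.extend([seg_id] * length)  # one segment id per context
    segment.extend([-1] * tail_len)        # -1 marks the shared tail

    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: never look ahead
            if segment[q] == -1 or segment[q] == segment[k]:
                mask[q][k] = True
    return mask


mask = parallel_context_mask([2, 2], tail_len=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Dropping a context in Sparse RAG then corresponds to removing that context's columns before decoding.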

FiD: Fusion-in-Decoder—an architecture that encodes passages independently and decodes jointly, originally for encoder-decoder models

Decoder-only: LLM architectures like GPT or Llama that use only a decoder stack, typically utilizing causal masking

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

Per Context Assessment (PCA): The auxiliary task introduced in this paper where the model predicts a relevance score (e.g., probability of token 'Good') for each retrieved document
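One plausible way to turn the model's output into such a score is a softmax restricted to the two rating tokens. This is an assumption made for illustration: the paper may instead use the full-vocabulary probability of 'Good', and `relevance_score` is a hypothetical helper.

```python
import math


def relevance_score(logit_good: float, logit_bad: float) -> float:
    """Probability of 'Good' under a softmax over only the two rating tokens."""
    m = max(logit_good, logit_bad)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_bad = math.exp(logit_bad - m)
    return e_good / (e_good + e_bad)


print(relevance_score(0.0, 0.0))
print(round(relevance_score(2.0, 0.0), 4))
```

A context would be kept when this score exceeds the chosen threshold.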

BM25: A standard probabilistic information retrieval function used to rank documents based on keyword matching

Exact Match (EM): Evaluation metric measuring if the generated answer is identical to the ground truth
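A simplified sketch of EM and the token-level F1 commonly reported alongside it in QA evaluation. The normalization here only lowercases and splits on whitespace, which is an assumption. Standard QA scripts usually also strip punctuation and articles, and the paper's exact scoring script is not described in this summary.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("William Shakespeare", "william shakespeare"))
print(round(token_f1("william shakespeare", "shakespeare"), 4))
```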

RougeLSum: Evaluation metric for summarization measuring the overlap of longest common subsequences between generated and reference summaries
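The core of the metric is a longest-common-subsequence F-measure, sketched below for a single candidate/reference pair. This is plain ROUGE-L on whitespace tokens, an intentional simplification: RougeLSum additionally splits summaries into sentences and aggregates the LCS across them, which this sketch does not do.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one row of DP state."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the team agreed on the budget", "the team approved the budget")
print(round(score, 4))
```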