Accelerating Iterative Retrieval-augmented Language Model Serving with Speculation

📝 Paper Summary

Iterative Retrieval-Augmented Generation (Iterative RAG), LLM Serving Optimization, Speculative Decoding

RaLMSpec accelerates iterative retrieval-augmented language models by speculatively retrieving from a local cache and verifying correctness via efficient batched queries to the external knowledge base.

Core Problem

Iterative RaLM approaches frequently query external knowledge bases during generation (e.g., every token or sentence), causing severe latency bottlenecks due to sequential retrieval overhead.

Why it matters:

Iterative RAG achieves higher generation quality than one-shot RAG but is often practically unusable due to prohibitive latency (e.g., KNN-LM retrieves per token)
Standard serving executes retrieval and generation sequentially; for exact dense retrievers, retrieval time often dominates end-to-end latency
Existing optimizations like automaton-augmented retrieval may compromise generation quality; lossless acceleration is needed

Concrete Example: In standard iterative RAG, generating a sentence might require 3 sequential retrieval steps (q0→DocA, q1→DocB, q2→DocA). This halts generation 3 times. RaLMSpec speculatively reuses DocA from a local cache for q2, allowing the model to proceed, and verifies all 3 queries in a single batched parallel step later.

Key Novelty

Speculative Retrieval with Batched Verification (RaLMSpec)

Leverages 'temporal/spatial locality': consecutive retrieval steps often return the same or adjacent documents, allowing a small local cache to act as a high-speed speculative retriever
Replaces sequential knowledge base queries with fast local cache lookups, running generation speculatively until a 'verification step' is triggered
Performs 'batched verification': sends accumulated queries to the external database in one parallel batch, correcting the output only if the speculation (local cache result) mismatches the ground truth

Architecture

The speculative retrieval pipeline showing the interaction between the language model, local cache, and external knowledge base.

Evaluation Highlights

Up to 2.39× speedup for document-level iterative RAG (GPT-2) using an Exact Dense Retriever (EDR) on Wiki-QA
Up to 7.59× speedup for token-level iterative RAG (KNN-LM) using an Exact Dense Retriever
Consistent speedups across 3 models (GPT-2, OPT, LLaMA-2) and 3 retriever types (Exact Dense, Approx Dense, Sparse), with provably identical model outputs

Breakthrough Assessment

7/10

Novel application of speculative execution to the *retrieval* component rather than just decoding. Provides significant, lossless speedups for a specific but high-cost class of models (iterative RAG).

⚙️ Technical Details

Problem Definition

Setting: Accelerating inference for iterative Retrieval-Augmented Language Models (RaLM) that interleave retrieval and generation steps

Inputs: Input tokens X, external corpus C, language model f(·)

Outputs: Generated text sequence X_out (provably identical to baseline non-speculative execution)

Pipeline Flow

Initialize Local Cache (populate with initial retrieval)
Speculative Loop: Retrieve from Local Cache → Generate Text → Repeat 's' times
Batched Verification: Retrieve ground truth for all 's' queries from Knowledge Base
Correction: On a mismatch, roll back to the first mismatched step and regenerate from there; update the Local Cache
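
The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `serve_speculative`, `make_query`, `generate`, `kb_retrieve_batch`, and the toy `KB` are hypothetical names, the cache guess is simply the most recently verified document, and generation is a deterministic stand-in for a language model.

```python
from typing import Callable, List, Tuple

def serve_speculative(
    make_query: Callable[[List[str]], str],
    generate: Callable[[List[str], str], str],
    kb_retrieve_batch: Callable[[List[str]], List[str]],
    num_steps: int,   # assumed >= 1
    stride: int,      # speculative steps per verification round
) -> Tuple[List[str], int]:
    """Return (generated chunks, number of knowledge-base round trips)."""
    context: List[str] = []
    # Initialize the local cache with one real retrieval.
    doc = kb_retrieve_batch([make_query(context)])[0]
    kb_calls = 1
    cache: List[str] = [doc]  # most recently verified document is last
    context.append(generate(context, doc))

    while len(context) < num_steps:
        # Speculative loop: consult the cache instead of the knowledge base.
        pending: List[Tuple[int, str, str]] = []  # (position, query, guess)
        for _ in range(min(stride, num_steps - len(context))):
            query = make_query(context)
            guess = cache[-1]  # temporal-locality guess (an assumption)
            pending.append((len(context), query, guess))
            context.append(generate(context, guess))

        # Batched verification: one round trip for all pending queries.
        truth = kb_retrieve_batch([q for _, q, _ in pending])
        kb_calls += 1

        # Correction: roll back to the first mismatch and redo that step.
        for (pos, _, guess), true_doc in zip(pending, truth):
            cache.append(true_doc)
            if guess != true_doc:
                del context[pos:]
                context.append(generate(context, true_doc))
                break
    return context, kb_calls

# Toy setup: the query depends only on the step index (hypothetical data).
KB = {"q0": "DocA", "q1": "DocA", "q2": "DocB",
      "q3": "DocB", "q4": "DocB", "q5": "DocA"}

def make_query(ctx: List[str]) -> str:
    return f"q{len(ctx)}"

def generate(ctx: List[str], doc: str) -> str:
    return f"{doc}@{len(ctx)}"

def kb_retrieve_batch(queries: List[str]) -> List[str]:
    return [KB[q] for q in queries]

# Sequential baseline: one knowledge-base call per step.
baseline: List[str] = []
for _ in range(6):
    baseline.append(generate(baseline, KB[make_query(baseline)]))

output, kb_calls = serve_speculative(
    make_query, generate, kb_retrieve_batch, num_steps=6, stride=3)
```

In this toy run the speculative output matches the sequential baseline chunk for chunk, while the knowledge base is contacted 3 times instead of 6. Every kept step was generated with the document the knowledge base would have returned for the same query, which is why the result is lossless.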

System Modules

Local Cache

Stores recently retrieved documents to serve as a fast, approximate retriever

Model or implementation: Key-Value Store (Vector representations or sparse stats)
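
As a rough illustration of a key-value cache acting as a retriever, the sketch below stores document embeddings and returns the best-scoring cached entry for a query. The class name `LocalCache`, its methods, and the inner-product scoring are assumptions chosen for illustration; the idea is that the cache should rank entries with the same scoring function as the external retriever, so that a cached true top-1 document is also the cache's top-1.

```python
from typing import Dict, Optional, Sequence

Vector = Sequence[float]

def dot(u: Vector, v: Vector) -> float:
    """Inner product, standing in for the external retriever's score."""
    return sum(x * y for x, y in zip(u, v))

class LocalCache:
    """Tiny in-memory stand-in for the speculative retriever (hypothetical API)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Vector] = {}  # doc id -> embedding

    def update(self, doc_id: str, embedding: Vector) -> None:
        """Insert or overwrite a document, e.g. after a verification step."""
        self._docs[doc_id] = embedding

    def speculate(self, query: Vector) -> Optional[str]:
        """Return the best-scoring cached doc id, or None if the cache is empty."""
        if not self._docs:
            return None
        return max(self._docs, key=lambda d: dot(query, self._docs[d]))

cache = LocalCache()
cache.update("DocA", [1.0, 0.0])
cache.update("DocB", [0.0, 1.0])
guess = cache.speculate([0.9, 0.1])  # query is closest to DocA's embedding
```

A sparse retriever would need different cached statistics than embeddings; that variant is not shown here.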

Language Model

Generates text conditioned on context and (speculated) documents

Model or implementation: GPT-2, OPT, or LLaMA-2

External Retriever

Provides ground truth documents for verification

Model or implementation: DPR (Dense Passage Retriever) or BM25

OS3 Scheduler

Dynamically adjusts speculation stride 's'

Model or implementation: Optimization rule driven by runtime estimates of speculation accuracy and latency

Novel Architectural Elements

Application of speculative execution specifically to the retrieval bottleneck (Speculative Retrieval)
Batched Verification mechanism for RAG systems to exploit retrieval parallelism
OS3 (Optimal Speculation Stride Scheduler) which adapts stride length based on runtime estimation of retrieval accuracy and latency
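
One plausible way to turn "runtime estimation of retrieval accuracy and latency" into a stride choice is to maximize expected progress per unit time. The sketch below is an assumption for illustration, not necessarily the exact rule used by OS3: `gamma` is the estimated probability that a speculated document matches the ground truth, `spec_latency` is the cost of one speculative step, `verify_latency` is the cost of one batched verification, and verification is assumed synchronous.

```python
def expected_progress(gamma: float, stride: int) -> float:
    """Expected steps kept per round, counting the corrected step at the
    first mismatch: sum of gamma**i for i in [0, stride)."""
    return sum(gamma ** i for i in range(stride))

def best_stride(gamma: float, spec_latency: float, verify_latency: float,
                max_stride: int = 16) -> int:
    """Pick the stride that maximizes expected progress per unit time."""
    def rate(stride: int) -> float:
        round_time = stride * spec_latency + verify_latency
        return expected_progress(gamma, stride) / round_time
    return max(range(1, max_stride + 1), key=rate)

# Hypothetical estimates: verification costs 10x a speculative step.
low_accuracy_stride = best_stride(0.5, spec_latency=1.0, verify_latency=10.0)
high_accuracy_stride = best_stride(0.9, spec_latency=1.0, verify_latency=10.0)
```

Under this objective, a higher estimated accuracy or a more expensive verification step favors a longer stride, while frequent mismatches push the stride toward 1.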

Modeling

Base Model: GPT2-medium (345M), OPT-1.3B, LLaMA-2-7B/13B/70B

Training Method: None (Inference-only optimization)

Compute: Inference tested on Oracle Cloud VM.GPU.A10 (1x A10 GPU, 15 CPUs). LLaMA-2-70B tested on 4x A100-80G.

Comparison to Prior Work

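vs. RaLMSeq: RaLMSpec replaces one sequential knowledge-base retrieval per step with speculative cache lookups followed by batched, parallel verification
vs. Alon et al. (2022): RaLMSpec guarantees output identical to the baseline (lossless), whereas the automaton-augmented retrieval of Alon et al. is approximate and may compromise generation quality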
vs. Speculative Decoding: RaLMSpec targets the retrieval bottleneck using a cache, not the generation bottleneck using a draft model

Limitations

Speedup is bounded by the ratio of retrieval latency to generation latency; the method is less effective when generation dominates (e.g., with fast approximate retrievers)
Depends on temporal/spatial locality; performance drops if the model constantly requires disjoint, new information not in cache
The benefit of asynchronous verification is currently simulated rather than measured, due to Python GIL (Global Interpreter Lock) limitations
KNN-LM caching logic relies on heuristic spatial locality (fetching next 'n' entries) which may not always hold
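
The first limitation can be made concrete with a back-of-the-envelope bound. The sketch assumes speculation can at best hide all retrieval time while generation time stays unchanged; the function name and the latencies are hypothetical and are not measurements from the paper.

```python
def max_speedup(generation_s: float, retrieval_s: float) -> float:
    """Upper bound on speedup if every retrieval stall were hidden and
    generation time stayed the same."""
    return (generation_s + retrieval_s) / generation_s

# Retrieval-heavy setting (e.g. a slow exact retriever): large headroom.
retrieval_heavy = max_speedup(generation_s=2.0, retrieval_s=8.0)

# Generation-heavy setting (e.g. a fast approximate retriever): little headroom.
generation_heavy = max_speedup(generation_s=8.0, retrieval_s=2.0)
```

With these illustrative numbers the bound is 5.0 in the retrieval-heavy case and 1.25 in the generation-heavy case, which matches the qualitative pattern of larger gains with exact dense retrievers.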

Reproducibility

Code: https://github.com/JackFram/ralm-sys

The implementation uses standard datasets (Wiki-QA, etc.) and models (LLaMA-2, DPR via Pyserini). Latency for asynchronous verification is simulated due to Python GIL limitations; all other measurements are wall-clock.

📊 Experiments & Results

Evaluation Setup

Document-level and Token-level Iterative RAG serving

Benchmarks:

Wiki-QA (Open-domain QA)
Web Questions (Open-domain QA)
Natural Questions (Open-domain QA)
Trivia QA (Open-domain QA)

Metrics:

End-to-end Latency (seconds)
Speed-up Ratio (vs. baseline RaLMSeq)
Statistical methodology: Mean and standard deviation over 5 independent runs are reported for the main latency results

Key Results

Benchmark	Metric	Baseline (RaLMSeq)	This Paper (RaLMSpec)	Δ
Document-Level Iterative RAG: RaLMSpec consistently outperforms sequential baselines, with highest gains on retrieval-heavy configurations (Exact Dense Retriever).
Average across QA datasets	Speed-up Ratio	1.0	2.39	+1.39
Average across QA datasets	Speed-up Ratio	1.0	1.75	+0.75
Average across QA datasets	Speed-up Ratio	1.0	1.77	+0.77
Token-Level Iterative RAG (KNN-LM): Massive speedups observed due to high frequency of retrieval operations.
Wiki-QA (KNN-LM)	Speed-up Ratio	1.0	3.88	+2.88
Wiki-QA (KNN-LM)	Speed-up Ratio	1.0	7.59	+6.59

Experiment Figures

Bar charts decomposing latency into Generation (G) and Retrieval (R) for RaLMSeq vs RaLMSpec across models and retrievers.

Speed-up ratios for KNN-LM serving across different 'k' values (neighbors) and stride sizes.

Main Takeaways

RaLMSpec achieves lossless acceleration: outputs are guaranteed to match the baseline exactly
Speedup is most significant when retrieval cost is high (Exact Dense Retriever) relative to generation cost
The Optimal Speculation Stride Scheduler (OS3) is critical for performance; fixed strides can hurt performance if mismatches occur frequently
Prefetching (Top-k cache update) improves speculation accuracy but has diminishing returns if 'k' is too large (increasing retrieval overhead)
Effectiveness generalizes across diverse architectures (GPT-2, OPT, LLaMA-2) and retrieval modalities (Dense/Sparse)

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Speculative Decoding/Execution concepts
Knowledge of Dense vs. Sparse Retrieval methods

Key Terms

Iterative RAG: A generation process where the model retrieves new documents multiple times during the generation of a single response (e.g., every sentence or token)

KNN-LM: K-Nearest Neighbor Language Model—a token-level iterative RAG method that interpolates the LM's next-token distribution with a distribution from retrieved nearest neighbors

Speculative Retrieval: Predicting the result of a retrieval operation (using a local cache) to proceed with generation, postponing the actual expensive retrieval

Batched Verification: Checking the validity of multiple speculative steps simultaneously by sending a group of queries to the external retriever in parallel

Speculation Stride: The number of consecutive speculative steps performed before triggering a verification step

Prefetching: Populating the local cache with extra documents (top-k instead of top-1) during verification to increase the cache hit rate for future speculations

Spatial Locality: The tendency for consecutive retrieval queries to access adjacent documents in the knowledge base (relevant for KNN-LM)

Temporal Locality: The tendency for consecutive retrieval queries to access the exact same document repeatedly
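
The prefetching and locality terms above suggest two simple cache-update policies, sketched here under stated assumptions. Both function names are hypothetical: `prefetch_top_k` keeps the top-k verified results instead of only the top-1, and `prefetch_next_n` models the KNN-LM heuristic of also caching the next n datastore entries after a hit.

```python
from typing import List

def prefetch_top_k(ranked_doc_ids: List[str], k: int) -> List[str]:
    """Documents to add to the cache after verification: the top-k results
    (best first), rather than only the top-1."""
    return ranked_doc_ids[:k]

def prefetch_next_n(hit_index: int, n: int, datastore_size: int) -> List[int]:
    """Datastore indices to cache after retrieving `hit_index`: the hit plus
    the next n entries, clipped at the end of the datastore."""
    last = min(hit_index + n + 1, datastore_size)
    return list(range(hit_index, last))

top_docs = prefetch_top_k(["DocA", "DocB", "DocC", "DocD"], k=2)
neighbors = prefetch_next_n(hit_index=98, n=3, datastore_size=100)
```

Larger `k` or `n` raises the chance that the next query hits the cache, at the price of more retrieval and cache-maintenance work per verification step.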