Towards understanding systems trade-offs in retrieval-augmented generation model inference

📝 Paper Summary

Modularized RAG pipeline
System efficiency optimization

This paper characterizes the systems-level performance overheads of RAG, revealing that retrieval can double Time-To-First-Token latency and that unoptimized index choices cause massive memory and throughput bottlenecks at scale.

Core Problem

While RAG improves LLM accuracy without retraining, it introduces severe systems performance penalties (high latency, memory bloat, and throughput degradation) that remain poorly understood and largely unoptimized.

Why it matters:

High infrastructure costs: Continuous retraining is impractical, making RAG essential, but RAG's own computational costs are rising unchecked.
Production viability: Naive RAG implementations can increase end-to-end latency to ~30 seconds, making them unusable for real-time applications.
Scalability limits: As knowledge stores grow to billions of chunks, unoptimized retrieval indices consume terabytes of memory, exceeding standard hardware capacities.

Concrete Example: When a user asks a question requiring updated knowledge, a standard RAG pipeline with a frequent retrieval stride (every 4 tokens) can take about 30 seconds to respond, with the retrieval step alone consuming 41% of that latency. A baseline LLM, by comparison, starts responding in roughly 500 ms, so the user experience degrades sharply.

Key Novelty

Systems-Level Taxonomy and Characterization of RAG

Constructs a taxonomy of RAG systems focusing on hardware/software trade-offs: retrieval algorithms (HNSW vs IVF), integration strategies (frequency of retrieval), and runtime parameters (batching).
Provides a detailed breakdown of latency (TTFT vs end-to-end), throughput, and memory consumption across different retrieval index types and datastore scales.

Architecture

Conceptual workflow of a RAG pipeline distinguishing between offline and online stages.

Evaluation Highlights

Retrieval stages nearly double the Time-To-First-Token (TTFT) latency from 495ms (baseline LLM) to 965ms in RAG setups.
Scaling the datastore from 1 million to 100 million chunks degrades retrieval throughput by up to 20x.
Memory-efficient indices (IVF-PQ) reduce memory usage by 7.2x compared to HNSW-SQ but cap recall at ~0.6, illustrating a sharp accuracy-efficiency trade-off.
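
The TTFT figures above can be checked with a few lines of arithmetic. This is only a sketch over the two numbers quoted in the highlights; the variable names are illustrative. The computed share covers everything added on top of the baseline, so it need not equal a per-stage retrieval share.

```python
# Figures quoted in the highlights above (milliseconds).
baseline_ttft_ms = 495   # LLM-only TTFT
rag_ttft_ms = 965        # TTFT with the RAG stages in the loop

added_ms = rag_ttft_ms - baseline_ttft_ms    # extra wait before the first token
slowdown = rag_ttft_ms / baseline_ttft_ms    # multiplicative slowdown
added_share = added_ms / rag_ttft_ms         # fraction of RAG TTFT that is overhead

print(f"added latency: {added_ms} ms")
print(f"slowdown: {slowdown:.2f}x")
print(f"share of RAG TTFT added on top of the baseline: {added_share:.1%}")
```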

Breakthrough Assessment

7/10

Valuable systems characterization that quantifies often-overlooked overheads (TTFT, tail latency). While it doesn't propose a new architecture, it exposes critical bottlenecks for future optimization.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Generation inference pipeline optimization

Inputs: User query q and a large non-parametric datastore of document chunks

Outputs: Generated text response Ans based on retrieved context

Pipeline Flow

Offline: Chunking → Embedding → Index Construction (HNSW/IVF)
Online: Query Encoding → Retrieval (ANN Search) → Re-ranking → LLM Inference (Prefill + Decoding)
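
The online stages can be sketched as a chain of timed calls, which is roughly how a per-stage latency breakdown would be collected. Every function below is a hypothetical stand-in (no real encoder, index, or LLM is called), and the default k=20 is an assumption chosen to mirror the Recall@20 metric.

```python
import time
from typing import Callable, Dict, List, Tuple

def timed(stage: str, timings: Dict[str, float], fn: Callable, *args):
    """Run fn(*args) and record its wall-clock duration under `stage`."""
    start = time.perf_counter()
    result = fn(*args)
    timings[stage] = time.perf_counter() - start
    return result

# Hypothetical stand-ins for the real components.
def encode_query(query: str) -> List[float]:
    return [float(len(query)), 1.0]

def retrieve(query_vec: List[float], k: int) -> List[str]:
    return [f"chunk-{i}" for i in range(k)]

def rerank(query_vec: List[float], chunks: List[str]) -> List[str]:
    return chunks[:2]

def generate(query: str, chunks: List[str]) -> str:
    return f"answer to {query!r} using {len(chunks)} chunks"

def answer(query: str, k: int = 20) -> Tuple[str, Dict[str, float]]:
    """Run the four online stages in order and time each one."""
    timings: Dict[str, float] = {}
    q_vec = timed("query_encoding", timings, encode_query, query)
    candidates = timed("retrieval", timings, retrieve, q_vec, k)
    context = timed("re_ranking", timings, rerank, q_vec, candidates)
    response = timed("llm_inference", timings, generate, query, context)
    return response, timings

response, timings = answer("who wrote Dune?")
print(response)
for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1e3:.3f} ms")
```

With real components, the time spent in the first three stages plus LLM prefill would make up TTFT.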

System Modules

Query Encoder (Retrieval & Selection)

Encodes input text into vector representation

Model or implementation: BGE Large

Retriever (Retrieval & Selection)

Performs similarity search over datastore to find relevant chunks

Model or implementation: FAISS (HNSW-SQ, IVF-PQ, or IVF-SQ indices)
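
To make the IVF idea concrete, here is a toy inverted-file search in pure Python. It is not FAISS's implementation and does no quantization. The centroids are sampled from the data rather than trained with k-means, and the names `build_inverted_lists`, `ivf_search`, and `nprobe` are illustrative.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]

def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_inverted_lists(vectors: List[Vector], centroids: List[Vector]) -> List[List[int]]:
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists: List[List[int]] = [[] for _ in centroids]
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query: Vector, vectors: List[Vector], centroids: List[Vector],
               lists: List[List[int]], k: int, nprobe: int) -> List[int]:
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    candidates.sort(key=lambda vid: sq_dist(query, vectors[vid]))
    return candidates[:k]

random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids = random.sample(vectors, 16)  # assumption: sampled, not k-means trained
lists = build_inverted_lists(vectors, centroids)
query = [random.gauss(0, 1) for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda vid: sq_dist(query, vectors[vid]))[:5]
approx = ivf_search(query, vectors, centroids, lists, k=5, nprobe=2)
full = ivf_search(query, vectors, centroids, lists, k=5, nprobe=len(centroids))
print("overlap with exact top-5 at nprobe=2:", len(set(approx) & set(exact)))
```

Probing every list reproduces exhaustive search. Probing fewer lists scans fewer vectors but can miss true neighbors, which is where the recall versus latency trade-off comes from.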

Re-ranker (Retrieval & Selection)

Selects best chunks from retrieved set

Model or implementation: Inner-product distance ranking
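
Inner-product ranking amounts to scoring each candidate chunk embedding against the query embedding and keeping the highest scores. A minimal sketch, with a hypothetical function name and toy vectors:

```python
from typing import List, Sequence, Tuple

def rerank_by_inner_product(query_vec: Sequence[float],
                            chunk_vecs: Sequence[Sequence[float]],
                            top_k: int) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the top_k highest inner products."""
    scores = [
        (idx, sum(q * c for q, c in zip(query_vec, vec)))
        for idx, vec in enumerate(chunk_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

query = [1.0, 0.0, 2.0]
chunks = [[0.5, 0.5, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 3.0]]
print(rerank_by_inner_product(query, chunks, top_k=2))
```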

Generator

Generates response conditioned on query + retrieved chunks

Model or implementation: GEMMA 2 (9B parameters)

Novel Architectural Elements

Exploration of variable retrieval strides (frequency of retrieval during generation) as a system design parameter impacting latency vs accuracy
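
A back-of-the-envelope cost model shows why the stride matters: a stride of s over T output tokens triggers about ceil(T / s) retrievals. The sketch below assumes each retrieval is followed by a fresh prefill over the updated context. All per-step costs are hypothetical placeholders, not measurements from the paper.

```python
import math

def num_retrievals(output_tokens: int, stride: int) -> int:
    """Retrievals issued when retrieving once every `stride` generated tokens."""
    return math.ceil(output_tokens / stride)

def end_to_end_ms(output_tokens: int, stride: int, retrieval_ms: float,
                  prefill_ms: float, decode_ms_per_token: float) -> float:
    """Toy model: every retrieval pays retrieval plus a new prefill (assumption)."""
    n = num_retrievals(output_tokens, stride)
    return n * (retrieval_ms + prefill_ms) + output_tokens * decode_ms_per_token

# Hypothetical costs chosen only to illustrate the trend.
for stride in (4, 16, 64, 256):
    total = end_to_end_ms(256, stride, retrieval_ms=100, prefill_ms=50,
                          decode_ms_per_token=20)
    print(f"stride={stride:>3}: {num_retrievals(256, stride):>2} retrievals, "
          f"{total / 1000:.2f} s end to end")
```

Under this model the retrieval term grows inversely with the stride, so small strides quickly dominate end-to-end latency.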

Modeling

Base Model: GEMMA 2 9B

Comparison to Prior Work

vs. Standard LLM: Quantifies the specific overhead (latency doubling, tail latency spikes) introduced by RAG
vs. Prior RAG work: Focuses on systems characterization (memory/throughput trade-offs of indices) rather than just modeling accuracy

Limitations

Study limited to dense retrieval only (sparse retrieval not evaluated)
Experiments run on a specific hardware setup (single node), not distributed clusters
Did not evaluate complex query transformation or advanced neural re-ranking algorithms
Datastore scaling simulation stopped at 100M chunks for throughput tests (1B only for memory)

Reproducibility

Uses standard open-source libraries (FAISS, HuggingFace, vLLM) and models (BGE Large, GEMMA 2). Datastore built from Common Crawl subset (100M chunks). Hardware specifics (Intel Xeon Silver 4316, NVIDIA A6000 ADA) provided. Code URL not explicitly provided in paper text.

📊 Experiments & Results

Evaluation Setup

RAG Inference pipeline performance characterization

Benchmarks:

TriviaQA-test (Question Answering queries for retrieval evaluation)
Common Crawl Subset (datastore construction, 100M chunks) [New]

Metrics:

Time-To-First-Token (TTFT)
End-to-End Latency
Retrieval Recall (@20)
Queries Per Second (QPS)
Index Memory Usage (GB/TB)

Statistical methodology: Not explicitly reported in the paper
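
Three of the metrics listed above are simple to compute from raw measurements. The helpers below are a sketch. Recall follows the definition in Key Terms, the percentile uses the nearest-rank method (an assumption, since the paper's method is not stated), and all sample data is made up.

```python
import math
from typing import Sequence, Set

def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of relevant ids that appear among the first k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def tail_gap(latencies_ms: Sequence[float]) -> float:
    """Gap between p99 and p50 latency."""
    return percentile(latencies_ms, 99) - percentile(latencies_ms, 50)

def qps(num_queries: int, elapsed_s: float) -> float:
    """Queries per second over a measured interval."""
    return num_queries / elapsed_s

print(recall_at_k([3, 7, 1, 9], {1, 2, 3}, k=3))
print(tail_gap(list(range(1, 101))))
print(qps(1000, 4.0))
```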

Key Results

Latency analysis reveals significant overheads introduced by RAG components compared to standard LLM inference.

Benchmark	Metric	Baseline	This Paper	Δ
Custom Setup (GEMMA 2 9B)	Time-To-First-Token (TTFT), ms	495	965	+470
Custom Setup (GEMMA 2 9B)	Retrieval Tail Latency Gap (p99 - p50)	Minimal (qualitative)	50	High

Index comparison highlights trade-offs between memory efficiency, recall, and throughput.

Benchmark	Metric	Baseline	This Paper	Δ
100M Chunk Datastore	Memory Usage Reduction Factor	1.0	7.2	+6.2
100M Chunk Datastore	Maximum Recall	0.95	0.6	-0.35
100M Chunk Datastore	Throughput (QPS) at Batch=128+	300	110	-190

Experiment Figures

Breakdown of Time-To-First-Token (TTFT) and End-to-End latency across different RAG configurations.

Trade-offs between Recall, Latency, Throughput, and Memory for different index types (HNSW vs IVF) and datastore sizes.

Main Takeaways

Retrieval overhead is substantial: It accounts for 41% of end-to-end latency and ~45% of TTFT, roughly doubling the initial wait time for users.
Re-retrieval is costly: Aggressive retrieval striding (every 4 tokens) explodes end-to-end latency to nearly 30 seconds, making it impractical for real-time use.
Throughput vs. Memory Trade-off: HNSW indices scale better with batch size (3x higher QPS than IVF) but require far more memory; IVF-PQ uses 7.2x less memory than HNSW-SQ, but its more compute-intensive search limits throughput.
Scalability Wall: Increasing datastore size from 1M to 100M chunks degrades throughput by 20x, indicating severe scalability challenges for billion-scale production systems.
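
The memory side of the trade-off can be approximated with bytes-per-vector arithmetic. This sketch assumes 8-bit scalar quantization (one byte per dimension), 8-bit PQ codes (one byte per sub-quantizer), about 2 x M four-byte links per HNSW node, and an 8-byte id per IVF entry. The dimension 1024 matches BGE Large embeddings. The values M=32 and 128 sub-quantizers are hypothetical, so the resulting ratio is not expected to match the 7.2x reported above.

```python
def hnsw_sq8_bytes_per_vector(dim: int, links_per_node: int) -> int:
    """SQ8 code (1 byte per dimension) plus ~2*M base-layer links of 4 bytes each."""
    return dim + 2 * links_per_node * 4

def ivf_pq_bytes_per_vector(num_subquantizers: int, id_bytes: int = 8) -> int:
    """8-bit PQ code (1 byte per sub-quantizer) plus the stored vector id."""
    return num_subquantizers + id_bytes

def index_size_gb(num_vectors: int, bytes_per_vector: int) -> float:
    """Approximate index size in GB, ignoring centroids and codebooks."""
    return num_vectors * bytes_per_vector / 1e9

hnsw = hnsw_sq8_bytes_per_vector(dim=1024, links_per_node=32)  # hypothetical M
ivfpq = ivf_pq_bytes_per_vector(num_subquantizers=128)         # hypothetical m

for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} chunks: HNSW-SQ ~{index_size_gb(n, hnsw):.1f} GB, "
          f"IVF-PQ ~{index_size_gb(n, ivfpq):.1f} GB")
```

Even under these rough assumptions, the HNSW-SQ estimate passes a terabyte at a billion chunks, in line with the scalability concern raised in the summary.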

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM inference phases (prefill vs. decoding)
Knowledge of Approximate Nearest Neighbor (ANN) search algorithms
Familiarity with vector databases and embeddings

Key Terms

TTFT: Time-To-First-Token—the latency from the moment a user sends a query until the model generates the first token of the response

HNSW: Hierarchical Navigable Small World—a graph-based approximate nearest neighbor search algorithm known for high speed and accuracy but high memory usage

IVF: Inverted File—a search index that clusters vectors to speed up search; more memory-efficient than HNSW but often less accurate

retrieval stride: The frequency at which the system performs a new retrieval operation during the generation process (e.g., retrieving new context every 4 tokens)

scalar quantization (SQ): A compression technique that reduces the precision of each vector component (e.g., from 32-bit float to 8-bit integer) to save memory

product quantization (PQ): A compression technique that splits vectors into sub-vectors and quantizes them separately, offering higher compression than scalar quantization

recall: The fraction of relevant documents successfully retrieved by the system compared to the total number of relevant documents available

tail latency: The response time for the slowest percentage of requests (e.g., p99), often much higher than the average due to system stalls or complex queries

QPS: Queries Per Second—a measure of the throughput of the retrieval system

re-ranking: A second stage in retrieval where a more accurate (but slower) model re-scores the initial set of retrieved documents to improve relevance