REFRAG: Rethinking RAG based Decoding

📝 Paper Summary

Modularized RAG pipeline, Context Compression

REFRAG accelerates RAG inference by feeding pre-computed, compressed chunk embeddings directly into the decoder instead of raw tokens, using a reinforcement learning policy to selectively expand only critical chunks.

Core Problem

RAG systems suffer from high latency (Time-To-First-Token) and memory usage because they process long, concatenated retrieval contexts where many tokens are irrelevant or redundant.

Why it matters:

Long contexts increase KV cache memory linearly and TTFT quadratically, limiting throughput for web-scale applications
Current methods treat RAG contexts as generic text, ignoring the block-diagonal attention sparsity that arises because retrieved passages are often unrelated to each other
Repeatedly encoding the same retrieved passages for different queries is computationally wasteful

Concrete Example: In a RAG system retrieving 10 passages, a standard LLM must re-process all tokens of every passage for every new query. REFRAG pre-computes embeddings for these passages once; for a new query, it feeds compact embeddings (e.g., 1 embedding per 16 tokens) to the decoder, expanding only the few chunks strictly necessary for the answer.
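To make the arithmetic in this example concrete, the following is a minimal sketch of how many positions the decoder sees with and without compression. The passage length (128 tokens) and query length (32 tokens) are assumptions chosen for illustration, not figures from the paper, and `decoder_input_length` is a hypothetical helper.

```python
import math

def decoder_input_length(num_passages, tokens_per_passage, query_tokens,
                         chunk_size, expanded_chunks=0):
    """Count decoder input positions when each context chunk becomes one embedding.

    expanded_chunks is the number of chunks fed as raw tokens instead of a
    single embedding (the selective-expansion case).
    """
    context_tokens = num_passages * tokens_per_passage
    num_chunks = math.ceil(context_tokens / chunk_size)
    if not 0 <= expanded_chunks <= num_chunks:
        raise ValueError("expanded_chunks must be between 0 and the number of chunks")
    compressed_chunks = num_chunks - expanded_chunks
    # Simplification: every expanded chunk is assumed to hold chunk_size tokens.
    return query_tokens + compressed_chunks + expanded_chunks * chunk_size

# Assumed sizes: 10 passages of 128 tokens each, plus a 32-token query.
full_length = 10 * 128 + 32                                           # 1312
compressed_length = decoder_input_length(10, 128, 32, chunk_size=16)  # 112
mixed_length = decoder_input_length(10, 128, 32, chunk_size=16,
                                    expanded_chunks=4)                # 172
print(full_length, compressed_length, mixed_length)
```

Under these assumed sizes, expanding 4 of the 80 chunks still leaves the decoder input at roughly an eighth of the uncompressed length.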

Key Novelty

Compress-Then-Select Decoding Framework

Replaces raw context tokens with pre-computed chunk embeddings from a lightweight encoder, reducing the decoder's input sequence length by factors like 16x or 32x
Employs a 'compress anywhere' mechanism that allows the decoder to mix compressed embeddings and raw tokens seamlessly
Uses a lightweight Reinforcement Learning policy to decide dynamically which chunks to keep compressed and which to expand to full tokens for accuracy
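The 'compress anywhere' idea can be sketched as a simple interleaving step: for each chunk, the decoder input holds either one projected chunk embedding or that chunk's raw token embeddings, in the original order. This is an illustrative sketch only; `build_mixed_input`, the toy dimensions, and the random arrays are assumptions, not the paper's implementation.

```python
import numpy as np

def build_mixed_input(token_embeds, chunk_embeds, expand_mask, chunk_size):
    """Interleave raw token embeddings and chunk embeddings in original order.

    token_embeds: (num_chunks * chunk_size, d) raw token embeddings of the context
    chunk_embeds: (num_chunks, d) one projected embedding per chunk
    expand_mask:  (num_chunks,) bool, True where the chunk is fed as raw tokens
    """
    rows = []
    for i, expand in enumerate(expand_mask):
        if expand:
            # Expanded chunk: keep all of its token embeddings.
            rows.append(token_embeds[i * chunk_size:(i + 1) * chunk_size])
        else:
            # Compressed chunk: a single embedding stands in for chunk_size tokens.
            rows.append(chunk_embeds[i:i + 1])
    return np.concatenate(rows, axis=0)

rng = np.random.default_rng(0)
d, k, num_chunks = 8, 4, 5  # toy sizes for illustration
token_embeds = rng.normal(size=(num_chunks * k, d))
chunk_embeds = rng.normal(size=(num_chunks, d))
expand_mask = np.array([False, True, False, False, True])

mixed = build_mixed_input(token_embeds, chunk_embeds, expand_mask, k)
print(mixed.shape)  # 3 compressed positions + 2 * 4 expanded positions = (11, 8)
```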

Architecture

The REFRAG architecture illustrating the compression and decoding process.

Evaluation Highlights

Achieves 30.85x speedup in Time-To-First-Token (TTFT) compared to LLaMA-2-7B with a compression rate of 32
Maintains perplexity comparable to full-context LLaMA while being 3.75x faster than the previous state-of-the-art compression method (CEPE)
Extends effective context window by 16x, outperforming LLaMA on downstream RAG tasks by utilizing more retrieved passages within the same latency budget

Breakthrough Assessment

8/10

Significant practical speedup (30x) for RAG without architectural changes to the base LLM. The ability to mix compressed and raw tokens via RL is a clever, effective solution to the compression-accuracy trade-off.

⚙️ Technical Details

Problem Definition

Setting: Efficient decoding of Large Language Models conditioned on long retrieved contexts

Inputs: Query tokens q and a set of retrieved context passages

Outputs: Generated response tokens y

Pipeline Flow

Offline: Chunking & Encoding → Compressed Embeddings Store
Online: Query → RL Policy (Selects Compress vs. Expand) → Mixed Input Construction → Decoder → Answer
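The offline/online split can be illustrated with a small embedding store that encodes each chunk once and reuses it across queries. This is a sketch under stated assumptions: `toy_encode` is a deterministic stand-in for the real context encoder, chunking is done on whitespace-separated words rather than model tokens, and `ChunkEmbeddingStore` is a hypothetical name.

```python
import hashlib
import numpy as np

EMBED_DIM = 8  # toy dimension for illustration

def toy_encode(chunk):
    """Deterministic stand-in for the context encoder (NOT RoBERTa)."""
    seed = int.from_bytes(hashlib.sha256(chunk.encode("utf-8")).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

class ChunkEmbeddingStore:
    """Caches chunk embeddings so each chunk is encoded at most once."""

    def __init__(self, encode_fn):
        self._encode = encode_fn
        self._cache = {}
        self.encoder_calls = 0

    def get(self, chunk):
        if chunk not in self._cache:
            self._cache[chunk] = self._encode(chunk)
            self.encoder_calls += 1
        return self._cache[chunk]

def chunk_words(text, chunk_size):
    """Split text into chunks of chunk_size words (a simplification of token chunking)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

store = ChunkEmbeddingStore(toy_encode)
passage = "the quick brown fox jumps over the lazy dog again and again"
chunks = chunk_words(passage, 4)  # 12 words, so 3 chunks

# Two different queries retrieve the same passage; the encoder runs only once per chunk.
for _query in ["first query", "second query"]:
    embeds = np.stack([store.get(c) for c in chunks])

print(len(chunks), store.encoder_calls, embeds.shape)
```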

System Modules

Context Encoder (Input Processing)

Compress text chunks into compact vector embeddings

Model or implementation: RoBERTa-Base or RoBERTa-Large

Projection Layer (Input Processing)

Map encoder embeddings to the decoder's token embedding space

Model or implementation: Linear Projection
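A minimal sketch of such a projection, assuming a single affine map from the encoder width to the decoder width. The dimensions used are the public hidden sizes of RoBERTa-Base (768) and LLaMA-2-7B (4096); the random weights, the bias term, and the `project` helper are illustrative assumptions rather than details from the paper.

```python
import numpy as np

ENCODER_DIM = 768    # RoBERTa-Base hidden size
DECODER_DIM = 4096   # LLaMA-2-7B hidden size

rng = np.random.default_rng(0)
# Randomly initialised parameters; in practice these would be learned.
W = rng.normal(scale=0.02, size=(ENCODER_DIM, DECODER_DIM))
b = np.zeros(DECODER_DIM)

def project(chunk_embeds):
    """Map (num_chunks, ENCODER_DIM) encoder outputs to (num_chunks, DECODER_DIM)."""
    return chunk_embeds @ W + b

chunk_embeds = rng.normal(size=(128, ENCODER_DIM))  # e.g. 128 chunks of context
decoder_inputs = project(chunk_embeds)
print(decoder_inputs.shape)  # (128, 4096)
```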

Selection Policy

Decide which chunks to keep compressed and which to expand to raw tokens

Model or implementation: Lightweight RL Policy Network
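The output of the selection step is a per-chunk expand/compress decision. The sketch below turns per-chunk scores into such a mask by expanding the highest-scoring fraction of chunks. The scores here are made-up placeholders for whatever the learned policy produces, and `select_chunks_to_expand` with its top-fraction rule is an assumption for illustration, not the paper's RL procedure.

```python
import numpy as np

def select_chunks_to_expand(scores, expand_fraction):
    """Return a boolean mask that is True for the chunks to feed as raw tokens."""
    if not 0.0 <= expand_fraction <= 1.0:
        raise ValueError("expand_fraction must be in [0, 1]")
    scores = np.asarray(scores, dtype=float)
    num_expand = int(round(expand_fraction * len(scores)))
    mask = np.zeros(len(scores), dtype=bool)
    if num_expand > 0:
        # Stable sort on negated scores: highest scores first, ties keep original order.
        top = np.argsort(-scores, kind="stable")[:num_expand]
        mask[top] = True
    return mask

# Placeholder scores for 8 chunks (hypothetical policy outputs).
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.6, 0.4]
mask = select_chunks_to_expand(scores, expand_fraction=0.25)
print(mask.tolist())  # chunks 1 and 3 are expanded
```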

Decoder

Generate the answer using the mixed sequence of query tokens and (mostly) compressed chunk embeddings

Model or implementation: LLaMA-2-7B or LLaMA-2-13B

Novel Architectural Elements

Input mixing mechanism allowing the decoder to accept a sequence containing both standard token embeddings and projected chunk embeddings at arbitrary positions
Reinforcement learning loop specifically designed to optimize the discrete decision of compression vs. expansion for RAG contexts

Modeling

Base Model: LLaMA-2-7B (Decoder), RoBERTa (Encoder)

Training Method: Continual Pre-training (CPT) followed by Supervised Fine-Tuning (SFT)

Objective Functions:

Purpose: Align encoder and decoder to reconstruct tokens from embeddings.
Formally: Reconstruction task (freeze decoder, train encoder/projector)

Purpose: Adapt model to predict next tokens using compressed context.
Formally: Next paragraph prediction during CPT

Purpose: Optimize selection policy.
Formally: RL objective using negative perplexity as reward
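For the RL objective, a reward of negative perplexity can be computed directly from the decoder's per-token log-probabilities on the target text. The sketch below shows only that reward computation under the standard definition of perplexity; the `reward` helper and the example log-probabilities are illustrative assumptions, and the policy update itself is not shown.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def reward(token_log_probs):
    """Negative perplexity: a higher reward means the decoder predicted the text better."""
    return -perplexity(token_log_probs)

# Hypothetical log-probabilities under two different compress/expand choices.
good_selection = [math.log(0.5)] * 4    # each target token gets probability 0.5
poor_selection = [math.log(0.25)] * 4   # each target token gets probability 0.25

print(reward(good_selection) > reward(poor_selection))  # True
```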

Training Data:

SlimPajama dataset (20B tokens sampled from Book and ArXiv domains)
Instruction tuning datasets: OpenAssistant, MS MARCO, SQuADv2, etc. (1.1 million data points)

Key Hyperparameters:

context_length_s: 2048
output_length_o: 2048
chunk_size_k: 16 or 32
training_tokens: 20 Billion
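As a quick check on what these settings imply, the sketch below derives the number of chunk embeddings from the context length and chunk size. The `RefragConfig` class and its derived properties are hypothetical conveniences, and the position count assumes every chunk stays compressed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RefragConfig:
    context_length_s: int = 2048
    output_length_o: int = 2048
    chunk_size_k: int = 16

    @property
    def num_chunks(self):
        """Number of chunk embeddings when the whole context is compressed."""
        if self.context_length_s % self.chunk_size_k != 0:
            raise ValueError("context length must be a multiple of the chunk size")
        return self.context_length_s // self.chunk_size_k

    @property
    def decoder_positions(self):
        """Decoder positions for a fully compressed context plus the output tokens."""
        return self.num_chunks + self.output_length_o

for k in (16, 32):
    cfg = RefragConfig(chunk_size_k=k)
    print(k, cfg.num_chunks, cfg.decoder_positions)
# k=16: 128 chunks, 2176 positions; k=32: 64 chunks, 2112 positions
# (versus 2048 + 2048 = 4096 positions without compression)
```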

Compute: Not reported in the paper

Comparison to Prior Work

vs. CEPE: REFRAG supports 'compress anywhere' (flexible positioning) and uses RL for selective expansion, achieving a 3.75x larger TTFT speedup
vs. REPLUG: REFRAG focuses on latency reduction via embedding compression rather than just ensemble retrieval
vs. AutoCompressors [not cited in paper]: REFRAG integrates selective expansion via RL to recover information loss, rather than relying solely on fixed compression

Limitations

Requires continual pre-training (20B tokens) to align the encoder and decoder, which is resource-intensive
Performance regression observed at very high compression rates (e.g., k=64)
Current evaluation is primarily on LLaMA-2 architectures; generalization to other families requires verifying the CPT recipe

Reproducibility

Code: https://github.com/facebookresearch/refrag

Code will be available at https://github.com/facebookresearch/refrag. The work uses public datasets (SlimPajama, KILT, etc.) and public base models (LLaMA-2, RoBERTa). The pre-computed embedding strategy implies storage requirements that the paper does not detail.

📊 Experiments & Results

Evaluation Setup

Language modeling (perplexity) and Downstream RAG tasks

Benchmarks:

SlimPajama (Book/ArXiv) (Long-context Language Modeling)
KILT (NQ, FEVER, etc.) (Knowledge Intensive Language Tasks)
Multi-turn Conversation (TopiOCQA, ORConvQA, QReCC)

Metrics:

Perplexity
Time-To-First-Token (TTFT)
Throughput (tokens/sec)
Accuracy / Exact Match / F1

Statistical methodology: Not explicitly reported in the paper

Key Results

Latency and throughput experiments demonstrate significant speedups over LLaMA-2 and the strong baseline CEPE.

Benchmark	Metric	Baseline	This Paper	Δ
Internal Benchmarking (Context=16384)	Speedup vs LLaMA-2-7B	2.01	16.53	+14.52
Internal Benchmarking (Context=16384)	Speedup vs LLaMA-2-7B	1.04	8.59	+7.55
Internal Benchmarking (Context=16384)	Speedup vs LLaMA-2-7B	1.0	32.99	+31.99

Downstream RAG performance shows REFRAG can match or beat full-context models by enabling larger effective contexts within the same latency budget.

Benchmark	Metric	Baseline	This Paper	Δ
Average over 16 RAG tasks	Accuracy Gain (Strong Retriever)	0.0	1.22	+1.22
SlimPajama (Book/ArXiv)	Log-perplexity improvement	0.0	9.3	+9.3

Experiment Figures

Comparison of TTFT speedup vs Context Length for LLaMA, CEPE, and REFRAG.

Perplexity comparison of different selective compression policies (Random, Heuristic, RL).

Main Takeaways

Curriculum learning is critical for the reconstruction task; without it, the encoder fails to learn effective compression for multiple chunks.
RL-based selective compression consistently outperforms heuristic strategies (perplexity-based selection) and random selection, allowing REFRAG to recover performance by expanding only the most relevant chunks.
Under equal latency constraints, REFRAG outperforms standard LLaMA on RAG tasks because it can process significantly more retrieved passages (e.g., 8 vs 1) for the same computational cost.
The method scales effectively to longer contexts (16k), maintaining perplexity improvements where standard LLaMA-2 (limited to 4k) would fail or require truncation.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Retrieval-Augmented Generation (RAG) pipelines
KV Cache and its impact on inference latency
Reinforcement Learning (basics of policy optimization)

Key Terms

TTFT: Time-To-First-Token—the latency required to process the input prompt and generate the first token of the response

TTIT: Time-To-Inter-Token—the latency between generating subsequent tokens

KV Cache: Key-Value Cache—memory storage for intermediate attention representations in Transformers to avoid re-computation
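To give a sense of scale, the sketch below estimates KV cache size as a function of sequence length. The default values (32 layers, 32 key/value heads, head dimension 128, 2 bytes per value for fp16) are assumptions taken from the public LLaMA-2-7B configuration, not numbers reported in this summary, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of seq_len positions."""
    # Factor 2 covers keys and values; cost grows linearly with seq_len.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len

GIB = 1024 ** 3
full_gib = kv_cache_bytes(16384) / GIB              # 16k raw context tokens
compressed_gib = kv_cache_bytes(16384 // 16) / GIB  # same context at 1 embedding per 16 tokens
print(full_gib, compressed_gib)  # 8.0 0.5
```

Because the cache grows linearly with the number of positions, a 16x shorter decoder input gives a 16x smaller cache under these assumptions.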

Block-diagonal attention: An attention pattern where tokens primarily attend to their local neighborhood (e.g., within a passage) rather than globally across all passages
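The pattern can be visualised as a mask in which a position attends only to positions from the same passage. The sketch below builds such a mask for toy passage lengths; it illustrates the idealised pattern only (non-causal, with no query tokens), and `block_diagonal_mask` is a hypothetical helper rather than something REFRAG applies explicitly.

```python
import numpy as np

def block_diagonal_mask(passage_lengths):
    """Boolean (n, n) mask, True where two positions belong to the same passage."""
    passage_ids = np.repeat(np.arange(len(passage_lengths)), passage_lengths)
    return passage_ids[:, None] == passage_ids[None, :]

mask = block_diagonal_mask([2, 3, 1])  # three passages, 6 positions in total
print(mask.astype(int))
print(mask.sum(), "of", mask.size, "entries are non-zero")  # 14 of 36
```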

Curriculum learning: A training strategy where the model starts with easy tasks (reconstructing 1 chunk) and gradually moves to hard tasks (reconstructing L chunks)
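One simple way to realise such a schedule is to ramp the number of chunks per training example from 1 up to the maximum as training progresses. The linear ramp below is an assumption made for illustration; the paper's actual schedule may differ, and `curriculum_num_chunks` is a hypothetical helper.

```python
def curriculum_num_chunks(step, total_steps, max_chunks):
    """Number of chunks to reconstruct at a given step, ramping linearly from 1 to max_chunks."""
    if total_steps <= 0 or max_chunks < 1:
        raise ValueError("total_steps must be positive and max_chunks at least 1")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return 1 + int(progress * (max_chunks - 1))

schedule = [curriculum_num_chunks(s, total_steps=100, max_chunks=8)
            for s in (0, 25, 50, 75, 100)]
print(schedule)  # [1, 2, 4, 6, 8]
```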

CPT: Continual Pre-training—further training a base model on specific data (here, for compression alignment) before fine-tuning

SFT: Supervised Fine-Tuning—training the model on labeled task data

Perplexity: A measurement of how well a probability model predicts a sample; lower is better