Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

📝 Paper Summary

Modularized RAG pipeline, Benchmark datasets, Metrics and evaluation

Scaling the inference-time datastore size for retrieval-based language models monotonically improves performance and offers better compute-optimal trade-offs than scaling model parameters or pretraining data alone.

Core Problem

Current scaling laws focus on pretraining data and parameter counts and ignore inference-time datastore size, while existing large datastores (such as the one used by RETRO) are proprietary or lack comprehensive downstream evaluation.

Why it matters:

Training large LMs is prohibitively expensive; finding efficiency gains through retrieval could reduce compute costs
It is unknown how datastore scaling affects dominant retrieve-in-context approaches across diverse tasks beyond simple language modeling
Prior open-source datastores are small (Wiki-scale) or lack diverse domain coverage, limiting research on trillion-token retrieval

Concrete Example: A small Llama-2 7B model using a massive datastore outperforms a larger Llama-2 13B model (without retrieval) on knowledge-intensive tasks like TriviaQA, showing that external memory can substitute for model scale.

Key Novelty

Inference-Time Datastore Scaling Laws

Treats the size of the retrieval datastore as a primary scaling dimension alongside model size and pretraining data size
Demonstrates that indexing data for retrieval is more compute-efficient than training on it, allowing smaller models with large datastores to beat larger LM-only models
Introduces MassiveDS, a 1.4 trillion-token open-source datastore with diverse domains (code, math, science, web), and an efficient pipeline to study scaling without repeated expensive indexing

Architecture

Comparison between the naive datastore scaling pipeline and the proposed efficient MassiveDS pipeline.

Evaluation Highlights

Llama-2 7B with MassiveDS outperforms the larger Llama-2 13B LM-only baseline on TriviaQA and Natural Questions
Retrieval-based LMs achieve better compute-optimal performance than LM-only models, reaching lower perplexity/higher accuracy for the same training FLOPs
Datastore scaling shows no saturation up to 1.4T tokens for language modeling perplexity and knowledge-intensive QA tasks

Breakthrough Assessment

8/10

Provides the first comprehensive open-source study and dataset (MassiveDS) for trillion-token datastore scaling, establishing new scaling laws for RAG that challenge the 'scale parameters only' paradigm.

⚙️ Technical Details

Problem Definition

Setting: Retrieve-in-context language modeling (RIC-LM), where an LM generates output conditioned on documents retrieved from a scalable datastore

Inputs: Query x and a datastore D of size N

Outputs: Generated text y

Pipeline Flow

Datastore Construction (MassiveDS Pipeline)
Retrieval (Contriever)
Generation (RIC-LM)

System Modules

MassiveDS Pipeline

Efficiently simulate varying datastore sizes

Model or implementation: Python-based subsampling logic

Retriever

Identify top-k relevant documents from the datastore

Model or implementation: Contriever-MSMARCO (177M params)

Generator

Generate answer using query and retrieved context

Model or implementation: Llama-2 (7B, 13B), Llama-3, Pythia, OLMo
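The retriever step described above can be illustrated with a toy top-k search. This is a minimal sketch, assuming documents and queries have already been embedded and that relevance is scored by inner product; the vectors, the `top_k` helper, and the document ids are hypothetical, and a real system at this scale would use an approximate nearest-neighbor index rather than a full scan.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, score) pairs with the highest inner product.

    doc_vecs maps a hypothetical document id to its embedding vector.
    """
    scored = ((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


# Toy 2-dimensional embeddings (illustrative only).
docs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
    "d": [-1.0, 0.0],
}
hits = top_k([1.0, 0.2], docs, k=2)
```

The generator then receives only the texts of the returned documents, so the retriever and the LM can be swapped independently.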

Novel Architectural Elements

Inverted pipeline execution for scaling studies: a large candidate set (top-K, with K >> k) is retrieved once from the full corpus; subsampling and filtering are then applied to the retrieved candidates, and the top-k survivors are kept. This simulates smaller datastores without rebuilding indices.
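The reordering can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes membership in a subsampled datastore is decided by a seeded hash of a hypothetical document id, so that the same document is kept or dropped consistently across queries. The function names are made up for this example.

```python
import hashlib


def keep_doc(doc_id: str, fraction: float, seed: int = 0) -> bool:
    """Deterministically decide whether a document is in the subsampled datastore."""
    digest = hashlib.sha256(f"{seed}:{doc_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < fraction


def simulate_smaller_datastore(candidates, fraction, k=3, seed=0):
    """Simulate retrieval from a datastore subsampled to `fraction` of its size.

    candidates: (doc_id, score) pairs from ONE top-K search over the full index,
    with K much larger than k.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    survivors = [c for c in ranked if keep_doc(c[0], fraction, seed)]
    return survivors[:k]


# Hypothetical top-K results for a single query (K = 1000).
candidates = [(f"doc{i}", 1.0 - i / 1000) for i in range(1000)]
full = simulate_smaller_datastore(candidates, fraction=1.0)
tenth = simulate_smaller_datastore(candidates, fraction=0.1)
```

Because the expensive search runs once, each additional datastore size costs only a filter over K candidates. The approximation holds as long as K is large enough that at least k candidates survive the subsampling.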

Modeling

Base Model: Llama-2 (7B, 13B), Llama-3 (8B), Pythia (160M-12B), OLMo (1B, 7B)

Training Method: Inference-only RAG (Retrieve-in-Context)

Key Hyperparameters:

retrieval_chunk_size: 256 words
top_k_retrieval: 3
retriever_model_params: 177M
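As an illustration of the chunk size above, the following sketch splits a document into passages of at most 256 whitespace-separated words. The splitting rule (plain `str.split`, no overlap, a shorter final chunk) is an assumption for this example; the summary does not specify how word boundaries or trailing fragments are handled.

```python
def chunk_words(text: str, chunk_size: int = 256) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A synthetic 600-word document yields chunks of 256, 256, and 88 words.
document = " ".join(f"w{i}" for i in range(600))
chunks = chunk_words(document)
```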

Compute: Datastore construction: 1 forward pass per token (cheaper than training). Pretraining: 1 forward + 1 backward pass per token.
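A rough worked comparison makes the gap concrete. The sketch below assumes the common approximations of about 2 FLOPs per parameter per token for a forward pass and about 6 for a forward plus backward pass; these constants are not stated in this summary. The 7B-parameter model and 2T pretraining tokens are illustrative inputs, while the 177M retriever and 1.4T-token datastore come from the text above.

```python
def pretraining_flops(n_model_params: int, n_tokens: int) -> int:
    """Approximate training cost: forward + backward, ~6 FLOPs per param per token."""
    return 6 * n_model_params * n_tokens


def indexing_flops(n_retriever_params: int, n_tokens: int) -> int:
    """Approximate datastore cost: one forward pass, ~2 FLOPs per param per token."""
    return 2 * n_retriever_params * n_tokens


# Illustrative inputs (assumed for this example).
train = pretraining_flops(7_000_000_000, 2_000_000_000_000)
index = indexing_flops(177_000_000, 1_400_000_000_000)
ratio = train / index
print(f"pretraining: {train:.2e} FLOPs, indexing: {index:.2e} FLOPs, ratio: {ratio:.0f}x")
```

Under these assumptions, indexing the full datastore costs well under 1% of pretraining the 7B model. Two factors drive this: the encoder is much smaller than the LM, and no backward pass is needed.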

Comparison to Prior Work

vs. RETRO: MassiveDS is fully open-source (RETRO is proprietary); this paper evaluates downstream tasks (RETRO focused on perplexity)
vs. Sphere: MassiveDS is roughly 15x larger (1.4T vs. 90B tokens) and includes diverse domains (code, math) beyond web data
vs. KILT/Wikipedia: MassiveDS scales to over a trillion tokens, compared with billion-token Wikipedia datastores, enabling study of massive-scale retrieval effects

Limitations

Retrieval increases inference cost because of the longer context, but this cost is not included in the compute-optimal scaling analysis
Retrieval scaling shows mixed or negligible results on reasoning-heavy tasks like MMLU and MedQA
MassiveDS lacks specific textbooks or biomedical literature that might be needed for specialized reasoning tasks
Evaluation uses a fixed retriever (Contriever) rather than jointly training retriever and generator

Reproducibility

Code: https://github.com/RulinShao/retrieval-scaling

MassiveDS (1.4T tokens), its embeddings and index, and the code are open-sourced at https://github.com/RulinShao/retrieval-scaling. The paper uses public models (Llama, Pythia, OLMo). Exact compute resources (GPU hours) for the full study are not explicitly listed, although the efficient pipeline is described as reducing the cost of the study.

📊 Experiments & Results

Evaluation Setup

5-shot prompting with top-3 retrieved documents prepended to context
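The setup above can be sketched as a prompt builder. The template (`Question:` / `Answer:` labels, blank-line separators, documents placed before the demonstrations) is a hypothetical layout chosen for illustration; the summary only states that the top-3 documents are prepended and that 5 demonstrations are used.

```python
def build_prompt(question, retrieved_docs, few_shot_examples, k=3):
    """Prepend the top-k retrieved documents, then few-shot examples, then the query.

    retrieved_docs: document texts ordered from most to least relevant.
    few_shot_examples: (question, answer) demonstration pairs.
    """
    parts = list(retrieved_docs[:k])
    for demo_question, demo_answer in few_shot_examples:
        parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    # Leave the final answer slot open for the LM to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


docs = ["Passage one.", "Passage two.", "Passage three.", "Passage four."]
demos = [(f"Demo question {i}?", f"Demo answer {i}") for i in range(5)]
prompt = build_prompt("Who wrote Hamlet?", docs, demos)
```

Only the first three passages appear in `prompt`; the fourth is dropped by the `k=3` cutoff.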

Benchmarks:

RedPajama (Web) (Language Modeling Perplexity)
S2ORC (Science) (Language Modeling Perplexity)
TriviaQA (TQA) (General Knowledge QA)
Natural Questions (NQ) (Open Domain QA)
MMLU (Multi-task Reasoning)
MedQA (Medical QA)

Metrics:

Perplexity
Exact Match (EM)
Accuracy
Statistical methodology: Not explicitly reported in the paper
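For reference, two of the metrics above can be computed as in the sketch below. The answer normalization (lowercasing, stripping punctuation and English articles) follows a common open-domain QA convention and is an assumption here; the summary does not state which normalization the paper applies. Perplexity is the exponential of the mean negative log-likelihood per token.

```python
import math
import re
import string


def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 10)
em = exact_match("The Eiffel Tower.", ["Eiffel Tower", "Tour Eiffel"])
```

Lower perplexity is better, while EM and accuracy are higher-is-better. This is why the Δ column in a results table can be negative for an improvement in perplexity.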

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Language modeling perplexity consistently improves as datastore size increases, with no saturation observed up to 1.4T tokens.
RedPajama (Web)	Perplexity	5.5	4.2	-1.3
Downstream task performance on knowledge-intensive benchmarks shows small retrieval-augmented models outperforming larger LM-only models.
TriviaQA	Exact Match	0.68	0.72	+0.04
Natural Questions	Exact Match	0.28	0.36	+0.08
On reasoning-heavy tasks, the benefits of datastore scaling are less pronounced or require stronger base models.
MMLU	Accuracy	0.45	0.47	+0.02

Experiment Figures

Scaling curves for perplexity and downstream tasks across different models (Llama-2, Llama-3, etc.) as a function of datastore size.

Compute-optimal scaling curves comparing Retrieval-based LMs vs. LM-only models plotted against Training FLOPs.

Main Takeaways

Datastore scaling monotonically improves language modeling perplexity and knowledge-intensive QA performance without obvious saturation
Retrieval-augmented small models (e.g., 7B) can outperform significantly larger LM-only models (e.g., 13B) on factual tasks
Compute-optimal scaling analysis suggests that offloading FLOPs from pretraining to datastore construction yields better performance for the same budget
Reasoning-heavy tasks (MMLU, MedQA) benefit less from current datastore scaling, possibly due to the need for more specialized data or better reasoning capabilities in the base LM

📚 Prerequisite Knowledge

Prerequisites

Understanding of scaling laws (Kaplan et al., Chinchilla)
Retrieval-Augmented Generation (RAG) architectures
Dense retrieval and vector indexing

Key Terms

MassiveDS: A 1.4 trillion-token open-source datastore constructed from diverse web and domain-specific sources (books, code, papers) for retrieval scaling research

RIC-LM: Retrieve-in-context Language Models—models that augment generation by prepending retrieved documents to the input context without architectural modification

compute-optimal scaling: Analysis determining the best allocation of computational budget (FLOPs) between model size, pretraining data, and datastore size to maximize performance

Contriever: A dense retrieval model trained using contrastive learning to match queries with relevant documents

perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

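FLOPs: Floating-point operations, a count of the arithmetic operations a computation requires; used here as a measure of computational cost to compare training vs. indexing efficiency (not to be confused with FLOPS, operations per second, which measures hardware speed)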

subsampling: The process of randomly selecting a fraction of the full datastore to simulate smaller datastore sizes for scaling analysis

reranking: A second stage in retrieval where a more expensive model re-scores the initial set of retrieved documents to improve relevance