RETRO: Improving language models by retrieving from trillions of tokens

📝 Paper Summary

Modularized RAG pipeline, Large-scale Language Modeling

RETRO enhances autoregressive language models by retrieving document chunks from a massive database via a frozen BERT retriever and integrating them through chunked cross-attention.

Core Problem

Increasing language model size improves performance but couples computation with memorization, making it computationally expensive to scale knowledge and hard to update or inspect memory.

Why it matters:

Training large models (100B+ parameters) is prohibitively expensive in terms of energy and compute
Static training data leads to model obsolescence, and re-training to add new knowledge is costly
Large models are prone to hallucination and lack interpretability regarding the source of their factual assertions

Concrete Example: A standard 7B-parameter transformer might fail to complete a specific quote or fact from a niche document that is not well represented in its weights. RETRO can retrieve the exact chunk containing the quote from a 2-trillion-token database at inference time and use it to complete the text accurately.

Key Novelty

Retrieval-Enhanced Transformer (RETRO)

Decouples memory from computation by accessing a 2-trillion token database via dense retrieval rather than storing all knowledge in model weights
Retrieves at the granularity of contiguous token chunks (64 tokens) rather than individual tokens or whole documents, enabling efficient scaling
Uses a chunked cross-attention mechanism to integrate retrieved neighbors into the autoregressive generation process while maintaining causality

Evaluation Highlights

RETRO 7.5B achieves performance comparable to Jurassic-1 (178B) and GPT-3 on the Pile dataset despite using roughly 25x fewer parameters
Outperforms baseline transformers of the same size across all scales (150M to 7B parameters) on C4 and Wikitext103
Achieves state-of-the-art perplexity on Wikitext103 (3.92) when retrieving from the full MassiveText database

Breakthrough Assessment

9/10

Demonstrates that massive-scale retrieval (trillions of tokens) can replace massive parameter counts (hundreds of billions), fundamentally changing the scaling laws for language modeling.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling conditioned on retrieved text chunks

Inputs: Sequence of tokens X split into chunks

Outputs: Next token probabilities

Pipeline Flow

Input Splitter (splits sequence into chunks)
Frozen BERT Retriever (retrieves k nearest neighbors for previous chunk)
Retrieval Encoder (encodes retrieved neighbors)
RETRO Decoder (integrates encoded neighbors via Chunked Cross-Attention to predict current chunk)
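
The first two steps of the flow above can be sketched as follows. This is a minimal illustration, assuming token ids are plain integers; the helper names `split_into_chunks` and `retrieval_query_for` are hypothetical and not taken from the paper.

```python
from typing import List, Optional

CHUNK_SIZE = 64  # chunk length stated in this summary


def split_into_chunks(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a token sequence into contiguous chunks (the last one may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def retrieval_query_for(chunk_index: int) -> Optional[int]:
    """Index of the chunk whose neighbors are used when predicting `chunk_index`.

    Per the pipeline flow, neighbors are retrieved for the previous chunk,
    so the first chunk has no retrieved context.
    """
    return chunk_index - 1 if chunk_index > 0 else None


tokens = list(range(200))  # toy token ids
chunks = split_into_chunks(tokens)
# Pairs of (chunk being predicted, chunk used as the retrieval query).
plan = [(u, retrieval_query_for(u)) for u in range(len(chunks))]
```

With 200 toy tokens this yields four chunks (three full chunks and one of 8 tokens), and every chunk after the first is paired with its predecessor as the retrieval query.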

System Modules

Frozen BERT Retriever

Compute embeddings for input chunks and retrieve nearest neighbors from the key-value database

Model or implementation: Frozen BERT
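
A brute-force sketch of the key-value lookup may help make this concrete. Everything here is a toy: `toy_embed` is a hypothetical stand-in for the frozen BERT encoder, squared L2 distance is assumed as the similarity measure, and the exhaustive sort replaces the approximate nearest-neighbor index a real system would need at this scale.

```python
from typing import List, Tuple

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for the frozen BERT encoder: hashed character counts."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_database(chunks: List[str]) -> List[Tuple[Vector, str]]:
    """Keys are embeddings computed once (the encoder is frozen); values are the chunks."""
    return [(toy_embed(c), c) for c in chunks]


def retrieve(query: str, database: List[Tuple[Vector, str]], k: int = 2) -> List[str]:
    """Return the k chunks whose keys are closest to the query embedding."""
    q = toy_embed(query)
    ranked = sorted(database, key=lambda kv: squared_l2(q, kv[0]))
    return [chunk for _, chunk in ranked[:k]]


db = build_database([
    "the cat sat on the mat",
    "stock prices fell sharply",
    "the cat sat on the hat",
])
neighbors = retrieve("the cat sat on the mat", db, k=2)
```

Because the encoder is frozen, the keys never need to be recomputed during language-model training, which is what makes a precomputed database practical.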

Retrieval Encoder

Encode retrieved neighbors into dense representations, conditioned on the retrieving chunk

Model or implementation: Bi-directional Transformer (2 layers, 896 hidden size)

RETRO Decoder

Autoregressive generation of the current chunk using both previous tokens and encoded retrieved neighbors

Model or implementation: Transformer Decoder (interleaved RETRO blocks)

Novel Architectural Elements

Chunked Cross-Attention (CCA): A layer that attends to encoded retrieved neighbors, specifically designed to handle the alignment between input chunks and retrieved data while preserving autoregressivity
Split-Chunk Retrieval integration: Retrieval is performed per-chunk (64 tokens), and the CCA integrates this data into the decoder at specific layers
Retro-fitting: The ability to freeze a pre-trained standard transformer and only train the added CCA and encoder weights to convert it into a RETRO model
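
The alignment that chunked cross-attention must respect can be illustrated with an index mapping. The sketch below assumes the shifted alignment described in the RETRO paper, where the last token of a chunk and the first 63 tokens of the next chunk attend to the neighbors of the earlier chunk; the function names are hypothetical and positions are 0-indexed.

```python
from typing import Optional

CHUNK_SIZE = 64


def neighbor_chunk_for_position(pos: int, chunk_size: int = CHUNK_SIZE) -> Optional[int]:
    """Chunk whose retrieved neighbors the token at `pos` may attend to.

    Assumed alignment: the last token of chunk u and the first
    chunk_size - 1 tokens of chunk u + 1 attend to the neighbors of chunk u.
    The first chunk_size - 1 tokens of the sequence attend to nothing.
    """
    u = (pos + 1) // chunk_size - 1
    return u if u >= 0 else None


def is_causal(pos: int, chunk_size: int = CHUNK_SIZE) -> bool:
    """Check that the retrieval query only used tokens at or before `pos`."""
    u = neighbor_chunk_for_position(pos, chunk_size)
    if u is None:
        return True
    last_query_token = (u + 1) * chunk_size - 1  # final token of chunk u
    return last_query_token <= pos


mapping = {p: neighbor_chunk_for_position(p) for p in (0, 62, 63, 126, 127)}
```

Under this mapping a token never sees neighbors that were retrieved using tokens from its own future, which is the sense in which the layer preserves autoregressivity.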

Modeling

Base Model: Transformer (Decoder-only, similar to GPT-2/3 but with RMSNorm and relative position encodings)

Training Method: Training from scratch, or 'Retro-fitting' (keeping a pre-trained baseline's weights frozen and training only the added retrieval encoder and cross-attention weights)

Objective Functions:

Purpose: Minimize negative log-likelihood of the next token.

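Formally: L(X | theta, D) = - sum_i log p(x_i | x_{<i}, RETRO_D(C_{u'}) for all u' < u(i)), where u(i) is the index of the chunk containing token x_i and RETRO_D(C_{u'}) denotes the neighbors retrieved from database D for chunk C_{u'}.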
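
As a small worked instance of the objective, the snippet below sums the negative log-probabilities of a toy sequence. The probabilities are made up for illustration; in the model each one would already be conditioned on the preceding tokens and on the neighbors retrieved for earlier chunks.

```python
import math
from typing import List


def negative_log_likelihood(token_probs: List[float]) -> float:
    """Sum of -log p(x_i | context) over the sequence, in nats."""
    return -sum(math.log(p) for p in token_probs)


# Toy per-token probabilities assigned to the correct next token.
probs = [0.5, 0.25, 0.125]
nll = negative_log_likelihood(probs)  # equals 6 * ln(2)
```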

Trainable Parameters: 7.5B parameters (largest model)

Training Data:

Training set: MassiveText (Web, Books, News, Wikipedia, GitHub)
Retrieval Database: 2 trillion tokens (chunks of 64 tokens)
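
A quick back-of-the-envelope calculation shows what these numbers imply for the number of database keys. The chunk count follows directly from the figures above; the storage estimate relies on an embedding width and float size that are assumptions for illustration, not values given in this summary.

```python
DATABASE_TOKENS = 2_000_000_000_000  # 2 trillion tokens, as stated above
CHUNK_SIZE = 64

# One key per 64-token chunk.
num_chunks = DATABASE_TOKENS // CHUNK_SIZE

# Hypothetical storage estimate for uncompressed keys (assumed values).
ASSUMED_EMBED_DIM = 768
ASSUMED_BYTES_PER_FLOAT = 4
key_bytes = num_chunks * ASSUMED_EMBED_DIM * ASSUMED_BYTES_PER_FLOAT
key_terabytes = key_bytes / 1e12
```

Under these assumptions the database holds about 31 billion keys, and storing them uncompressed would take on the order of 96 TB, which is why an approximate, compressed index is a natural fit.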

Key Hyperparameters:

chunk_size: 64
neighbor_count_k: 2
retrieval_encoder_layers: 2
retrieval_encoder_hidden_size: 896
CCA_interval: Every 3 blocks starting from layer 6
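
The CCA_interval setting can be read as a rule for picking decoder layers. The helper below is a hypothetical sketch that assumes layers are counted from 1 and that the interval applies through the final layer.

```python
from typing import List


def cca_layers(num_layers: int, start: int = 6, interval: int = 3) -> List[int]:
    """1-indexed decoder layers that carry a chunked cross-attention block."""
    return list(range(start, num_layers + 1, interval))


layers_for_12 = cca_layers(12)
```

For a 12-layer decoder this selects layers 6, 9 and 12; a decoder with fewer than 6 layers would get no CCA blocks under this rule.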

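Compute: A SCaNN query against the database takes ~10 ms, and this cost is amortized over the 64 tokens of a chunk. Inference cost is the standard self-attention cost, quadratic in sequence length, plus a retrieval cost that grows linearly with sequence length and is negligible by comparison.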

Comparison to Prior Work

vs. kNN-LM: Retrievals integrated via cross-attention (deep fusion) rather than probability interpolation; retrieves chunks vs. tokens
vs. RAG: Retrieves repeatedly throughout generation (per chunk) rather than once per prompt; scales to trillions of tokens
vs. Jurassic-1/Gopher: Achieves comparable performance with 25x fewer parameters by leveraging retrieval
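
To make the contrast with kNN-LM concrete, the sketch below shows the probability interpolation that kNN-LM-style models apply at the output layer. The distributions and the mixing weight `lam` are toy values chosen for illustration. RETRO has no such step: retrieved text enters the network through cross-attention, and the model emits a single distribution.

```python
from typing import Dict


def interpolate(p_lm: Dict[str, float], p_knn: Dict[str, float], lam: float) -> Dict[str, float]:
    """kNN-LM-style mixing: lam * p_knn + (1 - lam) * p_lm over the joint vocabulary."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1.0 - lam) * p_lm.get(w, 0.0) for w in vocab}


p_lm = {"paris": 0.5, "london": 0.5}  # toy language-model distribution
p_knn = {"paris": 1.0}                # toy distribution from retrieved neighbors
mixed = interpolate(p_lm, p_knn, lam=0.25)
```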

Limitations

Retrieval database construction is computationally heavy (requires pre-computing BERT embeddings for trillions of tokens)
Potential for test set leakage is higher due to direct access to training data in the retrieval database
Performance on some tasks (e.g., math) does not improve as much, possibly due to poor retrieval relevance
Privacy concerns: model can directly copy training data if present in the retrieval database

Reproducibility

Code not provided. Training data (MassiveText) is internal, but components like C4 and The Pile are public. Uses SCaNN for retrieval. Pre-computed BERT embeddings required for the database.

📊 Experiments & Results

Evaluation Setup

Language modeling (perplexity/bpb) and downstream QA

Benchmarks:

The Pile (Language Modeling)
Wikitext103 (Language Modeling)
C4 (Language Modeling)
Natural Questions (Question Answering)

Metrics:

Bits-per-byte (bpb)
Perplexity
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
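
The three metrics above can be computed as in the following sketch. The answer normalization in `exact_match` (lowercasing, stripping punctuation and English articles) is a common convention assumed here, not a procedure confirmed by this summary.

```python
import math
import re
import string


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(total_nll_nats / num_tokens)


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Total NLL converted from nats to bits, divided by the UTF-8 byte count."""
    return total_nll_nats / (math.log(2) * num_bytes)


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    def normalize(s: str) -> str:
        s = s.lower()
        s = "".join(ch for ch in s if ch not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())
    return normalize(prediction) == normalize(reference)


total_nll = 10 * math.log(2)          # toy value: 10 bits of total loss
ppl = perplexity(total_nll, num_tokens=5)
bpb = bits_per_byte(total_nll, num_bytes=20)
em = exact_match("The Eiffel Tower.", "eiffel tower")
```

Because bits-per-byte divides by the byte count rather than the token count, it allows comparison between models that use different tokenizers, whereas perplexity does not.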

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
The Pile (average)	bpb	0.78	0.68	-0.10
Wikitext103	Perplexity	22.96	3.92	-19.04
Wikitext103	Perplexity	16.12	18.97	+2.85
Natural Questions	Test Accuracy (Exact Match)	30.4	45.5	+15.1
Natural Questions	Test Accuracy (Exact Match)	44.5	45.5	+1.0

Main Takeaways

RETRO provides consistent performance gains across model scales (150M to 7B), acting like a constant multiplier to model size (~10x parameter efficiency)
Performance improves monotonically with the size of the retrieval database (from billions to trillions of tokens)
Increasing the number of neighbors (k) at inference time improves performance, even if trained with fewer neighbors (e.g., trained with k=2, evaluated with k=10)
Retro-fitting (adding retrieval to a pre-trained model and training only the new retrieval weights) works surprisingly well, approaching the performance of RETRO models trained from scratch while using only about 3% of the pre-training budget

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Encoder-Decoder)
Autoregressive language modeling
K-Nearest Neighbors (kNN) search
Dense vector embeddings (BERT)

Key Terms

RETRO: Retrieval-Enhanced Transformer—the proposed architecture that retrieves text chunks to augment generation

Chunked Cross-Attention (CCA): A mechanism that allows the model to attend to retrieved text chunks corresponding to the current input chunk

SCaNN: Scalable Nearest Neighbors—a library for efficient vector similarity search used to query the massive database

MassiveText: A large multilingual text dataset (5 trillion tokens) used for training and constructing the retrieval database

bpb: Bits-per-byte—a metric for language modeling performance, independent of the tokenizer vocabulary size

leakage: When evaluation data is inadvertently present in the training set, artificially inflating performance scores

frozen retriever: Using a pre-trained embedding model (like BERT) that is not updated during the training of the main language model

The Pile: A diverse, open-source language modeling dataset consisting of 22 smaller datasets (e.g., PubMed, ArXiv, GitHub)

autoregressivity: The property where a model predicts the next step based solely on previous steps, maintaining causal order

DPR: Dense Passage Retrieval—a method using dual encoders to retrieve relevant documents for open-domain question answering

perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance