
RETRO: Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud et al.
DeepMind
arXiv, 12/2021 (2022)
RAG Pretraining Memory QA Factuality

📝 Paper Summary

Modularized RAG pipeline Large-scale Language Modeling
RETRO enhances autoregressive language models by retrieving document chunks from a massive database via a frozen BERT retriever and integrating them through chunked cross-attention.
Core Problem
Increasing language model size improves performance but couples computation with memorization, making it computationally expensive to scale knowledge and hard to update or inspect memory.
Why it matters:
  • Training large models (100B+ parameters) is prohibitively expensive in terms of energy and compute
  • Static training data leads to model obsolescence, and re-training to add new knowledge is costly
  • Large models are prone to hallucination and lack interpretability regarding the source of their factual assertions
Concrete Example: A standard 7B parameter transformer might fail to complete a specific quote or fact from a niche document not deeply embedded in its weights. RETRO can retrieve the exact chunk containing the quote from a 2-trillion token database at inference time to complete it accurately.
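The retrieval step in this example can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `embed` is a hypothetical stand-in for the frozen BERT encoder (a fixed hashed bag-of-tokens vector), whitespace-split words stand in for tokens, and `CHUNK_SIZE` is shrunk from RETRO's 64 tokens so the data stays readable.

```python
import hashlib
import math

CHUNK_SIZE = 4   # RETRO uses 64-token chunks; shrunk here for readability
DIM = 256        # size of the toy embedding (assumption, not from the paper)

def split_into_chunks(tokens, size=CHUNK_SIZE):
    """Split a token list into contiguous, non-overlapping chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk, dim=DIM):
    """Hypothetical stand-in for a frozen encoder: hashed bag-of-tokens, L2-normalised."""
    vec = [0.0] * dim
    for tok in chunk:
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents):
    """Precompute (embedding, chunk) pairs once; a frozen encoder never invalidates them."""
    index = []
    for doc in documents:
        for chunk in split_into_chunks(doc.split()):
            index.append((embed(chunk), chunk))
    return index

def retrieve(query_chunk, index, k=2):
    """Return the k database chunks with the highest cosine similarity to the query chunk."""
    q = embed(query_chunk)
    scored = [(sum(a * b for a, b in zip(q, emb)), chunk) for emb, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

database = [
    "the quick brown fox jumps over the lazy dog",
    "retrieval decouples memory from computation in language models",
]
index = build_index(database)
neighbors = retrieve(["retrieval", "decouples", "memory", "from"], index, k=1)
```

Because the encoder is frozen, the index is built once and can be extended with new documents without retraining the language model, which is what makes the memory updatable.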
Key Novelty
Retrieval-Enhanced Transformer (RETRO)
  • Decouples memory from computation by accessing a 2-trillion token database via dense retrieval rather than storing all knowledge in model weights
  • Retrieves at the granularity of contiguous token chunks (64 tokens) rather than individual tokens or whole documents, enabling efficient scaling
  • Uses a chunked cross-attention mechanism to integrate retrieved neighbors into the autoregressive generation process while maintaining causality
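How chunked cross-attention keeps generation causal can be illustrated with a small position-to-chunk mapping. The sketch below assumes the shifted alignment described in the paper: neighbors retrieved for chunk u are attended to by the last token of chunk u and the first m-1 tokens of chunk u+1, so no position uses neighbors fetched with tokens it has not yet seen. The function names are hypothetical, and the chunk size is 4 rather than 64 for readability.

```python
def neighbor_chunk_for_position(pos, chunk_size):
    """Index of the chunk whose retrieved neighbors position `pos` (0-based) may attend to.

    Returns None for the first chunk_size - 1 positions, which have no
    fully observed chunk to retrieve with yet.
    """
    idx = (pos + 1) // chunk_size - 1
    return idx if idx >= 0 else None

def is_causal(pos, chunk_idx, chunk_size):
    """True if every token of chunk `chunk_idx` is at or before position `pos`."""
    if chunk_idx is None:
        return True
    last_token_of_chunk = (chunk_idx + 1) * chunk_size - 1
    return last_token_of_chunk <= pos

def cross_attention_plan(seq_len, chunk_size):
    """For each position, the chunk whose neighbors it cross-attends to."""
    return [neighbor_chunk_for_position(p, chunk_size) for p in range(seq_len)]

plan = cross_attention_plan(seq_len=12, chunk_size=4)
# plan == [None, None, None, 0, 0, 0, 0, 1, 1, 1, 1, 2]
```

The retrieval query for a chunk is built only from that chunk's own tokens, so attending to its neighbors from the chunk's last token onward never leaks future tokens into a prediction.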
Evaluation Highlights
  • RETRO 7.5B achieves performance comparable to Jurassic-1 (178B) and GPT-3 (175B) on the Pile dataset despite using 25x fewer parameters
  • Outperforms baseline transformers of the same size across all scales (150M to 7B parameters) on C4 and Wikitext103
  • Achieves state-of-the-art perplexity on Wikitext103 (3.92) when retrieving from the full MassiveText database
Breakthrough Assessment
9/10
Demonstrates that massive-scale retrieval (trillions of tokens) can replace massive parameter counts (hundreds of billions), fundamentally changing the scaling laws for language modeling.