Atlas: Few-shot Learning with Retrieval Augmented Language Models

📝 Paper Summary

Modularized RAG pipeline, Few-shot learning

Atlas is a retrieval-augmented language model that achieves state-of-the-art few-shot performance on knowledge-intensive tasks by jointly pre-training the retriever and reader components.

Core Problem

Large language models require massive parameter counts to store knowledge for few-shot tasks, but it is unclear if this memorization is strictly necessary or efficient.

Why it matters:

Scaling parameter count for knowledge storage is computationally expensive and inefficient compared to retrieving from external memory
Prior retrieval-augmented models had not demonstrated compelling few-shot learning capabilities comparable to closed-book LLMs like GPT-3
Decoupling memorization from reasoning allows for smaller, more updateable, and interpretable models

Concrete Example: For the query 'Where is the Bermuda Triangle?', a closed-book model must rely on internal weights. If it hasn't memorized the fact, it fails. Atlas retrieves documents about the Bermuda Triangle from a corpus and uses them to generate the answer 'Western part of the North Atlantic Ocean', achieving high accuracy with 50x fewer parameters than PaLM.

Key Novelty

Joint Pre-training for Few-shot Retrieval Augmented Generation

Jointly pre-trains a dense retriever and a sequence-to-sequence reader using unsupervised objectives like Perplexity Distillation, allowing the retriever to learn what documents help the language model
Demonstrates that retrieval-augmented models can match or beat massive closed-book LLMs (like PaLM 540B) on few-shot tasks with significantly fewer parameters (11B)
Investigates efficient fine-tuning strategies like query-side fine-tuning and re-ranking to handle index freshness without full re-indexing

Architecture

Overview of the Atlas framework, showing the retrieval-augmented workflow during pre-training (Masked Language Modeling) and few-shot fine-tuning (QA/Fact Checking).

Evaluation Highlights

Atlas-11B achieves 42.4% accuracy on Natural Questions with only 64 training examples, outperforming PaLM-540B (39.6%) despite having 50x fewer parameters
Sets new state-of-the-art on full-dataset Natural Questions (64.0%) and TriviaQA (84.7%), surpassing prior bests by over 8 points
On 5-shot MMLU, Atlas-11B achieves 43.4% accuracy with standard inference and 47.9% with de-biasing; the de-biased result exceeds GPT-3's 43.9% while using about 15x fewer parameters

Breakthrough Assessment

9/10

Establishes Atlas as the standard for retrieval-augmented few-shot learning. It convincingly demonstrates that retrieval can substitute for massive parameter scale on knowledge-intensive tasks, outperforming models 50x its size.

⚙️ Technical Details

Problem Definition

Setting: Text-to-text generation conditioned on retrieved documents

Inputs: Text query q (e.g., question or claim)

Outputs: Text output a (e.g., answer or verdict)

Pipeline Flow

Retriever (Retrieves top-k documents based on query)
Reader (Processes query + documents to generate output)
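
The two-stage flow above can be sketched in a few lines of plain Python. This is a toy illustration, not the Atlas implementation: the bag-of-words `toy_embed`, the `toy_reader` that simply echoes the best passage, and the small `VOCAB` and `PASSAGES` lists are hypothetical stand-ins for the Contriever dual encoder, the T5 reader, and the passage index.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Index = List[Tuple[str, Vector]]  # (passage text, passage embedding)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product similarity between a query and a passage embedding."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Vector, index: Index, k: int) -> List[str]:
    """Return the k passages with the highest dot-product score."""
    ranked = sorted(index, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [passage for passage, _ in ranked[:k]]


def answer(query: str,
           embed_query: Callable[[str], Vector],
           index: Index,
           reader: Callable[[str, List[str]], str],
           k: int = 20) -> str:
    """Retriever -> Reader: fetch top-k passages, then generate from them."""
    passages = retrieve(embed_query(query), index, k)
    return reader(query, passages)


# --- Hypothetical toy components (assumptions, for illustration only) ---
VOCAB = ["bermuda", "triangle", "atlantic", "paris", "france"]


def toy_embed(text: str) -> Vector:
    """Bag-of-words counts over a tiny vocabulary (stand-in for a dense encoder)."""
    words = text.lower().replace("?", " ").replace(".", " ").split()
    return [float(words.count(term)) for term in VOCAB]


def toy_reader(query: str, passages: List[str]) -> str:
    """Stand-in for the seq2seq reader: returns the top-ranked passage."""
    return passages[0]


PASSAGES = [
    "The Bermuda Triangle is a region in the western part of the North Atlantic Ocean.",
    "Paris is the capital of France.",
]
INDEX: Index = [(p, toy_embed(p)) for p in PASSAGES]

result = answer("Where is the Bermuda Triangle?", toy_embed, INDEX, toy_reader, k=1)
```

In the real system both components are neural networks and the index holds hundreds of millions of passages, so the exhaustive `sorted` call here would be replaced by a dedicated nearest-neighbour search.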

System Modules

Retriever

Retrieve relevant documents from the index

Model or implementation: Contriever (Dual-encoder based on BERT-base)

Language Model (Reader)

Generate the answer conditioning on retrieved documents

Model or implementation: T5 sequence-to-sequence (Fusion-in-Decoder architecture)
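
A rough sketch of the Fusion-in-Decoder idea, assuming a simple "question: ... context: ..." input template (the exact template is not given in this summary). Each passage is paired with the query and encoded on its own; the encoder outputs are then concatenated so the decoder can attend over all passages at once. `toy_encode` is a hypothetical placeholder that emits one fake hidden state per whitespace token.

```python
from typing import List


def build_fid_inputs(query: str, passages: List[str]) -> List[str]:
    """One encoder input per passage; the template is an assumption."""
    return [f"question: {query} context: {p}" for p in passages]


def toy_encode(text: str) -> List[List[float]]:
    """Placeholder encoder: one 1-dimensional 'hidden state' per token."""
    return [[float(len(token))] for token in text.split()]


def fuse(encoded: List[List[List[float]]]) -> List[List[float]]:
    """Concatenate per-passage encoder outputs along the sequence axis."""
    return [state for sequence in encoded for state in sequence]


inputs = build_fid_inputs("Where is the Bermuda Triangle?",
                          ["passage one text", "passage two"])
encoded = [toy_encode(text) for text in inputs]  # encoded independently
decoder_memory = fuse(encoded)                   # decoder attends over this
```

Because the passages are encoded independently, encoder self-attention cost grows linearly with the number of passages rather than quadratically in their combined length; only the decoder sees the full concatenation.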

Novel Architectural Elements

Joint pre-training loop where the retriever learns from the language model's signals (Perplexity Distillation) using unsupervised pretext tasks like MLM
Integration of Contriever (unsupervised dense retriever) as the backbone for the retrieval module
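
The MLM pretext task mentioned above can be illustrated with a small span-masking helper. This is a sketch under assumptions: the `<extra_id_N>` sentinel names follow the T5 convention, and the spans are passed in explicitly because the summary does not state how they are sampled. The masked text plays the role of the query (used both for retrieval and as reader input), and the removed spans form the generation target.

```python
from typing import List, Tuple


def mask_spans(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each (start, end) half-open span with a sentinel token.

    `spans` must be sorted and non-overlapping. Returns (query, target).
    """
    query: List[str] = []
    target: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"      # T5-style sentinel (assumed format)
        query.extend(tokens[cursor:start])  # keep the unmasked text
        query.append(sentinel)              # mark where the span was removed
        target.append(sentinel)             # target lists sentinel + span
        target.extend(tokens[start:end])
        cursor = end
    query.extend(tokens[cursor:])
    return " ".join(query), " ".join(target)


tokens = "The Bermuda Triangle is in the North Atlantic Ocean".split()
query, target = mask_spans(tokens, [(1, 3), (6, 8)])
# query  -> "The <extra_id_0> is in the <extra_id_1> Ocean"
# target -> "<extra_id_0> Bermuda Triangle <extra_id_1> North Atlantic"
```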

Modeling

Base Model: T5 (Encoder-Decoder) for Reader, BERT-base for Retriever

Training Method: Joint pre-training followed by few-shot or full fine-tuning

Objective Functions:

Purpose: Train the retriever to select documents that improve LM perplexity (Perplexity Distillation).

Formally: Minimize the KL-divergence between the retriever distribution p_retr(d|q) and the LM posterior over the retrieved documents, which is proportional to p_LM(a|d,q) under a uniform prior over documents.

Purpose: Pretext task for joint training.

Formally: Masked Language Modeling (MLM), where spans are masked in the query and the model generates them.

Purpose: Alternative retriever loss (ADist).

Formally: Distill aggregate cross-attention scores from the LM into the retriever.

Purpose: Alternative retriever loss (EMDR2).

Formally: Expectation-Maximization treating documents as latent variables.
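
A minimal sketch of the Perplexity Distillation objective listed above, in plain Python. It assumes a uniform prior over the k retrieved documents, so the LM posterior over documents is a softmax of the per-document log-likelihoods log p_LM(a|d,q), and it treats the retriever's similarity scores as logits for p_retr(d|q). The direction KL(p_LM || p_retr), the function names, and the omission of a softmax temperature are assumptions of this sketch rather than details stated in this summary.

```python
import math
from typing import List


def log_softmax(xs: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def pdist_loss(retriever_scores: List[float],
               lm_log_likelihoods: List[float]) -> float:
    """KL(p_LM || p_retr) over the k retrieved documents.

    retriever_scores[j]   : query-document similarity for document j
    lm_log_likelihoods[j] : log p_LM(a | d_j, q), the reader's log-likelihood
                            of the target when conditioned on document j
    """
    log_p_retr = log_softmax(retriever_scores)
    log_p_lm = log_softmax(lm_log_likelihoods)  # posterior under a uniform prior
    return sum(math.exp(lt) * (lt - lr) for lt, lr in zip(log_p_lm, log_p_retr))


# The retriever ranks document 0 first, but the LM finds document 1 more useful.
misaligned = pdist_loss([2.0, 0.0], [-5.0, -1.0])
# After the retriever's ranking agrees with the LM, the loss is smaller.
aligned = pdist_loss([0.0, 2.0], [-5.0, -1.0])
```

In a training loop the LM posterior would typically be treated as a fixed target, so that gradients from this loss update only the retriever.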

Adaptation: Full fine-tuning or Query-side fine-tuning (updating only query encoder)
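
Query-side fine-tuning can be pictured with a toy example: the document embeddings stored in the index are frozen, and only the query encoder's parameters receive gradient updates, so the index never has to be rebuilt. In this sketch the "query encoder" is a hypothetical element-wise scaling vector `w` (the real one is a BERT-base tower), and the loss is a softmax cross-entropy that pushes the gold document's score up; both choices are assumptions made for illustration.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scores(w: List[float], x: List[float], doc_vecs: List[List[float]]) -> List[float]:
    """Dot product between the encoded query (w * x) and each frozen doc vector."""
    q = [wi * xi for wi, xi in zip(w, x)]
    return [sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs]


def query_side_step(w: List[float], x: List[float], doc_vecs: List[List[float]],
                    gold: int, lr: float = 0.5) -> List[float]:
    """One gradient step on -log softmax(scores)[gold] with respect to w only."""
    p = softmax(scores(w, x, doc_vecs))
    grad = [
        sum((p[j] - (1.0 if j == gold else 0.0)) * x[i] * doc_vecs[j][i]
            for j in range(len(doc_vecs)))
        for i in range(len(w))
    ]
    return [wi - lr * gi for wi, gi in zip(w, grad)]


doc_vecs = [[1.0, 0.0], [0.0, 1.0]]   # precomputed index, never modified
x = [1.0, 1.0]                        # raw query features
w = [1.0, 1.0]                        # trainable query-encoder parameters
before = scores(w, x, doc_vecs)
w_new = query_side_step(w, x, doc_vecs, gold=1)
after = scores(w_new, x, doc_vecs)    # gold document now scores higher
```

The trade-off is that the document side cannot adapt to the task, which is why this strategy is presented as an option for few-shot settings where index refreshes would dominate the cost.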

Trainable Parameters: From 770M to 11B parameters (T5-Large to T5-XXL sizes)

Training Data:

Pre-training: Wikipedia (Dec 2021) + Common Crawl (2020-10)
Index: 37M Wikipedia passages + 350M Common Crawl passages

Key Hyperparameters:

learning_rate_reader: 1e-4 (pre-training), 4e-5 (fine-tuning)
learning_rate_retriever: 1e-5 (pre-training)
batch_size: 64 (ablation), 128 (final models)
retrieved_documents_k: 20
optimizer: AdamW

Compute: Joint training requires periodically re-indexing or re-ranking documents; refreshing the index every 1000-2500 steps adds ~30% overhead compared to training with a fixed retriever

Comparison to Prior Work

vs. PaLM/GPT-3: Atlas uses retrieval to offload knowledge, achieving better few-shot performance with far fewer parameters (11B vs 540B/175B)
vs. REALM: Atlas uses a seq2seq reader (T5) and dense Contriever instead of extractive BERT; demonstrates few-shot capabilities not shown in REALM
vs. Retro: Atlas focuses on few-shot joint training and distinct document processing (FiD) rather than chunk-level integration in the attention mechanism
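
As a quick check on the parameter ratios quoted in this summary, using the reader sizes given above and ignoring the comparatively small BERT-base retriever (an assumption of this calculation):

```latex
\frac{540\,\mathrm{B}}{11\,\mathrm{B}} \approx 49.1
\qquad
\frac{175\,\mathrm{B}}{11\,\mathrm{B}} \approx 15.9
```

These correspond to the rounded "50x" (vs. PaLM) and "15x" (vs. GPT-3) figures.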

Limitations

Computational overhead of refreshing the document index during training
Depends on the quality and coverage of the retrieval corpus (Wikipedia/Common Crawl)
Inference is slower than closed-book models due to the retrieval step and processing multiple documents
Potential for the model to ignore retrieved context if pre-training isn't carefully balanced

Reproducibility

Code: https://github.com/facebookresearch/atlas

Code, pretrained Atlas checkpoints, and supporting data are publicly available at https://github.com/facebookresearch/atlas. The paper uses standard T5 and BERT architectures but requires significant compute for pre-training and managing the dense index.

📊 Experiments & Results

Evaluation Setup

Few-shot (k=64, k=5, etc.) and full-dataset fine-tuning on knowledge-intensive tasks

Benchmarks:

MMLU (Multiple-choice QA, 57 domains)
Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
FEVER (Fact Checking)
KILT (Suite of knowledge-intensive tasks)

Metrics:

Exact Match (EM) accuracy
Accuracy
Statistical methodology: Not explicitly reported in the paper
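
For reference, Exact Match is usually computed after light answer normalization. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace); the summary does not specify which normalization was used, so treat these steps and the function names as assumptions.

```python
import re
import string
from typing import List

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the prediction matches any reference after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def em_accuracy(predictions: List[str], references: List[List[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = em_accuracy(
    ["The North Atlantic Ocean.", "Rome"],
    [["north atlantic ocean"], ["Madrid"]],
)
```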

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Atlas demonstrates superior few-shot performance compared to much larger closed-book models on Natural Questions.
Natural Questions (64-shot)	Accuracy	39.6	42.4	+2.8
Natural Questions (Full dataset)	Exact Match	43.5	64.0	+20.5
Performance on MMLU shows Atlas competing with GPT-3 in few-shot settings.
MMLU (5-shot)	Accuracy	43.9	43.4	-0.5
Natural Questions (64-shot)	Exact Match	39.9	45.0	+5.1
KILT Avg (64-shot)	Average score	26.5	43.0	+16.5

Main Takeaways

Joint pre-training of retriever and reader is critical for few-shot performance; models with fixed retrievers perform significantly worse.
Perplexity Distillation (PDist) is an effective and stable objective for training the retriever using signals from the language model.
Query-side fine-tuning is an efficient alternative to full fine-tuning for few-shot settings, matching performance without needing costly index updates.
Retrieval augmentation allows smaller models (11B) to outperform massive dense models (540B) on knowledge-intensive tasks, suggesting memory can be decoupled from reasoning.

📚 Prerequisite Knowledge

Prerequisites

Dense retrieval (Dual Encoders)
Sequence-to-sequence models (Encoder-Decoder)
Contrastive learning
Knowledge Distillation

Key Terms

Contriever: A dense information retrieval model based on continuous embeddings, pre-trained using contrastive learning

Fusion-in-Decoder (FiD): A sequence-to-sequence architecture where the encoder processes retrieved documents independently, and the decoder attends to their concatenated representations

Perplexity Distillation: A training objective where the retriever minimizes the KL-divergence between its document distribution and the language model's posterior distribution over documents

ADist: Attention Distillation—using the language model's cross-attention scores as supervision to train the retriever

EMDR2: End-to-end training of Multi-Document Reader and Retriever—an algorithm treating retrieved documents as latent variables to maximize the likelihood of the answer

MMLU: Massive Multitask Language Understanding, a benchmark covering 57 subjects across STEM, the humanities, and the social sciences

KILT: Knowledge-Intensive Language Tasks—a benchmark suite requiring external knowledge (Wikipedia) to solve tasks like QA and fact checking

FEVER: Fact Extraction and VERification—a dataset for fact-checking claims against evidence

query-side fine-tuning: Updating only the query encoder parameters while keeping the document encoder fixed to avoid costly index re-computation