
In-Context Pretraining: Language modeling beyond document boundaries

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Gergely Szilvasy, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis
Meta AI, University of Washington, Allen Institute for AI
ICLR (2024)
Pretraining RAG QA Reasoning

📝 Paper Summary

Language Model Pretraining Data Curation / Data Filtering
In-Context Pretraining reorders pretraining data into sequences of semantically related documents, enabling models to learn reasoning across document boundaries without changing model architecture.
Core Problem
Standard pretraining concatenates random, unrelated documents to fill context windows, providing no learning signal for predicting across document boundaries.
Why it matters:
  • Current models struggle to reason jointly over multiple conditioned documents, despite having long context windows
  • Random concatenation wastes compute on attention between tokens of unrelated documents, which have no need to communicate
  • Prior documents in a random sequence provide no signal for predicting the next document, wasting the potential of long-context attention
Concrete Example: When predicting tokens for 'For 2022, FIFA set the prize money at $42m', a standard model sees a random preceding document (e.g., about cooking). An in-context pretrained model sees a relevant document stating 'World Cup never awarded more than $10M before 2022', enabling it to predict 'the highest so far'.
Key Novelty
In-Context Pretraining (ICLM)
  • Reorders the massive pretraining corpus so that input contexts consist of sequences of semantically related documents rather than random concatenations
  • Uses a dense retrieval model to find nearest neighbors for every document in the corpus to build a document graph
  • Formulates document sorting as a Traveling Salesman Problem to create coherent chains of related documents without repeating data
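The ordering step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes toy document embeddings in place of a real dense retriever, and uses a greedy nearest-neighbor walk as a simple approximation to the paper's Traveling Salesman formulation (visit every document exactly once, always stepping to the most similar unvisited document).

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def order_documents(embeddings, start=0):
    """Greedy nearest-neighbor traversal over the document graph.

    Produces a single path visiting each document once, so related
    documents end up adjacent and no document is repeated -- an
    approximation of the TSP-based sorting described in the paper.
    """
    unvisited = set(range(len(embeddings)))
    current = start  # arbitrary start; a hypothetical choice, not the paper's heuristic
    path = [current]
    unvisited.remove(current)
    while unvisited:
        # Step to the most semantically similar document not yet placed.
        current = max(unvisited,
                      key=lambda j: cosine(embeddings[current], embeddings[j]))
        path.append(current)
        unvisited.remove(current)
    return path

# Toy embeddings: documents 0 and 2 are semantically close, 1 is unrelated.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(order_documents(docs))  # → [0, 2, 1]
```

The resulting path is then chunked into context-window-sized sequences, so each training context holds a chain of related documents rather than a random concatenation.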
Evaluation Highlights
  • +15% average improvement on 8 reading comprehension tasks (e.g., HotpotQA, SQuAD) compared to standard pretraining baselines
  • +8% average accuracy increase on 8 in-context learning datasets (including SST-2, AGNews) using 32 demonstration examples
  • +16% improvement in faithfulness to prior contexts, reducing hallucinations when conditioned on retrieved documents
Breakthrough Assessment
8/10
Simple yet highly effective intervention in the pretraining pipeline. Significant gains on reasoning tasks solely by changing data order, with no architecture changes, and the approach scales to corpora of billions of tokens.