In-Context Pretraining: Language modeling beyond document boundaries

📝 Paper Summary

Language Model Pretraining Data Curation / Data Filtering

In-Context Pretraining reorders pretraining data into sequences of semantically related documents, enabling models to learn reasoning across document boundaries without changing model architecture.

Core Problem

Standard pretraining concatenates random, unrelated documents to fill context windows, providing no learning signal for predicting across document boundaries.

Why it matters:

Current models struggle with complex contexts that require reasoning over the documents they are conditioned on, even when their context windows are long
Random concatenation spends attention compute across unrelated documents, whose tokens have no need to communicate with each other
Prior documents in a random sequence provide no signal for predicting the next document, wasting the potential of long-context attention

Concrete Example: When predicting tokens for 'For 2022, FIFA set the prize money at $42m', a standard model sees a random preceding document (e.g., about cooking). An in-context pretrained model sees a relevant document stating 'World Cup never awarded more than $10M before 2022', enabling it to predict 'the highest so far'.

Key Novelty

In-Context Pretraining (ICLM)

Reorders the massive pretraining corpus so that input contexts consist of sequences of semantically related documents rather than random concatenations
Uses a dense retrieval model to find nearest neighbors for every document in the corpus to build a document graph
Formulates document sorting as a Traveling Salesman Problem to create coherent chains of related documents without repeating data

Evaluation Highlights

+15% relative improvement on average across 8 reading comprehension tasks (e.g., HotpotQA, SQuAD) over standard pretraining baselines (average score 42.0 to 48.2)
+8% relative improvement in average accuracy on 8 in-context learning datasets (including SST-2, AGNews) with 32 demonstration examples (69.0 to 74.8)
+16% relative improvement in faithfulness to prior contexts on MemoTrap (50.1 to 58.1), reducing hallucinations when conditioned on retrieved documents

Breakthrough Assessment

8/10

A simple yet highly effective intervention in the pretraining pipeline. It achieves significant gains on reasoning tasks purely by changing the order of the training data, and it scales to billions of tokens.

⚙️ Technical Details

Problem Definition

Setting: Language Model Pretraining on large-scale corpora

Inputs: A set of documents D from a pretraining corpus (e.g., CommonCrawl)

Outputs: A sequence of input contexts C_1 ... C_m where each context contains a list of semantically related documents
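
As a rough illustration of how an ordered document list could become the contexts C_1 ... C_m, the sketch below concatenates tokenized documents in their sorted order and cuts the token stream into fixed-size windows. The function name `pack_contexts`, the separator id `sep_id`, and the choice to drop the trailing partial window are assumptions made for this example, not details taken from the paper.

```python
from typing import Iterable, List


def pack_contexts(
    ordered_docs: Iterable[List[int]], context_length: int, sep_id: int
) -> List[List[int]]:
    """Concatenate tokenized documents in order and cut into fixed-size contexts.

    A document that does not fit in the current context is split across
    consecutive contexts.
    """
    contexts: List[List[int]] = []
    buffer: List[int] = []
    for doc in ordered_docs:
        buffer.extend(doc)
        buffer.append(sep_id)  # hypothetical document-separator token
        while len(buffer) >= context_length:
            contexts.append(buffer[:context_length])
            buffer = buffer[context_length:]
    # The trailing partial buffer is dropped in this sketch.
    return contexts


# Toy example: three already-sorted "documents" of token ids.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
contexts = pack_contexts(docs, context_length=4, sep_id=0)
print(contexts)
```

With a context length of 8192, as in the hyperparameters listed further down, neighboring documents on the sorted path would typically share a context.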

Pipeline Flow

Graph Construction: Build document graph using retrieval embeddings
Path Finding: Traverse graph to create coherent document sequences
Pretraining: Train LM on reordered sequences

System Modules

Document Graph Construction (Data Preparation)

Identify related documents for every document in the corpus

Model or implementation: Contriever (encoder) + FAISS (index)
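
To make the graph construction step concrete, here is a minimal sketch that links each document to its top-k most similar documents by cosine similarity. It is a brute-force, pure-Python stand-in: the summary names Contriever for embeddings and FAISS for search, and neither is used here. The toy embeddings and the names `cosine` and `build_knn_graph` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_knn_graph(
    embeddings: Sequence[Sequence[float]], k: int
) -> Dict[int, List[Tuple[int, float]]]:
    """Map each document id to its k most similar documents (id, similarity)."""
    graph: Dict[int, List[Tuple[int, float]]] = {}
    for i, u in enumerate(embeddings):
        scored = [(j, cosine(u, v)) for j, v in enumerate(embeddings) if j != i]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = scored[:k]
    return graph


# Toy embeddings: documents 0 and 1 point one way, 2 and 3 another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_knn_graph(toy_embeddings, k=1)
print({doc: [n for n, _ in nbrs] for doc, nbrs in graph.items()})
```

The all-pairs comparison is quadratic in the number of documents, which is why an approximate nearest neighbor index is needed at the scale of hundreds of millions of documents.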

Graph Traversal (Sorting) (Data Preparation)

Sort documents into a sequence maximizing similarity between adjacent documents without repetition

Model or implementation: Greedy Approximation Algorithm for Maximum Traveling Salesman Problem
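
The traversal can be sketched as a greedy walk over the document graph. This is one plausible reading of a greedy approximation, not necessarily the authors' exact procedure: start from the unvisited document with the fewest unvisited neighbors, keep moving to the most similar unvisited neighbor, and start a new segment when the walk gets stuck. The name `greedy_path`, the tie-breaking rules, and the toy graph are assumptions.

```python
from typing import Dict, List


def greedy_path(adj: Dict[int, Dict[int, float]]) -> List[int]:
    """Visit every node exactly once, preferring high-similarity edges.

    adj maps a node to {neighbor: similarity}; it is assumed to be symmetric.
    """
    unvisited = set(adj)
    path: List[int] = []

    def remaining_degree(node: int) -> int:
        return sum(1 for m in adj[node] if m in unvisited)

    while unvisited:
        # Start a new segment at the node with the fewest unvisited
        # neighbors (ties broken by the smaller id).
        current = min(unvisited, key=lambda n: (remaining_degree(n), n))
        while current is not None:
            path.append(current)
            unvisited.discard(current)
            candidates = [(w, m) for m, w in adj[current].items() if m in unvisited]
            if candidates:
                # Highest similarity first; smaller id breaks ties.
                current = max(candidates, key=lambda c: (c[0], -c[1]))[1]
            else:
                current = None
    return path


toy_adj = {
    0: {1: 0.9, 2: 0.2},
    1: {0: 0.9, 2: 0.8},
    2: {0: 0.2, 1: 0.8, 3: 0.7},
    3: {2: 0.7},
    4: {},  # isolated document
}
path = greedy_path(toy_adj)
print(path)
```

Starting from low-degree nodes means poorly connected documents are placed early, while they can still attach to a neighbor, instead of being stranded at the end of the path.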

Language Model

Predict next tokens conditioned on the reordered long contexts

Model or implementation: LLaMA architecture (up to 7B parameters)

Novel Architectural Elements

In-Context Pretraining data pipeline: Pre-computation of a global document traversal path to maximize local context coherence without data repetition

Modeling

Base Model: LLaMA architecture (0.3B to 7B parameters)

Training Method: Pretraining from scratch

Training Data:

235 million documents sampled from English CommonCrawl
306 billion tokens total
Deduplicated using retrieval scores
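
The summary only states that deduplication uses retrieval scores. One way this could work, shown purely as an assumption, is to drop any document whose similarity to an already-kept neighbor exceeds a threshold. The threshold value, the processing order, and the name `drop_near_duplicates` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def drop_near_duplicates(
    graph: Dict[int, List[Tuple[int, float]]], threshold: float
) -> Set[int]:
    """Return the ids of documents to keep.

    graph maps a document id to (neighbor id, similarity) pairs. A document
    is dropped when one of its neighbors has already been kept and their
    similarity is at or above the threshold.
    """
    kept: Set[int] = set()
    for doc in sorted(graph):
        is_duplicate = any(
            neighbor in kept and score >= threshold
            for neighbor, score in graph[doc]
        )
        if not is_duplicate:
            kept.add(doc)
    return kept


# Toy graph: documents 0 and 1 are near-identical, document 2 is distinct.
toy_graph = {
    0: [(1, 0.99), (2, 0.40)],
    1: [(0, 0.99)],
    2: [(0, 0.40)],
}
kept_docs = drop_near_duplicates(toy_graph, threshold=0.95)  # hypothetical threshold
print(sorted(kept_docs))
```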

Key Hyperparameters:

context_length: 8192
optimizer: AdamW
beta_1: 0.9
beta_2: 0.95
batch_size: 4 million tokens (for 7B model)
learning_rate_schedule: cosine
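
For readers unfamiliar with the cosine schedule, the sketch below shows its usual shape and derives a step count from the figures above. The peak learning rate, minimum learning rate, and warmup length are hypothetical placeholders, since this summary does not list them. The step count assumes a single pass over 306 billion tokens at exactly 4,000,000 tokens per batch.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
    warmup_steps: int = 0,
) -> float:
    """Linear warmup followed by cosine decay from peak_lr to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Derived from the listed figures under the single-pass assumption.
total_steps = 306_000_000_000 // 4_000_000

PEAK_LR = 3e-4  # hypothetical value, not taken from the summary
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, PEAK_LR))
```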

Compute: 7B model: 128 A100 GPUs for 9 days. Retrieval search: 6 hours on 32 GPUs. Graph traversal: 12 hours on 20 CPUs.
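
A quick back-of-the-envelope calculation puts the data preparation cost in proportion to training. It assumes every device was in continuous use for the stated wall-clock time, which the summary does not confirm.

```python
# Figures as stated in the compute summary above.
training_gpu_hours = 128 * 9 * 24   # 128 A100 GPUs for 9 days
retrieval_gpu_hours = 32 * 6        # 32 GPUs for 6 hours
traversal_cpu_hours = 20 * 12       # 20 CPUs for 12 hours

# Share of GPU time spent on retrieval relative to training the 7B model.
retrieval_share = retrieval_gpu_hours / training_gpu_hours

print(f"training:  {training_gpu_hours} GPU-hours")
print(f"retrieval: {retrieval_gpu_hours} GPU-hours ({retrieval_share:.2%} of training)")
print(f"traversal: {traversal_cpu_hours} CPU-hours")
```

Under these assumptions the retrieval search costs well under 1% of the GPU time used to train the 7B model.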

Comparison to Prior Work

vs. Standard: ICLM sorts documents by semantic similarity rather than random order
vs. kNN: ICLM ensures every document is seen exactly once (via TSP formulation) whereas kNN causes data repetition and overfitting to popular documents
vs. RAM / REALM [not cited in paper]: ICLM is a pretraining data organization strategy, not a retrieval-augmented architecture that retrieves at inference time
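
The repetition issue with kNN contexts can be seen on a toy graph. In the sketch below, each kNN context is a document followed by its top-k neighbors, so a document that is a popular neighbor appears many times, whereas a single path through the graph uses each document once. The neighbor lists and helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def knn_contexts(neighbors: Dict[int, List[int]], k: int) -> List[List[int]]:
    """One context per document: the document plus its top-k neighbors."""
    return [[doc] + neighbors[doc][:k] for doc in sorted(neighbors)]


def appearance_counts(contexts: List[List[int]]) -> Counter:
    """Count how many times each document appears across all contexts."""
    return Counter(doc for ctx in contexts for doc in ctx)


# Document 0 is the nearest neighbor of every other document.
toy_neighbors = {0: [1], 1: [0], 2: [0], 3: [0]}

knn_counts = appearance_counts(knn_contexts(toy_neighbors, k=1))
# A path visits every document once; here it is cut into two contexts.
path_counts = appearance_counts([[0, 1], [2, 3]])

print("kNN: ", dict(knn_counts))
print("path:", dict(path_counts))
```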

Limitations

Requires efficient nearest neighbor search over the entire pretraining corpus, which is computationally heavy for massive datasets
Exact TSP is NP-hard; relies on greedy approximations which may be suboptimal
Benefits primarily observed in tasks requiring complex contextual reasoning; gains on simple tasks may be smaller
Graph construction assumes static corpus; harder to update incrementally than random shuffling

Reproducibility

Code: https://github.com/swj0419/in-context-pretraining

The code is publicly available at the repository above. The training data comes from public CommonCrawl, though reproducing the specific 235M-document subset requires the provided indices. The method uses the standard LLaMA architecture and the Contriever model.

📊 Experiments & Results

Evaluation Setup

Pretraining LMs from scratch on roughly 300B tokens and evaluating them on downstream tasks

Benchmarks:

Language Modeling (Perplexity): next token prediction on Wikipedia, Arxiv, Books
In-Context Learning: classification (SST-2, AGNews, etc.) with 32 shots
Reading Comprehension: QA (RACE, SQuAD, HotpotQA, BoolQ, DROP)
Retrieval Augmentation: open-domain QA (Natural Questions, TriviaQA)

Metrics:

Perplexity
Accuracy
Exact Match (EM)
Faithfulness (MemoTrap)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Reading Comprehension results show ICLM significantly outperforming baselines, especially on multi-hop tasks like HotpotQA.
HotpotQA	Exact Match	17.4	23.6	+6.2
RACE-High	Accuracy	44.6	47.2	+2.6
Average (8 datasets)	Score (Acc/EM)	42.0	48.2	+6.2
In-Context Learning evaluation demonstrates ICLM's superior ability to learn from demonstrations.
Average (8 classification datasets)	Accuracy	69.0	74.8	+5.8
Retrieval Augmentation results show ICLM is better at utilizing external context.
Natural Questions (Open-Book)	Exact Match	18.8	21.6	+2.8
Faithfulness evaluation using MemoTrap.
MemoTrap	Accuracy	50.1	58.1	+8.0
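
The Δ column reports absolute point differences, while the percentages under Evaluation Highlights are consistent with relative improvements over the baseline. The short check below recomputes the relative gains from the table values; treating the highlights as relative is an inference from the numbers, not something the summary states explicitly.

```python
# (baseline, this paper) pairs copied from the Key Results table.
results = {
    "Reading comprehension (avg of 8)": (42.0, 48.2),
    "In-context learning (avg of 8)": (69.0, 74.8),
    "MemoTrap faithfulness": (50.1, 58.1),
}


def relative_gain_percent(baseline: float, new: float) -> float:
    """Relative improvement of new over baseline, in percent."""
    return (new - baseline) / baseline * 100.0


relative = {name: relative_gain_percent(b, n) for name, (b, n) in results.items()}
for name, gain in relative.items():
    print(f"{name}: {gain:.1f}% relative")
```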

Experiment Figures

Language modeling perplexity on Wikipedia, Arxiv, and Books for Standard, kNN, and ICLM models across different parameter scales.

Main Takeaways

In-Context Pretraining consistently outperforms standard training and kNN baselines across model scales (0.3B to 7B) and diverse tasks.
The method is particularly effective for tasks requiring complex reasoning over context (Reading Comprehension) and utilization of provided demonstrations (In-Context Learning).
kNN pretraining (allowing repeats) often underperforms standard training due to overfitting/lack of diversity, validating the need for the non-repeating TSP graph traversal approach.
Deduplication during graph construction is crucial; without it, performance drops significantly.

📚 Prerequisite Knowledge

Prerequisites

Language Model Pretraining (Next Token Prediction)
Dense Retrieval / Nearest Neighbor Search
Graph Theory (Traveling Salesman Problem)

Key Terms

Contriever: A dense retrieval model used to embed documents and find semantic similarities

FAISS: A library for efficient similarity search and clustering of dense vectors

Traveling Salesman Problem (TSP): An algorithmic problem of finding the best route that visits every node exactly once; here the maximum-weight variant is used, ordering documents so that adjacent ones are as similar as possible

Perplexity: A measurement of how well a probability model predicts a sample; lower is better

In-context Learning: The ability of a model to perform a task given a few examples in the prompt without parameter updates

kNN: k-Nearest Neighbors, a baseline method where contexts are formed by a document and its top-k similar documents (allowing repetition)

Flash Attention: An algorithm to compute exact attention with fewer memory accesses, speeding up training on long sequences

Zero-shot / Few-shot: Testing a model with zero or a small number of example inputs in the prompt
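
The perplexity definition above can be made concrete with a few lines: perplexity is the exponential of the average negative log-likelihood per token. The function name and the toy probabilities are illustrative only.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token is as uncertain
# as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 4)
print(uniform_ppl)
```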