TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining

📝 Paper Summary

Continual Pretraining, Lifelong Learning, Temporal Distribution Shift

TiC-LM introduces a massive time-stratified benchmark derived from 114 Common Crawl dumps to evaluate how well LLMs can be continually updated on evolving web data without catastrophic forgetting.

Core Problem

Current LLMs suffer from knowledge cutoffs and require expensive re-training from scratch to update, while existing continual learning benchmarks are too small (single domain) or lack long-term temporal shifts to model realistic web-scale evolution.

Why it matters:

Retraining LLMs from scratch for every update is prohibitively expensive in terms of compute and energy
Existing benchmarks focus on single domains (e.g., Wikipedia) or few timesteps, failing to capture the complex distribution shifts of the general web over more than a decade
Models deteriorate on new data due to knowledge cutoffs, but simply fine-tuning on new data causes catastrophic forgetting of older knowledge

Concrete Example: A model trained on data up to 2016 performs well on NumPy (released 1995) but fails on PyTorch (released 2016). Continual training on 2017+ data might learn PyTorch but 'forget' NumPy details unless specific replay strategies are used.

Key Novelty

TiC-CC: A Web-Scale Time-Stratified Dataset & Benchmark

Constructs a 2.9 trillion token dataset from 114 monthly Common Crawl dumps (2013–2024), preserving strict temporal causality (no future data leakage)
Establishes a 10+ year experimental setup where models must update incrementally month-by-month, mirroring a realistic lifelong learning scenario
Introduces dynamic domain-specific evaluations (TiC-Wiki, TiC-StackExchange, TiC-CodeDocs) to measure how forgetting varies across stable vs. rapidly evolving knowledge

Architecture

The TiC-LM benchmark construction and evaluation pipeline.

Evaluation Highlights

Continual pretraining with replay and learning rate schedules matches the performance of re-training from scratch (Oracles) while requiring 2.6x less compute
On general web data (TiC-CC), replay is essential; without it, models suffer significant catastrophic forgetting of older dumps
Forgetting is domain-dependent: Replay hurts performance on rapidly evolving topics like PyTorch (where old data is obsolete) but helps on stable topics like NumPy

Breakthrough Assessment

9/10

Scales up continual learning research by more than an order of magnitude (2.9T tokens vs prior ~100B-token benchmarks) and provides the first realistic web-scale testbed for lifelong LLM training.

⚙️ Technical Details

Problem Definition

Setting: Continual Pretraining on a sequence of time-stratified web data dumps

Inputs: Stream of text data partitioned by month (D_1, D_2, ..., D_T)

Outputs: Updated language model parameters theta_t after each timestep t

Pipeline Flow

Temporal Data Splitting (114 months)
Initial Pretraining (Month 1)
Continual Updates (Months 2-114)
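
The three stages above can be sketched as a chronological loop. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `run_continual_pipeline`, `monthly_data`, and `train_step` are hypothetical names, and months are assumed to be keyed by "YYYY-MM" strings. The point it shows is that each update only ever sees the current month plus strictly earlier months.

```python
from typing import Callable, Dict, List


def run_continual_pipeline(
    monthly_data: Dict[str, List[str]],
    train_step: Callable[[List[str], List[str]], None],
) -> List[str]:
    """Visit months in chronological order and return the order processed.

    monthly_data maps "YYYY-MM" keys to that month's documents.
    train_step(new_docs, past_docs) is a hypothetical model-update hook.
    """
    months = sorted(monthly_data)  # "YYYY-MM" strings sort chronologically
    seen: List[str] = []
    for month in months:
        # Only months strictly before the current one are offered as replay
        # candidates, so no future data can leak into an update.
        past_docs = [doc for m in seen for doc in monthly_data[m]]
        train_step(monthly_data[month], past_docs)
        seen.append(month)
    return seen
```

With the first month the past pool is empty, which corresponds to the initial pretraining stage; every later call corresponds to one continual update.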

System Modules

Data Processor

Process Common Crawl dumps into time-stratified splits

Model or implementation: Resiliparse + RefinedWeb filters

Continual Learner

Update model weights using new monthly data + optional replay

Model or implementation: Transformer (OpenLM)

Novel Architectural Elements

Time-stratified data pipeline capable of handling 2.9T tokens across 114 distinct time steps without future leakage

Modeling

Base Model: Decoder-only Transformer (1B and 3B parameters)

Training Method: Continual Pretraining (Standard Autoregressive Modeling)

Objective Functions:

Purpose: Minimize negative log-likelihood of the next token.

Formally: Standard Cross-Entropy Loss.
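
Written out for a token sequence x_1, ..., x_N and model parameters theta (the notation is chosen here for illustration and is not taken from the paper), the standard next-token cross-entropy objective is:

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)
```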

Training Data:

TiC-CC: 2.9T tokens total from Common Crawl (2013-2024)
Initial pretraining on first month (May 2013) with 110B tokens
Remaining budget split equally among 113 subsequent months
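
As a worked example of the split described above, the sketch below computes the per-month budget. It assumes the 220B total pairs with the 110B initial phase (the summary does not state how the 440B runs divide their budget), and `per_month_budget` is a hypothetical helper name.

```python
def per_month_budget(total_tokens: float, initial_tokens: float, num_updates: int) -> float:
    """Tokens available to each continual update after the initial phase."""
    if initial_tokens > total_tokens:
        raise ValueError("initial phase cannot exceed the total budget")
    if num_updates < 1:
        raise ValueError("need at least one continual update")
    return (total_tokens - initial_tokens) / num_updates


# Assumed pairing: 220B total, 110B spent on the first month, 113 later months.
budget_220b = per_month_budget(220e9, 110e9, 113)
print(f"{budget_220b / 1e9:.2f}B tokens per month")  # prints "0.97B tokens per month"
```

Under these assumptions each later month receives just under 1B tokens, which is small relative to the initial phase.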

Key Hyperparameters:

global_batch_size: Not explicitly reported in the paper
learning_rate: 1e-4 (max for cyclic schedules)
sequence_length: 2048
replay_ratio_alpha: 0.5 (fixed) or 1/t (decaying)
warmup_steps: 2000 (implied from typical OpenLM configs, specific value for cyclic runs varies)
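
To make the two replay settings concrete, here is a minimal sketch of how one month's token budget might be divided. It assumes alpha is the share given to the newest month and that the remainder is spread evenly over all earlier months; both the convention and the name `replay_allocation` are assumptions for illustration, not confirmed details of the paper's implementation.

```python
from typing import Dict


def replay_allocation(t: int, month_budget: float, alpha_mode: str = "fixed") -> Dict[str, float]:
    """Split one update's budget between new data and replay.

    t is the 1-indexed timestep. Assumption: alpha is the fraction of the
    budget spent on the newest month; (1 - alpha) is shared equally by the
    t - 1 earlier months.
    """
    if t < 1:
        raise ValueError("t must be >= 1")
    if alpha_mode not in ("fixed", "decay"):
        raise ValueError("alpha_mode must be 'fixed' or 'decay'")
    if t == 1:
        # Nothing to replay yet: the whole budget goes to the first month.
        return {"alpha": 1.0, "new": month_budget, "replay_per_past_month": 0.0}
    alpha = 0.5 if alpha_mode == "fixed" else 1.0 / t
    replay_total = (1.0 - alpha) * month_budget
    return {
        "alpha": alpha,
        "new": alpha * month_budget,
        "replay_per_past_month": replay_total / (t - 1),
    }
```

Under this convention, alpha = 1/t gives every month seen so far, including the newest, an equal share of the update, while alpha = 0.5 keeps half the budget on the newest month however long the history grows.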

Compute: 220B or 440B total training tokens (4x or 8x Chinchilla optimal). Specific GPU hours not reported.
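
The "4x or 8x" figures can be checked with quick arithmetic. The sketch below assumes the multiples refer to the 3B model and uses the approximate 20-tokens-per-parameter Chinchilla ratio; `chinchilla_multiple` is a hypothetical helper.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla-optimal ratio


def chinchilla_multiple(train_tokens: float, num_params: float) -> float:
    """How many times the Chinchilla-optimal token count a run uses."""
    return train_tokens / (TOKENS_PER_PARAM * num_params)


# Assumption: the reported multiples are for the 3B-parameter model.
for tokens in (220e9, 440e9):
    multiple = chinchilla_multiple(tokens, 3e9)
    print(f"{tokens / 1e9:.0f}B tokens -> {multiple:.2f}x Chinchilla")
```

Under these assumptions the exact multiples are about 3.67x and 7.33x, which round to the 4x and 8x quoted above.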

Comparison to Prior Work

vs. TemporalWiki: TiC-LM uses general web data (CC) vs. just Wikipedia, and spans 114 months vs. ~10 [not cited in paper]
vs. Standard Pretraining: TiC-LM evaluates incremental updates vs. one-off training
vs. TiC-CLIP: TiC-LM focuses on text-only LLMs vs. vision-language models

Limitations

Current experiments limited to 3B parameter models due to compute constraints (though dataset supports larger)
Replay buffer selection is naive (random sampling from past); does not use active selection
Does not explore cross-month deduplication extensively in main experiments
Evaluation is perplexity-heavy; fewer downstream task evaluations compared to static benchmarks

Reproducibility

Code: https://github.com/apple/ml-tic-lm

The repository includes the data processing pipeline (based on DataComp-LM) and evaluation scripts. The full 2.9T-token dataset is reconstructible from Common Crawl using the provided scripts. Models are standard OpenLM architectures.

📊 Experiments & Results

Evaluation Setup

Time-stratified perplexity evaluation and static zero-shot tasks

Benchmarks:

TiC-CC (Web text modeling (Perplexity)) [New]
TiC-Wiki (Diff/Unchanged) (Knowledge retention/update (Perplexity on proper nouns)) [New]
TiC-StackExchange (Technical Q&A modeling (Answer Perplexity)) [New]
TiC-CodeDocs (Code documentation modeling (NumPy/PyTorch)) [New]

Metrics:

In-Distribution (ID) Performance
Backward Transfer (Retaining old knowledge)
Forward Transfer (Zero-shot generalization to future)
Statistical methodology: Not explicitly reported in the paper
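
One common way to summarize the three metrics above is from a square matrix of scores, where entry [i][j] is the score of the checkpoint trained through timestep i when evaluated on data from timestep j (lower is better). The sketch below assumes that layout and simple unweighted averaging; the paper's exact aggregation may differ, and `transfer_summary` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, List


def transfer_summary(scores: List[List[float]]) -> Dict[str, float]:
    """Average the diagonal, lower triangle, and upper triangle of scores.

    scores[i][j]: checkpoint after timestep i, evaluated on timestep j.
    """
    n = len(scores)
    if n < 2 or any(len(row) != n for row in scores):
        raise ValueError("expected a square matrix with at least two timesteps")
    in_dist = [scores[i][i] for i in range(n)]                           # j == i
    backward = [scores[i][j] for i in range(n) for j in range(i)]        # j < i: older data
    forward = [scores[i][j] for i in range(n) for j in range(i + 1, n)]  # j > i: future data
    return {
        "in_distribution": mean(in_dist),
        "backward": mean(backward),
        "forward": mean(forward),
    }
```

Reading the matrix this way, a method that forgets shows up as large values below the diagonal, and a method that goes stale shows up as large values above it.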

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TiC-CC	Compute (FLOPs/Tokens)	1.16T tokens	440B tokens	-720B tokens
Forgetting analysis on general web data (TiC-CC) shows replay is critical.
TiC-CC	Backward Transfer (Regret)	0.0245	0.0096	-0.0149
TiC-CC	Backward Transfer (Regret)	0.0270	0.0049	-0.0221
Domain-specific results show replay can be harmful for rapidly evolving domains.
TiC-StackOverflow	Backward Transfer	0.000	0.010	+0.010
TiC-StackE-Math	Backward Transfer	0.004	0.000	-0.004

Experiment Figures

Performance (Loss) vs. Compute (FLOPs) for Continual Learning methods vs. Oracle baselines.

Heatmaps of Forward/Backward Transfer on TiC-CC for different methods.

Main Takeaways

Optimal continual learning strategy is domain-dependent: Replay is crucial for general web data (TiC-CC) and stable domains (Math, NumPy), but harmful for rapidly evolving domains (StackOverflow, PyTorch).
Autoregressive (AR) learning rate schedules combined with Replay provide the best balance of stability (remembering old) and plasticity (learning new) for general web data.
TiC-Wiki evaluations suggest factual knowledge might be learned 'late' compared to its timestamp, possibly due to the lag in Wikipedia updates appearing in Common Crawl.
Simply scaling up the model (1B to 3B) improves performance but does not solve the fundamental trade-off between plasticity and forgetting; algorithmic interventions like Replay are still necessary.

📚 Prerequisite Knowledge

Prerequisites

Language Model Pretraining (scaling laws, tokenization)
Continual Learning concepts (catastrophic forgetting, replay, backward transfer)
Optimization schedules (Cosine decay, warmup)
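
For readers new to these schedules, the following is a minimal sketch of a cyclic cosine schedule with linear warmup that restarts at every update round. The 1e-4 peak matches the maximum learning rate listed in this summary; the round length, warmup length, and function name are illustrative assumptions, and this is not claimed to be the exact schedule used in the paper.

```python
import math


def cyclic_cosine_lr(
    step: int,
    steps_per_round: int,
    warmup_steps: int,
    max_lr: float = 1e-4,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at a global step, restarting the cycle every round."""
    if warmup_steps < 1 or steps_per_round <= warmup_steps:
        raise ValueError("need 1 <= warmup_steps < steps_per_round")
    local = step % steps_per_round  # position inside the current round
    if local < warmup_steps:
        # Linear warmup from max_lr / warmup_steps up to max_lr.
        return max_lr * (local + 1) / warmup_steps
    # Cosine decay from max_lr toward min_lr over the rest of the round.
    progress = (local - warmup_steps) / (steps_per_round - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Each restart raises the learning rate again, which helps the model absorb new data but also makes it easier to overwrite older knowledge; that tension is the stability versus plasticity trade-off discussed in the takeaways.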

Key Terms

Common Crawl (CC): A large, open repository of web crawl data, published as periodic dumps and widely used as a source of LLM pretraining data

Catastrophic Forgetting: The tendency of neural networks to abruptly forget previously learned information upon learning new information

Replay: A continual learning strategy where a portion of the training budget is allocated to data from previous time steps to prevent forgetting

Oracle: A baseline model re-trained from scratch on all available data up to a certain point, serving as a strong reference point for continual methods rather than a guaranteed upper bound

Backward Transfer: A metric measuring how well a model trained on newer data performs on older, previously seen data evaluations

Forward Transfer: A metric measuring how well a model trained on older data performs on future, unseen data

Chinchilla optimal: A compute-optimal ratio of training tokens to model parameters (approx. 20 tokens per parameter)

Perplexity (ppl): A measurement of how well a probability model predicts a sample; lower values indicate better performance
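
Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming natural-log token probabilities are already available (the function name is illustrative):

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

For instance, a model that assigns probability 0.25 to every token has a perplexity of 4, as if it were choosing uniformly among four options at each step.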

EWC: Elastic Weight Consolidation—a regularization method that slows down updates to parameters important for previous tasks
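
In its commonly cited form, the EWC objective adds a quadratic penalty to the loss on new data, weighted by a per-parameter importance estimate F_i (the diagonal of the Fisher information) around the previous solution theta*. The symbols here follow the usual presentation of EWC rather than anything specific to this paper:

```latex
\mathcal{L}_{\mathrm{EWC}}(\theta) = \mathcal{L}_{\mathrm{new}}(\theta) + \frac{\lambda}{2} \sum_{i} F_i \left(\theta_i - \theta_i^{*}\right)^2
```

A larger lambda keeps parameters closer to their earlier values, trading plasticity for stability.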

LwF: Learning without Forgetting—a regularization method using knowledge distillation to preserve original model behavior