Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

📝 Paper Summary

LLM Finetuning, Scaling Laws, Catastrophic Forgetting

Injecting as little as 1% of pretraining data during finetuning prevents forgetting, and the resulting pretraining loss follows a precise multiplicative scaling law in model size, finetuning dataset size, and mixture fraction.

Core Problem

Finetuning LLMs on small target datasets causes two major issues: rapid overfitting to the target domain and catastrophic forgetting of general pretraining knowledge.

Why it matters:

Specialized models are essential for specific tasks, but losing general capabilities (forgetting) limits their versatility and robustness
Current practices for mixing pretraining data are heuristic; practitioners lack a principled way to determine the optimal mixture ratio
Existing scaling laws focus on pretraining or simple finetuning performance, ignoring the precise dynamics of forgetting and data mixing

Concrete Example: When finetuning a Small (109M) model on the 'Arxiv' domain, the pretraining loss increases significantly (forgetting). However, injecting just p=1% of pretraining data keeps the pretraining loss nearly flat, preserving general knowledge without hurting target performance.

Key Novelty

Scaling Law for Forgetting with Data Injection

Derives a multiplicative scaling law predicting the pretraining loss after finetuning based on model size, finetuning tokens, and injection ratio
Demonstrates that pretraining data injection acts as a regularizer, where effective parameters for the pretraining task scale with (1 + Bp)N
Identifies that forgetting is primarily a capacity allocation issue: smaller models suffer most because they lack spare capacity to maintain old knowledge while learning new tasks
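
The (1 + Bp)N term above can be illustrated with a small calculation. This is a minimal sketch, not the paper's fitted law: the function name `effective_parameters` and the value B = 50 are hypothetical, chosen only to show how the factor behaves. In the paper, B is a fitted, domain-dependent coefficient, and the full form of the law is not reproduced here.

```python
def effective_parameters(n_params: float, p: float, b: float) -> float:
    """Return (1 + B*p) * N, the effective parameter count for the pretraining task.

    n_params: model size N
    p:        fraction of pretraining data mixed into finetuning (0 <= p <= 1)
    b:        coefficient B (hypothetical value in the example below)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return (1.0 + b * p) * n_params


# Small model (109M) with p = 1% and an illustrative, made-up B = 50.
n_small = 109e6
n_eff = effective_parameters(n_small, p=0.01, b=50.0)
print(f"effective parameters: {n_eff / 1e6:.1f}M")
```

With p = 0 the factor reduces to 1, so the effective parameter count equals N; any positive p increases it in proportion to B.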

Architecture

Evolution of Finetuning Validation Loss, Train Loss, and Pretraining Loss (Forgetting) during training iterations for different mixture fractions p.

Evaluation Highlights

Mixing just p=1% of pretraining data effectively halts forgetting across all 12 domains and 5 model scales tested
The proposed scaling law predicts forgetting with a Bootstrapped Mean Relative Error (MRE) of just 0.40% across domains
The smallest model (Tiny, 41M) loses up to 95% of pretraining progress during finetuning, while the largest model (XL, 1.27B) loses only ~20%, confirming capacity dependence

Breakthrough Assessment

8/10

Provides the first precise scaling law quantifying the impact of pretraining data injection on forgetting. The finding that 1% injection is sufficient is a highly practical rule of thumb.

⚙️ Technical Details

Problem Definition

Setting: Full-parameter finetuning of pretrained autoregressive language models on a specific target domain while mixing in a fraction p of pretraining data

Inputs: Pretrained model θ_0, Finetuning dataset D_ft, Pretraining dataset D_pt, Mixture fraction p

Outputs: Finetuned model parameters θ minimizing the mixture loss L_mix = (1-p)L_ft + pL_pt

Pipeline Flow

Pretraining Phase (Standard generic pretraining)
Finetuning Phase (Target domain optimization with data mixing)

System Modules

Pretraining (Training Phases)

Train base model on general corpus to acquire general knowledge

Model or implementation: GPT-2 style transformers (Tiny to XL)

Data Mixer

Create training batches by sampling from D_pt with probability p and D_ft with probability 1-p

Model or implementation: N/A (Data loading logic)
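
The data mixer described above can be sketched in a few lines. This is an illustrative sketch under assumptions: the summary only states that samples come from D_pt with probability p and from D_ft with probability 1-p, so the per-example Bernoulli draw, the function name `mixed_batches`, and the source tags are all hypothetical choices rather than the authors' data loader.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def mixed_batches(
    ft_data: Sequence[str],
    pt_data: Sequence[str],
    p: float,
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches in which each example comes from pt_data with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            # Assumption: the source is drawn independently for every example.
            if rng.random() < p:
                batch.append(("pt", rng.choice(pt_data)))
            else:
                batch.append(("ft", rng.choice(ft_data)))
        yield batch


# Toy usage with placeholder sequences.
for batch in mixed_batches(["ft_a", "ft_b"], ["pt_x", "pt_y"], p=0.01,
                           batch_size=4, num_batches=2):
    print(batch)
```

Drawing the source per example means the realized pretraining fraction in any single batch fluctuates around p; a fixed per-batch quota would be an equally plausible design.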

Finetuning (Training Phases)

Optimize the model on mixed data until the finetuning validation loss reaches its minimum (the bottom of the U-curve)

Model or implementation: Initialized from θ_0
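
Selecting the bottom of the U-curve amounts to picking the evaluation step with the lowest finetuning validation loss. The sketch below assumes validation losses have been logged at regular evaluation points; the function name `best_checkpoint` and the loss values are hypothetical.

```python
from typing import Sequence, Tuple


def best_checkpoint(val_losses: Sequence[float]) -> Tuple[int, float]:
    """Return (index, loss) of the lowest validation loss; ties go to the earliest index."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index, val_losses[best_index]


# Made-up U-shaped curve: the loss falls while the model learns, then rises as it overfits.
curve = [3.0, 2.5, 2.2, 2.3, 2.6]
index, loss = best_checkpoint(curve)
print(f"stop at evaluation {index} with validation loss {loss}")
```

The pretraining loss measured at that same checkpoint is then the quantity the forgetting analysis is concerned with.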

Novel Architectural Elements

None (uses standard Transformer architecture)

Modeling

Base Model: GPT-2 style transformers (5 sizes: 41M, 109M, 334M, 665M, 1.27B)

Training Method: Full-parameter finetuning with constant learning rate

Objective Functions:

Purpose: Minimize next-token prediction error on the mixed data distribution.

Formally: L_mix = E_{x~mix(p)}[ℓ(x, θ)] = (1 - p)L_ft + pL_pt
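
As a quick numeric illustration of the objective above, the following sketch evaluates the mixture loss for given component losses. The function name `mixture_loss` and the loss values are hypothetical; only the formula itself comes from the text.

```python
def mixture_loss(loss_ft: float, loss_pt: float, p: float) -> float:
    """Return (1 - p) * L_ft + p * L_pt for a mixture fraction p in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return (1.0 - p) * loss_ft + p * loss_pt


# Made-up component losses; with p = 1% the pretraining term contributes little
# to the objective value.
example = mixture_loss(loss_ft=2.0, loss_pt=3.0, p=0.01)
print(f"L_mix = {example:.4f}")
```

At p = 0 the objective reduces to plain finetuning, and at p = 1 it reduces to continued pretraining.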

Adaptation: Full fine-tuning (not PEFT/LoRA)

Trainable Parameters: 100% (All parameters)

Training Data:

Pretraining: RedPajamaV2
Finetuning: 12 domains from The Pile (e.g., Arxiv, Github, Wikipedia)
Finetuning sizes: 300K, 900K, 3M, 9M, 30M tokens (log scale)

Key Hyperparameters:

context_length: 1024
vocabulary_size: 32000
optimizer: AdamW (weight decay 0.1)
batch_size: 32 to 128 (varies by model size)
finetuning_learning_rate: 1/30 of peak pretraining LR (constant)
finetuning_steps: 12000
mixture_fractions_p: 0%, 0.1%, 0.5%, 1%, 5%

Compute: Single A100-80GB GPU fits largest model (1.27B). Training runs vary from <30 mins (Medium) to 7 hours (XL) depending on token count.

Comparison to Prior Work

vs. Kalajdzievski (2024): Uses full-parameter tuning across multiple scales (Tiny to XL) instead of PEFT on one model; measures forgetting on pretraining loss rather than transfer tasks.
vs. Zhang et al. (2024): Extends the multiplicative scaling law to explicitly model the forgetting (pretraining loss) as a function of mixing ratio p.
vs. Continual Pretraining methods (Ibrahim et al., 2024): Focuses on the data-scarce finetuning regime (overfitting/U-curve) rather than infinite stream continual learning.
vs. LoRA/Adapters [not cited in paper]: Comparison not made, but paper argues full finetuning is more performant based on Zhang et al.

Limitations

Study restricted to models up to 1.3B parameters; it is unclear whether the laws hold for 100B+ models
Focuses only on next-token prediction loss, not downstream task accuracy or benchmark scores
Uses a constant learning rate for finetuning, which simplifies analysis but may not be optimal
Restricted to 'isocurve' pretraining (D = 100N), not exploring under/over-trained base models

Reproducibility

Code is not provided. Datasets (RedPajamaV2, The Pile) are public. Model architecture details (layers, heads, dims) are fully specified in Table 1. Hyperparameters for pretraining and finetuning are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Finetuning pretrained models on specific Pile domains and measuring loss on both target and pretraining sets

Benchmarks:

The Pile (subsets): language modeling (next-token prediction)

Metrics:

Finetuning Validation Loss (L_ft)
Pretraining Validation Loss (L_pt, proxy for forgetting)
Bootstrapped Mean Relative Error (MRE) of scaling law predictions
Statistical methodology: Bootstrapping (n=128) to estimate MRE and confidence of scaling law fits
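
A bootstrapped MRE of the kind listed above could be computed roughly as follows. This is a simplified sketch under stated assumptions: it resamples (prediction, observation) pairs with replacement and averages the relative error over n = 128 resamples, whereas the paper's procedure may also refit the law on each resample. The function names and the toy numbers are hypothetical.

```python
import random
from typing import Sequence, Tuple


def mean_relative_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean of |prediction - observation| / |observation| over all points."""
    return sum(abs(p - o) / abs(o) for p, o in zip(predicted, observed)) / len(observed)


def bootstrapped_mre(
    predicted: Sequence[float],
    observed: Sequence[float],
    n_boot: int = 128,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the MRE over n_boot bootstrap resamples."""
    rng = random.Random(seed)
    n = len(observed)
    estimates = []
    for _ in range(n_boot):
        # Resample point indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            mean_relative_error([predicted[i] for i in idx], [observed[i] for i in idx])
        )
    mean = sum(estimates) / n_boot
    variance = sum((e - mean) ** 2 for e in estimates) / (n_boot - 1)
    return mean, variance ** 0.5


# Toy data: every prediction is 1% above the observed loss.
observed_losses = [2.0, 2.5, 3.0, 3.5]
predicted_losses = [1.01 * x for x in observed_losses]
mre_mean, mre_std = bootstrapped_mre(predicted_losses, observed_losses)
print(f"MRE = {100 * mre_mean:.2f}% (std {100 * mre_std:.2f}%)")
```

The spread across resamples gives a rough sense of how stable the error estimate is.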

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Finetuning scaling law fits: The multiplicative law accurately predicts finetuning loss across domains.
Average across domains	MRE (Finetuning, %)	1.36	0.89	-0.47
Forgetting scaling law fits: The proposed law with mixture term (1+Bp) predicts pretraining loss accurately.
Average across domains	MRE (Forgetting, %)	0.82	0.40	-0.42
Extrapolation capability: Laws fitted on smaller models/data predict larger scale performance.
Large/XL Models on 9M/30M tokens	MRE (Prediction Error, %)	N/A	0.83	N/A
Capacity dependence of forgetting: Smaller models suffer more severe forgetting.
Pretraining Loss Progress (XL 1.27B vs. Tiny 41M)	% Progress Lost	~20 (XL)	up to 95 (Tiny)	75

Experiment Figures

Trade-off between Finetuning Loss (Generalization) and Pretraining Loss (Memorization/Forgetting) for 5 model sizes.

Validation of the scaling law predictions against empirical measurements for Finetuning Loss on Wikipedia.

Validation of the scaling law predictions against empirical measurements for Forgetting (Pretraining Loss) on Github.

Main Takeaways

Injecting p=1% pretraining data is a robust 'rule of thumb' that largely eliminates catastrophic forgetting without harming finetuning performance
Forgetting is heavily dependent on model capacity: small models must overwrite general knowledge to learn the target task, while large models have spare capacity
Scaling laws for finetuning and forgetting are multiplicative, not additive; finetuning loss is largely independent of mixture p (for small p), while forgetting depends strongly on p
Domains distinct from pretraining data (e.g., Mathematics) induce more forgetting and benefit more from data injection (higher B coefficient) than similar domains (e.g., Wikipedia)
Only ~0.3 unique pretraining tokens per unique finetuning token are needed to prevent forgetting (efficient data usage)
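
To make the last takeaway concrete, the sketch below applies the ~0.3 ratio to the five finetuning set sizes used in the study. It assumes the ratio can be read as a simple linear rule, which is an illustrative simplification; the names `PT_TOKENS_PER_FT_TOKEN` and `pretraining_tokens_needed` are hypothetical.

```python
# Finetuning set sizes from the study, in tokens.
FINETUNING_SIZES = [300_000, 900_000, 3_000_000, 9_000_000, 30_000_000]

# Approximate ratio reported in the takeaways (treated here as exact for illustration).
PT_TOKENS_PER_FT_TOKEN = 0.3


def pretraining_tokens_needed(ft_tokens: int, ratio: float = PT_TOKENS_PER_FT_TOKEN) -> int:
    """Rough count of unique pretraining tokens to set aside for injection."""
    return round(ft_tokens * ratio)


budget = {n: pretraining_tokens_needed(n) for n in FINETUNING_SIZES}
for ft_tokens, pt_tokens in budget.items():
    print(f"{ft_tokens:>12,} finetuning tokens -> ~{pt_tokens:,} pretraining tokens")
```

Under this reading, even the largest finetuning set (30M tokens) calls for only about 9M unique pretraining tokens.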

📚 Prerequisite Knowledge

Prerequisites

Understanding of Neural Scaling Laws (power laws linking loss to compute/data/size)
Familiarity with LLM pretraining and finetuning workflows
Concept of catastrophic forgetting in continual learning

Key Terms

Scaling Laws: Empirical power-law relationships that predict model performance (loss) based on scale factors like parameter count (N) and dataset size (D)

Catastrophic Forgetting: The tendency of neural networks to abruptly lose previously learned information upon learning new information

Pretraining Data Injection: Mixing a small fraction of the original pretraining data into the finetuning batch to preserve general capabilities (also called replay or mixing)

U-curve: The trajectory of validation loss during training, which decreases initially (learning) and then increases (overfitting); the minimum point is the optimal stopping point

IsoFLOPS: A constraint or analysis method fixing the total floating-point operations (compute budget) to find optimal trade-offs between model size and training tokens

Effective Parameters: A conceptual adjustment in the scaling law, (1 + Bp)N, representing how data injection effectively increases the model capacity available for the pretraining task

Rewarming: The phenomenon where finetuning starts with a learning rate higher than the final pretraining LR, causing a slight initial spike in loss before settling

Huber Loss: A robust loss function used here for fitting scaling law coefficients, less sensitive to outliers than squared error
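
For reference, the standard Huber loss is quadratic for small residuals and linear for large ones, which is why outliers pull the fit less than they would under squared error. The sketch below uses the textbook definition; the threshold `delta=1.0` is an assumed default, not a value taken from the paper.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic when |residual| <= delta, linear beyond it."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


# A residual of 3.0 costs 2.5 under Huber (delta = 1) versus 4.5 under 0.5 * r**2.
for r in (0.5, 3.0, -3.0):
    print(f"residual {r:+.1f}: huber = {huber(r):.3f}, half-squared = {0.5 * r * r:.3f}")
```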

Bootstrapped MRE: Mean Relative Error calculated via bootstrap resampling to estimate the predictive accuracy and stability of the scaling law fit