
The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W. Larsen, Jason Chan Lee, Katherine L. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J. Zico Kolter, et al.
DatologyAI
arXiv (2026)
Pretraining Reasoning Benchmark

📝 Paper Summary

Domain Adaptation · Pretraining Strategies · Continual Learning / Forgetting
Interleaving domain-specific data throughout pretraining (Specialized Pretraining) yields better performance and less overfitting than the standard practice of reserving domain data exclusively for finetuning.
Core Problem
Standard domain adaptation treats pretraining and finetuning as disjoint phases, reserving specialized data for finetuning. This often leads to rapid overfitting on the small domain dataset and catastrophic forgetting of general knowledge.
Why it matters:
  • Organizations often rely on finetuning for proprietary data (legal, medical), assuming it is the most efficient path, but this may yield suboptimal models compared to early data integration.
  • Finetuning on small corpora requires aggressive updates that degrade general capabilities, while pretraining models from scratch is often viewed as too expensive.
  • Current scaling laws do not account for the trade-off between repeating domain data during pretraining and reserving it for finetuning.
Concrete Example: A 1B model trained with standard pretraining (Web data) followed by finetuning on 'ProofPile' (math) overfits rapidly after ~5 epochs. In contrast, a model that sees ProofPile mixed into pretraining (SPT) sustains performance improvement for far longer and matches the performance of a 3B standard model.
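The interleaving strategy contrasted above can be sketched as a data-mixing step. This is a minimal illustration of the idea, not the paper's actual pipeline: the ~2% mixing fraction and the ~50x repetition cap come from the summary, while the function name, sampling scheme, and all other details are assumptions.

```python
import math
import random

def build_spt_mixture(web_docs, domain_docs, domain_frac=0.02,
                      max_repeats=50, total=100_000, seed=0):
    """Illustrative SPT mixture: interleave a small fraction of domain
    data into the general pretraining stream, repeating the small domain
    corpus as needed but capping repetitions to limit overfitting."""
    rng = random.Random(seed)
    n_domain = int(total * domain_frac)
    # Repeat the domain corpus just enough to fill its quota, capped at max_repeats.
    repeats = min(max_repeats, math.ceil(n_domain / len(domain_docs)))
    domain_part = (domain_docs * repeats)[:n_domain]
    # Fill the rest of the stream with general web data, then shuffle so the
    # domain documents are spread throughout pretraining rather than saved for the end.
    mixture = rng.choices(web_docs, k=total - len(domain_part)) + domain_part
    rng.shuffle(mixture)
    return mixture
```

With the default 2% fraction, a 100k-document stream would contain about 2k domain documents drawn from repeated passes over the small domain corpus.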
Key Novelty
Specialized Pretraining (SPT)
  • Mix a small fraction of domain-specific data (e.g., 2%) into the general pretraining corpus from the start, repeating it as necessary (up to ~50x), rather than saving it for finetuning.
  • Derives 'overfitting scaling laws' that model test loss as the sum of a learning term (a power law in tokens seen) and an overfitting term (a gap that grows with the number of data repetitions), allowing prediction of optimal data mixing ratios.
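The summary describes the overfitting scaling law only qualitatively. A minimal functional form consistent with that description (irreducible loss, plus a power-law learning term, plus a repetition-dependent overfitting gap) might look like the following; all constants and the exact shape of the gap term are illustrative assumptions, not the paper's fitted law.

```python
def spt_test_loss(tokens, repetitions, *,
                  A=10.0, alpha=0.3, E=1.5, B=0.05, beta=1.2):
    """Illustrative test-loss model: E is the irreducible loss, the
    learning term falls as a power law in tokens seen, and the
    overfitting gap grows with repetitions of the domain data.
    All constants are made up for illustration."""
    learning = A * tokens ** (-alpha)       # power-law improvement with data
    overfit_gap = B * repetitions ** beta   # grows as domain data is repeated
    return E + learning + overfit_gap
```

Under a form like this, the optimal mixing ratio balances the two terms: more repetitions of domain data improve the learning term for the domain but inflate the overfitting gap, which is what the paper's scaling laws are said to trade off.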
Evaluation Highlights
  • On the 'ProofPile' domain, a 1B parameter SPT model outperforms a 3B parameter standard model, closing more than 100% of the 1B-to-3B performance gap.
  • SPT reduces the pretraining tokens needed to reach a specific domain loss by up to 1.75x compared to standard pretraining (on MusicPile).
  • Improves downstream accuracy by up to 6 percentage points on MATH and 4 percentage points on MusicTheoryBench compared to the finetuning-only baseline.
Breakthrough Assessment
8/10
Challenges the standard industry practice of 'pretrain then finetune' for domain adaptation. Provides actionable scaling laws and demonstrates that smaller, specialized models can beat larger general models.