SPT: Specialized Pretraining—the proposed method of mixing domain data into the pretraining corpus from the start.
NPT: Naive Pretraining—standard pretraining on general web data (e.g., Dolma) without upweighted domain data.
Dolma: A large open-source dataset of web text used for general pretraining in this paper.
WSD: Warmup-Stable-Decay—a learning rate schedule with a linear warmup, a constant (stable) phase, and a final decay, used here during finetuning.
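The three phases of WSD can be sketched as a simple step-to-rate function. This is an illustrative sketch, not the paper's exact hyperparameters: the linear decay shape and all phase lengths below are assumptions.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Return the learning rate at `step` under a Warmup-Stable-Decay schedule.

    Illustrative only: phase lengths and the linear decay-to-zero shape
    are assumed, not taken from the paper.
    """
    if step < warmup_steps:                 # linear warmup to peak
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:  # constant (stable) phase
        return peak_lr
    # linear decay toward zero over `decay_steps`
    t = step - warmup_steps - stable_steps
    return peak_lr * max(0.0, 1.0 - t / decay_steps)
```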
Compute Multiplier: The factor by which SPT reduces the number of pretraining tokens needed to reach a target domain loss compared to NPT.
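The compute multiplier is a simple token ratio at matched loss. The numbers below are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
# Hypothetical example: suppose NPT needs 200B tokens to reach a target
# domain loss that SPT reaches in 50B tokens (made-up values).
npt_tokens = 200e9
spt_tokens = 50e9

# Compute multiplier: how many times fewer tokens SPT needs than NPT
# to reach the same target domain loss.
compute_multiplier = npt_tokens / spt_tokens  # 4.0 under these assumed numbers
```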
Overfitting Scaling Laws: Mathematical formulations derived in the paper to predict the optimal domain data fraction by modeling loss as a combination of learning and overfitting terms.
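One common functional form for such a decomposition, shown purely as an illustration (the paper's exact parameterization may differ): a power-law learning term that shrinks with total tokens, plus an overfitting term that grows as the domain subset is repeated. Here \(T\) is total pretraining tokens, \(p\) the domain data fraction, \(D_{\mathrm{dom}}\) the size of the unique domain set, and \(E, A, B, \alpha, \beta\) are fitted constants.

```latex
% Illustrative form only, not the paper's exact law:
L(T, p) \;=\; E \;+\; \frac{A}{T^{\alpha}} \;+\; B\left(\frac{pT}{D_{\mathrm{dom}}}\right)^{\beta}
```

The optimal domain fraction would then be the \(p\) that balances the benefit of more domain exposure against the growth of the overfitting term.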
Replay: A strategy of mixing general pretraining data back into the finetuning stage to mitigate forgetting.
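Replay amounts to sampling each finetuning batch from a mixture of domain and general data. A minimal sketch, assuming a 10% replay ratio (an example value, not the paper's setting):

```python
import random

def replay_batch(domain_examples, general_examples, batch_size,
                 replay_frac=0.1, seed=0):
    """Sample a finetuning batch where `replay_frac` of the examples are
    replayed general pretraining data. The 10% default is an assumed
    illustrative value, not the paper's setting."""
    rng = random.Random(seed)
    n_replay = int(round(batch_size * replay_frac))
    batch = rng.sample(general_examples, n_replay)          # replayed general data
    batch += rng.sample(domain_examples, batch_size - n_replay)  # domain data
    rng.shuffle(batch)
    return batch
```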
JSD: Jensen-Shannon Divergence—a symmetric, bounded measure of the similarity between two probability distributions (here, data distributions).
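The JSD of two distributions is the average KL divergence of each to their midpoint; in base 2 it lies in [0, 1]. A self-contained sketch:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between distributions p and q.

    Symmetric in its arguments and bounded in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # normalize to valid distributions
    m = 0.5 * (p + q)                 # midpoint distribution

    def kl(a, b):
        # KL divergence with a small epsilon for numerical stability
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```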