midtraining: An intermediate training phase between general pretraining and task-specific posttraining that mixes specialized domain data with general data
posttraining: The final stage of training, typically supervised fine-tuning (SFT) on a specific target dataset
catastrophic forgetting: The tendency of a neural network to abruptly lose previously learned information upon learning new information
proximity advantage: A metric quantifying how much closer a midtraining dataset is to the target dataset compared to the original pretraining dataset, based on token statistics
continued pretraining: Training a pretrained model further on domain-specific data alone, without mixing in general pretraining data
plasticity window: A period early in training where the model's representations are malleable enough to adjust to new distributions without performance degradation
SFT: Supervised Fine-Tuning—training on input-output pairs to adapt the model to a specific task
C4: Colossal Clean Crawled Corpus—a large dataset of web text used for general pretraining
StarCoder: A large dataset of code used for midtraining in the programming domain
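The proximity advantage entry above can be made concrete with a small sketch. The glossary says only that the metric is "based on token statistics", so the specific distance below (total variation distance between unigram token distributions) and the function names are illustrative assumptions, not the definition used in the source:

```python
from collections import Counter

def token_distribution(tokens):
    """Unigram token distribution of a corpus (list of tokens)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def tv_distance(p, q):
    """Total variation distance between two unigram distributions.
    (Assumed distance; the source only says 'token statistics'.)"""
    vocab = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in vocab)

def proximity_advantage(pretrain_tokens, midtrain_tokens, target_tokens):
    """Positive when the midtraining corpus is closer to the target
    corpus than the pretraining corpus is; zero or negative otherwise."""
    target = token_distribution(target_tokens)
    pretrain = token_distribution(pretrain_tokens)
    midtrain = token_distribution(midtrain_tokens)
    return tv_distance(target, pretrain) - tv_distance(target, midtrain)

# Toy illustration: a code-like midtraining corpus should show a
# positive proximity advantage for a code-like target.
pretrain = "the cat sat on the mat".split()
midtrain = "def foo return x".split()
target = "def bar return y".split()
print(proximity_advantage(pretrain, midtrain, target))  # > 0
```

Under this assumed distance, a positive value indicates the midtraining data sits between the pretraining and target distributions, which is the situation midtraining is meant to exploit.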