Scaling Laws: Empirical power-law relationships that predict model performance (loss) based on scale factors like parameter count (N) and dataset size (D)
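A scaling law of this kind can be sketched in a few lines. The Chinchilla-style functional form below (irreducible loss plus power-law terms in N and D) is a common choice; the constants are illustrative placeholders, not fitted values from this document.

```python
# Illustrative Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta.
# All constants are placeholder values for demonstration, not fitted results.
def predicted_loss(N, D, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    """Predict loss from parameter count N and dataset size D (tokens)."""
    return E + A / N**alpha + B / D**beta

# More scale (larger N and D) drives predicted loss toward the
# irreducible term E.
loss_small = predicted_loss(N=1e8, D=1e10)
loss_large = predicted_loss(N=1e10, D=1e12)
assert loss_large < loss_small
```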
Catastrophic Forgetting: The tendency of neural networks to abruptly lose previously learned information upon learning new information
Pretraining Data Injection: Mixing a small fraction of the original pretraining data into the finetuning batch to preserve general capabilities (also called replay or mixing)
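The mixing step can be sketched directly: build each finetuning batch with a fraction p of pretraining examples. The function and dataset names below are hypothetical, chosen for illustration.

```python
import random

# Sketch of pretraining-data injection (replay/mixing): each finetuning
# batch contains a fraction p of pretraining examples. Names are illustrative.
def mixed_batch(finetune_data, pretrain_data, batch_size, p, rng=random):
    """Sample a batch with round(p * batch_size) pretraining examples."""
    n_pre = round(p * batch_size)
    batch = (rng.sample(pretrain_data, n_pre)
             + rng.sample(finetune_data, batch_size - n_pre))
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch

ft = [("ft", i) for i in range(100)]  # toy finetuning examples
pt = [("pt", i) for i in range(100)]  # toy pretraining examples
batch = mixed_batch(ft, pt, batch_size=32, p=0.05)
assert sum(1 for tag, _ in batch if tag == "pt") == round(0.05 * 32)
```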
U-curve: The trajectory of validation loss during training, which decreases initially (learning) and then increases (overfitting); the minimum point is the optimal stopping point
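Picking the optimal stopping point on such a curve amounts to taking its minimum. The loss values below are synthetic, purely to illustrate the shape.

```python
# Synthetic U-shaped validation-loss trajectory: the loss first decreases
# (learning), then increases (overfitting); the optimal stop is the minimum.
val_loss = [2.50, 2.31, 2.20, 2.18, 2.23, 2.35, 2.51]

best_step = min(range(len(val_loss)), key=val_loss.__getitem__)
assert val_loss[best_step] == min(val_loss)  # stop at the bottom of the U
```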
IsoFLOPS: An analysis method that fixes the total floating-point operations (compute budget) in order to find the optimal trade-off between model size and training tokens
Effective Parameters: A conceptual adjustment to the scaling law, replacing N with (1 + Bp)N, representing how injecting a fraction p of pretraining data effectively increases the model capacity available for the pretraining task
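An IsoFLOP sweep can be sketched using the common approximation that training cost is C ≈ 6·N·D FLOPs; under that assumption, fixing C pins down the token budget for each candidate model size. The loss at each pair would then come from actual training runs.

```python
# Sketch of an IsoFLOP sweep: fix a compute budget C and, using the common
# approximation C ≈ 6 * N * D, enumerate (model size, token count) pairs
# that all cost the same total FLOPs.
def isoflop_pairs(C, model_sizes):
    """For each model size N, return the token budget D with 6*N*D == C."""
    return [(N, C / (6 * N)) for N in model_sizes]

budget = 1e21  # total training FLOPs
for N, D in isoflop_pairs(budget, [1e8, 1e9, 1e10]):
    assert abs(6 * N * D - budget) / budget < 1e-9  # same compute everywhere
```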
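Plugging the effective-parameter term into a Chinchilla-style law gives a small sketch of the adjustment. All constants here (including the injection coefficient) are illustrative placeholders, not values fitted in this document.

```python
# Sketch of the effective-parameter adjustment: replace N with (1 + B*p)*N
# in a Chinchilla-style scaling law. B_coef and the other constants are
# placeholder values for demonstration only.
def loss_with_injection(N, D, p, B_coef=5.0, E=1.7, A=400.0, B_data=410.0,
                        alpha=0.34, beta=0.28):
    """p is the injected pretraining-data fraction; larger p boosts capacity."""
    N_eff = (1 + B_coef * p) * N
    return E + A / N_eff**alpha + B_data / D**beta

# Injecting pretraining data (p > 0) lowers loss on the pretraining task.
assert loss_with_injection(1e9, 1e11, p=0.1) < loss_with_injection(1e9, 1e11, p=0.0)
```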
Rewarming: The phenomenon where finetuning starts with a learning rate higher than the final pretraining LR, causing a slight initial spike in loss before settling
Huber Loss: A robust loss function used here for fitting scaling law coefficients, less sensitive to outliers than squared error
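The Huber loss itself is short enough to write out: quadratic near zero, linear in the tails, so a single outlier pulls the fit far less than under squared error. The delta value below is an illustrative choice.

```python
import numpy as np

# Huber loss: 0.5*r^2 for |r| <= delta, delta*(|r| - 0.5*delta) otherwise.
# delta controls where the penalty switches from quadratic to linear.
def huber(residuals, delta=1e-3):
    r = np.abs(residuals)
    quadratic = 0.5 * r**2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

# An outlier residual of 0.5 is penalized far less than under squared
# error (which would give 0.5 * 0.5**2 = 0.125).
assert huber(np.array([0.5]), delta=1e-3)[0] < 0.125
```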
Bootstrapped MRE: Mean Relative Error calculated via bootstrap resampling to estimate the predictive accuracy and stability of the scaling law fit
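A bootstrapped MRE estimate can be sketched as follows: resample (prediction, observation) pairs with replacement and report the mean and spread of the MRE across resamples. The data below is synthetic, for illustration only.

```python
import numpy as np

# Bootstrapped mean relative error (MRE): resample prediction/observation
# pairs with replacement to estimate the stability of the fit's accuracy.
# The arrays below are synthetic placeholder values.
rng = np.random.default_rng(0)
observed = np.array([3.1, 2.8, 2.5, 2.3, 2.1])
predicted = np.array([3.0, 2.9, 2.5, 2.2, 2.2])

def mre(pred, obs):
    """Mean relative error between predictions and observations."""
    return float(np.mean(np.abs(pred - obs) / obs))

boot = []
for _ in range(1000):
    idx = rng.integers(0, len(observed), size=len(observed))  # resample
    boot.append(mre(predicted[idx], observed[idx]))

print(f"MRE = {np.mean(boot):.3f} +/- {np.std(boot):.3f}")
```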