Sub-scaling: A phenomenon in which performance improvements decelerate faster than traditional power laws predict, often due to data redundancy or suboptimal resource allocation.
Data Density: A metric quantifying redundancy; high density means samples are clustered closely (repetitive), contributing less new information.
OTR: Over-Training Ratio—the ratio of training tokens D to model parameters N (D/N). High OTR indicates training a relatively small model on a massive amount of data.
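The OTR definition above is a simple ratio; a minimal sketch (the function name `over_training_ratio` is illustrative, not from the source):

```python
def over_training_ratio(tokens_d: float, params_n: float) -> float:
    """OTR = D / N: training tokens divided by model parameters.

    Values far above the Chinchilla-optimal ~20 tokens per parameter
    indicate a relatively small model trained on a large amount of data.
    """
    return tokens_d / params_n

# e.g. a 1B-parameter model trained on 100B tokens
otr = over_training_ratio(100e9, 1e9)  # OTR = 100, heavily over-trained
```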
Chinchilla Law: A scaling law proposing that for compute-optimal training, model size and training tokens should scale equally.
MAPE: Mean Absolute Percentage Error—a measure of prediction accuracy used to evaluate how well the scaling laws fit the actual loss curves.
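MAPE has a standard closed form; a minimal sketch of how it would score a fitted scaling-law curve against observed losses (the values below are made-up illustrations, not results from the source):

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean Absolute Percentage Error: mean of |actual - predicted| / |actual|, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

# e.g. observed losses vs. a scaling law's predictions (illustrative numbers)
fit_error = mape([2.0, 1.5, 1.2], [2.1, 1.45, 1.25])  # ~4.2% average deviation
```

A lower MAPE means the fitted law tracks the actual loss curve more closely.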
The Pile: A large-scale, diverse text dataset commonly used for training LLMs, consisting of 22 different domains.
Common Crawl: A massive dataset of web crawl data, often containing high redundancy.
FLOPs: Floating Point Operations—a measure of compute budget C. For Transformers, usually approximated as C ≈ 6 * N * D, where N is the parameter count and D is the number of training tokens.
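The 6ND approximation is a one-line computation; a minimal sketch (the function name `transformer_flops` is illustrative):

```python
def transformer_flops(params_n: float, tokens_d: float) -> float:
    """Approximate training compute for a Transformer: C ≈ 6 * N * D FLOPs.

    The factor 6 accounts for roughly 2 FLOPs per parameter in the forward
    pass and 4 in the backward pass, per training token.
    """
    return 6.0 * params_n * tokens_d

# e.g. a 1B-parameter model trained on 20B tokens
compute = transformer_flops(1e9, 20e9)  # 1.2e20 FLOPs
```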