Common Crawl (CC): A massive, open repository of web crawl data that serves as a primary pretraining source for many modern LLMs
Catastrophic Forgetting: The tendency of neural networks to abruptly forget previously learned information upon learning new information
Replay: A continual learning strategy where a portion of the training budget is allocated to data from previous time steps to prevent forgetting
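The replay entry above can be sketched as a simple batch-mixing routine. This is a minimal illustration, not any particular paper's implementation; the function name, the 25% replay fraction, and the string-based placeholder data are all assumptions chosen for clarity.

```python
import random

def mixed_batch(new_data, replay_buffer, batch_size, replay_fraction=0.25):
    """Build a training batch that reserves a fraction of the budget
    for examples from earlier time steps (illustrative sketch)."""
    n_replay = int(batch_size * replay_fraction)      # budget for old data
    n_new = batch_size - n_replay                     # budget for new data
    batch = (random.sample(new_data, n_new)
             + random.sample(replay_buffer, n_replay))
    random.shuffle(batch)
    return batch

# Placeholder corpora standing in for old and new crawl snapshots
old = [f"old_{i}" for i in range(100)]
new = [f"new_{i}" for i in range(100)]
batch = mixed_batch(new, old, batch_size=8)           # 6 new + 2 replayed items
```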
Oracle: A baseline model re-trained from scratch on all available data up to a certain point, representing the theoretical upper bound for performance
Backward Transfer: A metric measuring how well a model trained on newer data performs on evaluations drawn from older, previously seen data
Forward Transfer: A metric measuring how well a model trained on older data performs on future, unseen data
Chinchilla optimal: A compute-optimal ratio of training tokens to model parameters (approx. 20 tokens per parameter)
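The Chinchilla rule of thumb above is simple arithmetic; a one-line helper makes the scaling concrete (the function name is an assumption for illustration):

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal training-token count for a model
    of n_params parameters, using the ~20 tokens/parameter heuristic."""
    return n_params * tokens_per_param

# e.g. a 7B-parameter model would want roughly 140B training tokens
budget = chinchilla_tokens(7e9)
```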
Perplexity (ppl): The exponentiated average negative log-likelihood a model assigns to a sample; lower values indicate better predictions
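The perplexity definition above reduces to a few lines of code. A minimal sketch over per-token log-probabilities (natural log assumed):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A model assigning probability 0.25 to every token has perplexity ~4:
# it is, on average, as uncertain as a uniform choice among 4 options.
ppl = perplexity([math.log(0.25)] * 10)
```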
EWC: Elastic Weight Consolidation—a regularization method that slows down updates to parameters important for previous tasks
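The EWC entry above corresponds to a quadratic penalty, (λ/2) Σᵢ Fᵢ (θᵢ − θ*ᵢ)², where θ* are the parameters after the previous task and Fᵢ is the Fisher-information estimate of each parameter's importance. A minimal sketch in plain Python (real implementations work on framework tensors; names here are illustrative):

```python
def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """EWC regularizer: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2.
    Parameters important to old tasks (large F_i) are anchored more
    strongly, slowing their drift when training on new data."""
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for f, t, ts in zip(fisher, theta, theta_star))

# Moving an important parameter (F=10) costs far more than an
# unimportant one (F=0.1), even for the same displacement.
costly = ewc_penalty([1.0], [0.0], [10.0])   # important parameter moved by 1
cheap = ewc_penalty([1.0], [0.0], [0.1])     # unimportant parameter moved by 1
```

In training, this penalty is simply added to the new task's loss before computing gradients.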
LwF: Learning without Forgetting—a regularization method using knowledge distillation to preserve original model behavior