CPT: Continual Pre-Training—further training a pre-trained model on domain-specific data to adapt it to new tasks
CMR: Critical Mixture Ratio—the maximum proportion of domain data usable in CPT without significantly degrading general performance
Feasible Mixture Ratio: A data mixture ratio that allows domain loss to decrease while keeping general loss within a specified tolerance of its original value
General Loss: The model's prediction loss (e.g., cross-entropy) on a broad, general-purpose dataset (e.g., Common Crawl)
Domain Loss: The model's loss on a specific target dataset (e.g., Finance or Academic Papers)
Lagrange multiplier: A constrained-optimization technique, used here to weigh the constraint of preserving general performance against the objective of improving domain performance
Catastrophic Forgetting: The tendency of neural networks to abruptly forget previously learned information upon learning new information
Power-law: A functional relationship where one quantity varies as a power of another (e.g., Loss = a * Tokens^-b), commonly found in LLM scaling
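The interplay of these terms can be sketched in a toy example. The snippet below assumes illustrative power-law loss curves (all constants are made up, not fitted to any model) and finds the CMR as the largest mixture ratio whose general-loss increase stays within a tolerance, i.e., the largest feasible mixture ratio:

```python
# Hypothetical sketch: locating the Critical Mixture Ratio (CMR) under
# assumed power-law loss curves. Constants are illustrative placeholders.

def general_loss(ratio, tokens, a=10.0, b=0.08, penalty=1.5):
    # General loss follows a power law in tokens and rises linearly
    # as more of the data budget shifts to the domain (toy assumption).
    return a * tokens ** -b + penalty * ratio

def domain_loss(ratio, tokens, a=12.0, b=0.10, gain=3.0):
    # Domain loss falls as the domain share of the mixture grows.
    return a * tokens ** -b - gain * ratio

def critical_mixture_ratio(tokens=1e9, tolerance=0.05, step=0.01):
    """Largest ratio keeping general loss within `tolerance` of its
    ratio-0 baseline -- the maximum feasible mixture ratio."""
    base = general_loss(0.0, tokens)
    cmr, r = 0.0, 0.0
    while r <= 1.0:
        if general_loss(r, tokens) - base <= tolerance:
            cmr = r  # still feasible; remember the largest such ratio
        r = round(r + step, 10)
    return cmr

print(critical_mixture_ratio())  # largest feasible ratio under these toy curves
```

In practice the loss curves would be fitted from CPT runs rather than assumed, but the feasibility check (domain loss decreases while general loss stays within tolerance) has the same shape.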