SFT (Supervised Fine-Tuning): Adapting a pre-trained language model to a specific task using labeled examples
Loss Trajectory: The sequence of loss values recorded for a specific training example at multiple checkpoints throughout the training process
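A minimal sketch of how a loss trajectory could be collected; the helper name `loss_trajectory`, the toy scalar "checkpoints", and the squared-error loss are illustrative assumptions, not the paper's implementation:

```python
# Record the loss of one training example at several saved checkpoints;
# the resulting list of values is that example's loss trajectory.
def loss_trajectory(example, checkpoints, loss_fn):
    """Evaluate loss_fn(model, example) at each checkpoint, in order."""
    return [loss_fn(model, example) for model in checkpoints]

# Toy usage: each "checkpoint" is just a scalar weight w for y = w * x,
# and the loss is the squared error on a single (x, y) pair.
example = (2.0, 4.0)                      # (input x, target y)
checkpoints = [0.5, 1.0, 1.5, 2.0]        # weight w at four checkpoints
sq_loss = lambda w, ex: (w * ex[0] - ex[1]) ** 2
traj = loss_trajectory(example, checkpoints, sq_loss)
# traj decreases as the weight approaches the target solution w = 2
```

In the paper's setting the checkpoints would be full model snapshots and `loss_fn` the per-example training loss, but the trajectory structure is the same.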
Proxy Model: A significantly smaller model (e.g., 70M params) used to compute efficient data selection metrics for a larger target model (e.g., 7B params)
Hessian: A matrix of second-order partial derivatives of the loss function, representing the curvature of the loss landscape
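The definition above can be written out explicitly; for a loss $\mathcal{L}(\theta)$ over parameters $\theta \in \mathbb{R}^d$:

```latex
H_{ij}(\theta) \;=\; \frac{\partial^2 \mathcal{L}(\theta)}{\partial \theta_i \, \partial \theta_j},
\qquad i, j = 1, \dots, d
```

Each entry measures how the gradient in direction $\theta_i$ changes as $\theta_j$ varies, so the eigenvalues of $H$ describe the local curvature of the loss landscape.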
Incremental Gradient (IG): Optimization methods like Stochastic Gradient Descent that update parameters iteratively based on gradients of individual examples or mini-batches
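A minimal sketch of an incremental gradient method in the sense defined above, shown as plain SGD on a 1-D linear model; the function name, learning rate, and toy data are assumptions for illustration only:

```python
# One SGD update from a single example: w <- w - lr * d/dw (w*x - y)^2.
def sgd_step(w, x, y, lr):
    grad = 2.0 * (w * x - y) * x   # gradient of the squared error w.r.t. w
    return w - lr * grad

# Toy data consistent with the optimum w* = 2 for the model y = w * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0
for _ in range(50):                # repeated passes over the data
    for x, y in data:              # incremental: one example per update
        w = sgd_step(w, x, y, lr=0.05)
# w converges to roughly 2.0
```

The key property is that parameters are updated after each individual example (or mini-batch) rather than after a full pass over the dataset, which is what distinguishes IG methods from batch gradient descent.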
MathInstruct: A large-scale dataset of mathematical problems and solutions used for instruction tuning of LLMs
MIMIC-III: A widely used dataset containing de-identified health data associated with intensive care unit admissions, used here for clinical text summarization