pass rate: The probability p that the student model generates a correct solution for a given problem x, estimated via K rollouts
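A Monte Carlo sketch of this estimate, assuming hypothetical `generate` and `check` stand-ins for the student model and a correctness verifier (neither is defined in the source):

```python
import random

def estimate_pass_rate(problem, generate, check, k=16):
    """Estimate pass rate p as the fraction of k rollouts judged correct."""
    correct = sum(check(problem, generate(problem)) for _ in range(k))
    return correct / k

# Toy stand-ins: a "model" that answers correctly ~70% of the time.
random.seed(0)
generate = lambda x: 4 if random.random() < 0.7 else 5
check = lambda x, y: y == 4
p_hat = estimate_pass_rate("2+2", generate, check, k=1000)
```

With k=1000 rollouts the estimate concentrates near the true pass rate of 0.7; the k=16 default reflects the much smaller rollout budgets typical in practice.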
forward KL: Minimizing KL(P_Teacher || P_Student); forces the student to cover all modes of the teacher's distribution (preventing mode collapse but potentially including low-probability tails)
reverse KL: Minimizing KL(P_Student || P_Teacher); forces the student to focus on high-probability modes of the teacher (mode seeking/consolidation)
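The asymmetry between the two KL directions can be seen on a small discrete example (the distributions here are illustrative, not from the source):

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.5, 0.3, 0.2]
student = [0.7, 0.2, 0.1]

forward = kl(teacher, student)  # KL(P_Teacher || P_Student): large wherever the
                                # student assigns little mass to a teacher mode
reverse = kl(student, teacher)  # KL(P_Student || P_Teacher): large wherever the
                                # student puts mass where the teacher has little
```

Driving the student's mass on a teacher mode toward zero blows up the forward KL (hence mode covering), while putting student mass where the teacher has none blows up the reverse KL (hence mode seeking).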
Beta kernel: A weight function w(p) = p^α(1-p)^β that peaks at intermediate pass rates and vanishes at 0 and 1, matching the theoretical SNR profile of distillation gradients
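The kernel itself is a one-liner; a minimal sketch (the default exponents here are illustrative):

```python
def beta_kernel(p, alpha=1.0, beta=1.0):
    """Weight w(p) = p^alpha * (1-p)^beta: zero at p=0 and p=1,
    peaking at the intermediate pass rate p = alpha / (alpha + beta)."""
    return (p ** alpha) * ((1.0 - p) ** beta)
```

Problems the student always fails (p=0) or always solves (p=1) receive zero weight, while problems at intermediate pass rates, where the gradient SNR is highest, dominate the training signal.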
zone of proximal development: The set of problems where the student is neither fully incompetent nor fully masterful, representing the most efficient training signal
minimax-robust: A guarantee that the worst-case efficiency loss is bounded even if the true SNR profile deviates from the assumed Beta model
gradient signal-to-noise ratio (SNR): The ratio of the squared norm of the expected gradient to the trace of the gradient covariance matrix; a measure of learning efficiency
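An empirical estimator of this ratio from a batch of sampled per-example gradients, as a sketch (the estimator and test data are illustrative, not from the source):

```python
import numpy as np

def gradient_snr(grads):
    """SNR = ||E[g]||^2 / tr(Cov(g)) from sampled gradient vectors.

    grads: array of shape (n_samples, dim), one gradient per row.
    tr(Cov) equals the sum of per-coordinate sample variances."""
    g = np.asarray(grads, dtype=float)
    mean = g.mean(axis=0)
    cov_trace = g.var(axis=0, ddof=1).sum()
    return float(mean @ mean) / cov_trace

rng = np.random.default_rng(0)
# Consistent gradients (large mean, small noise) -> high SNR, efficient learning.
strong = gradient_snr(rng.normal(1.0, 0.1, size=(100, 5)))
# Gradients that mostly cancel (zero mean, large noise) -> SNR near zero.
weak = gradient_snr(rng.normal(0.0, 1.0, size=(100, 5)))
```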
SFT: Supervised Fine-Tuning—standard training on labeled data (often hard labels)
MMLU: Massive Multitask Language Understanding—a benchmark measuring general knowledge across many subjects, used here to measure catastrophic forgetting
MATH-500: A benchmark dataset of 500 mathematics problems used to evaluate reasoning capability
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark