Reverse KL: A divergence measure (KL(Student || Teacher)) that penalizes the student for generating samples that are unlikely under the teacher; it is mode-seeking, letting the student concentrate on a single teacher mode without being penalized for the modes it never visits
Forward KL: A divergence measure (KL(Teacher || Student)) that penalizes the student for assigning low probability to samples likely under the teacher; it is mode-covering and forces the student to match the full teacher distribution
Mode-seeking: Behavior where a model focuses on a single high-probability peak of a distribution, ignoring others
Mode-covering: Behavior where a model spreads its probability mass to cover all high-probability peaks of the target distribution
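The asymmetry between the two divergences can be seen numerically. Below is a minimal sketch using a hypothetical four-token vocabulary in which the teacher is bimodal and the student has collapsed onto just one of the teacher's modes (all probability values are made up for illustration):

```python
import math

def kl(p, q):
    # KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0 contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over a 4-token vocabulary:
# the teacher has two modes (tokens 0 and 2); the student sits on token 0 only.
teacher = [0.45, 0.05, 0.45, 0.05]
student = [0.88, 0.04, 0.04, 0.04]

forward = kl(teacher, student)  # KL(Teacher || Student): heavily penalizes the missed mode
reverse = kl(student, teacher)  # KL(Student || Teacher): tolerates dropping a mode

print(f"forward KL = {forward:.3f}")  # larger: the student ignored a teacher mode
print(f"reverse KL = {reverse:.3f}")  # smaller: mode-seeking is cheap under reverse KL
```

Because the reverse-KL sum is weighted by the student's own probabilities, the dropped teacher mode contributes almost nothing to it, so reverse KL comes out lower than forward KL for the same pair of distributions.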
Entropy: A measure of uncertainty or randomness in a distribution; high entropy means probability is spread across many tokens
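As a quick illustration of the entropy definition, a sketch over hypothetical token distributions (the probability values are made up):

```python
import math

def entropy(p):
    # Shannon entropy in nats: H(p) = -sum_x p(x) * log p(x); 0 * log 0 taken as 0
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]      # mass spread across many tokens -> high entropy
peaked  = [0.97, 0.01, 0.01, 0.01]      # nearly deterministic -> low entropy

print(f"uniform: {entropy(uniform):.3f}")  # equals log(4) ~ 1.386 nats, the maximum
print(f"peaked:  {entropy(peaked):.3f}")
```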
On-Policy Distillation: Training a student model using samples generated by the student itself (corrected by the teacher), rather than fixed offline data
Pass@k: An evaluation metric that counts a problem as solved if at least one correct solution is found among k generated samples
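In practice, pass@k is usually estimated with the unbiased estimator popularized by the Codex paper: given n generated samples of which c are correct, it computes the probability that a random subset of k samples contains at least one correct one. A minimal sketch (the n, c, k values below are hypothetical):

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k: 1 - C(n - c, k) / C(n, k), the probability that
    # at least one of k samples drawn without replacement from n
    # generations (c of them correct) is correct.
    if n - c < k:
        return 1.0  # not enough incorrect samples to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical run: 10 samples per problem, 3 correct.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

Note that pass@k grows with k even when the per-sample success rate is fixed, which is why it rewards diverse (high-entropy) samplers.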