OPCD: On-Policy Context Distillation—the proposed method that trains a student on its own generations to minimize reverse KL against a context-aware teacher
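A minimal numeric sketch of one OPCD training step, assuming the essentials stated above (student samples its own rollout; the loss is per-token reverse KL against a context-aware teacher). The logit tables, vocabulary size, and the "context as a fixed logit shift" are toy stand-ins, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

V, T = 8, 5  # toy vocabulary size and rollout length

# Toy stand-ins: in the real method these come from the student model and from
# the teacher model conditioned on the extra context (here just a fixed shift).
student_logits = rng.normal(size=(T, V))
teacher_logits = student_logits + rng.normal(size=V)

total_loss, tokens = 0.0, []
for t in range(T):
    p_s = softmax(student_logits[t])
    p_t = softmax(teacher_logits[t])
    # On-policy: the trajectory is sampled from the STUDENT itself, so the
    # loss is measured on states the student actually visits at test time.
    tokens.append(int(rng.choice(V, p=p_s)))
    # Per-token reverse KL(student || teacher), summed over the rollout.
    total_loss += float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))
```

Because the rollout comes from the student rather than from teacher trajectories, the gradient of `total_loss` would push the student toward the teacher exactly on the distribution of states it reaches on its own, which is the on-policy property the term describes.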
Reverse KL Divergence: An objective function (KL(Student || Teacher)) that penalizes the student for generating samples unlikely under the teacher, encouraging mode-seeking behavior
Forward KL Divergence: The standard objective (KL(Teacher || Student)) used in most distillation, which penalizes the student for missing parts of the teacher's distribution, often causing mode-covering (broad/blurry) outputs
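The mode-seeking vs. mode-covering contrast between the two objectives can be shown numerically. The distributions below are illustrative toys (not from the paper): a bimodal teacher and a student that commits to a single mode:

```python
import numpy as np

# Toy distributions over a 4-symbol vocabulary (numbers are illustrative only):
# the teacher is bimodal, the student puts nearly all mass on one mode.
teacher = np.array([0.48, 0.02, 0.02, 0.48])
student = np.array([0.94, 0.02, 0.02, 0.02])

def kl(p, q):
    """KL(p || q) for dense distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

forward_kl = kl(teacher, student)  # KL(Teacher || Student): punishes the missed mode
reverse_kl = kl(student, teacher)  # KL(Student || Teacher): only cares where the student puts mass
```

Here the single-mode student scores far better under reverse KL than under forward KL: forward KL is dominated by the teacher's second mode that the student ignores, while reverse KL only evaluates the student where it actually places probability.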
Exposure Bias: The error accumulation that occurs when a model is trained on ground-truth/teacher trajectories but generates its own tokens autoregressively at test time
Experiential Knowledge Distillation: A process where a model solves problems, extracts lessons (experiences) from its traces, accumulates them into a context, and then distills that context into its weights
System Prompt Distillation: Compressing the behavioral instructions of a system prompt (e.g., 'You are a medical expert') into the model's weights so the prompt isn't needed at inference
Mode-seeking: A behavior where a model focuses on the most likely output (peak) of the target distribution rather than trying to cover all possibilities
Top-k approximation: Approximating the sum over the entire vocabulary by summing only the top-k most probable tokens to make KL calculation computationally feasible
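A sketch of the truncation, under the assumption that the top-k tokens are ranked by the student's probabilities (the ranking distribution is a choice the glossary entry leaves open). The logits are toy values chosen so the student is peaked:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy logits with a peaked student distribution (values are illustrative).
student = softmax(np.array([8.0, 6.0, 5.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1]))
teacher = softmax(np.array([7.0, 6.5, 4.0, 1.0, 0.2, 0.3, 0.1, 0.0, -0.5, 0.2]))

def reverse_kl_full(p_s, p_t):
    # Exact sum over the whole vocabulary.
    return float(np.sum(p_s * (np.log(p_s) - np.log(p_t))))

def reverse_kl_topk(p_s, p_t, k):
    # Restrict the sum to the k tokens most probable under the student; the
    # dropped tail terms are weighted by near-zero student probabilities.
    idx = np.argpartition(p_s, -k)[-k:]
    return float(np.sum(p_s[idx] * (np.log(p_s[idx]) - np.log(p_t[idx]))))

full = reverse_kl_full(student, teacher)
approx = reverse_kl_topk(student, teacher, k=3)
```

With a peaked student, the k = 3 truncation lands within a small fraction of the exact value while touching only 3 of the 10 entries; for a real LM vocabulary of tens of thousands of tokens, the same idea is what makes the per-token KL affordable.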