SFT: Supervised Fine-Tuning—training a model to maximize the likelihood of ground-truth responses (equivalent to minimizing Forward KL)
RL: Reinforcement Learning—training a model to maximize a reward signal, often using on-policy generations (equivalent to minimizing Reverse KL)
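To ground the SFT entry, here is a minimal toy sketch (hypothetical 4-token vocabulary and hand-written model probabilities, not from the original text) of the SFT objective: maximizing the likelihood of a ground-truth response is the same as minimizing its summed negative log-probability under the model.

```python
import math

# Toy SFT objective: maximize likelihood of a ground-truth token sequence
# == minimize the summed negative log-probability the model assigns to it.
vocab = ["yes", "no", "maybe", "<eos>"]
model_probs = [  # p_model(token_t | prefix), one row per position (hand-picked)
    {"yes": 0.7, "no": 0.1, "maybe": 0.1, "<eos>": 0.1},
    {"yes": 0.05, "no": 0.05, "maybe": 0.1, "<eos>": 0.8},
]
target = ["yes", "<eos>"]  # the ground-truth response

sft_loss = -sum(math.log(model_probs[t][tok]) for t, tok in enumerate(target))
# sft_loss = -(log 0.7 + log 0.8); training lowers it by raising these probabilities
```

Averaged over a dataset, this loss is the cross-entropy between the data distribution and the model, which differs from the forward KL only by the (constant) entropy of the data.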
Forward KL: Kullback-Leibler divergence direction KL(P_target || P_model), which forces the model to cover the entire target distribution (mode-covering)
Reverse KL: Kullback-Leibler divergence direction KL(P_model || P_target), which allows the model to focus on the highest probability regions of the target (mode-seeking)
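The mode-covering/mode-seeking contrast between the two KL directions can be checked numerically. The sketch below (toy bimodal Gaussian-mixture target and two hand-picked candidate models, chosen here for illustration) evaluates both KL directions on a grid: forward KL prefers the broad candidate that covers both modes, while reverse KL prefers the narrow candidate that sits on a single mode.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated on array x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]
eps = 1e-300  # guard against log(0) in near-zero tails

# Bimodal target: equal mixture of two narrow Gaussians at -3 and +3.
p = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)
q_seek = gauss(x, 3, 0.5)    # unimodal model sitting on one mode
q_cover = gauss(x, 0, 3.0)   # broad model stretched over both modes

def kl(a, b):
    """Numerical KL(a || b) via Riemann sum over the grid."""
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

# Forward KL(P || Q): penalizes Q for missing mass where P has mass,
# so the covering model scores better.
# Reverse KL(Q || P): penalizes Q for putting mass where P is near zero
# (the valley between modes), so the mode-seeking model scores better.
```

This is the usual explanation for why SFT-style (forward-KL) training spreads probability across all demonstrated behaviors, while RL-style (reverse-KL) training sharpens onto a subset of high-reward behaviors.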
On-policy data: Training data generated by the model currently being trained (used in RL), as opposed to fixed external data
GRPO: Group Relative Policy Optimization—an RL algorithm used for tasks with verifiable outputs that normalizes rewards within a group of generations
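The group-wise reward normalization named in the GRPO entry can be sketched in a few lines. This is a minimal illustration of the normalization step only (the surrounding policy-gradient update is omitted), assuming a group of generations sampled for a single prompt and scored by a 0/1 verifier.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of generations for the same prompt:
    advantage_i = (r_i - mean(r)) / (std(r) + eps).
    Generations above the group mean get positive advantage, below it negative."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled answers to one prompt; a verifier marks two of them correct
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# advantages sum to ~0 within the group: correct answers are pushed up
# exactly in proportion to how much the group's incorrect answers are pushed down
```

Normalizing within the group removes the need for a learned value baseline: the other generations for the same prompt serve as the baseline.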
Catastrophic Forgetting: The tendency of neural networks to lose previously learned information upon learning new information
Mode-seeking: A property of a distributional approximation where it concentrates on one or a few peaks (modes) of the target, ignoring others
Mode-covering: A property where the approximation stretches to cover the entire support of the target distribution, often averaging across modes
Self-SFT: A baseline method where the model is fine-tuned on its own correct generations, sampled once from the initial policy (off-policy, since the data is not refreshed as training updates the policy)
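The Self-SFT data-collection loop can be sketched as follows. The `sample` and `verify` callables here are hypothetical stand-ins (a sampler for the frozen initial policy and a correctness verifier); the point is that the dataset is built once from the initial policy and then used for ordinary SFT, which is what makes the method off-policy.

```python
def collect_self_sft_data(prompts, sample, verify):
    """Build a fixed SFT dataset from the *initial* policy's own generations,
    keeping only those a verifier marks correct. The policy is never updated
    during collection, so later fine-tuning on `data` is off-policy."""
    data = []
    for prompt in prompts:
        for completion in sample(prompt, n=4):  # frozen initial policy
            if verify(prompt, completion):       # keep only correct answers
                data.append((prompt, completion))
    return data  # then: standard SFT on this fixed dataset

# toy usage with stub sampler/verifier
pairs = collect_self_sft_data(
    ["2+2=?"],
    sample=lambda p, n: ["4", "5", "4", "22"],
    verify=lambda p, c: c == "4",
)
```

Contrast with on-policy RL, where each training step draws fresh generations from the current (updated) policy instead of reusing this fixed set.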