SFT: Supervised Fine-Tuning—training the model to maximize the likelihood of ground-truth demonstrations.
RL: Reinforcement Learning—training the model to maximize a reward signal (e.g., correct answer) through exploration.
GRPO: Group Relative Policy Optimization—a memory-efficient RL algorithm that normalizes rewards within a group of outputs for the same prompt, removing the need for a value function.
Entropy: A measure of the randomness or uncertainty in the model's output distribution. High entropy means diverse outputs; low entropy means deterministic outputs.
OOD: Out-of-Distribution—test data that differs significantly from the training distribution.
Mode Collapse: A failure mode in generative models where the model produces only a narrow range of samples (e.g., repeating the same answer across many generations).
KL Divergence: Kullback-Leibler divergence—an asymmetric measure of how one probability distribution differs from a reference distribution (not a true distance metric, since KL(p‖q) ≠ KL(q‖p) in general).
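Three of these quantities can be made concrete in a few lines of Python. This is an illustrative sketch, not any particular library's implementation; the function names and the 1e-8 stabilizer in the advantage normalization are assumptions for the example.

```python
import math

def entropy(probs):
    """Shannon entropy of a distribution: high = diverse/random outputs,
    low = near-deterministic outputs."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """KL(p || q): how much distribution p diverges from reference q.
    Note the asymmetry: KL(p || q) != KL(q || p) in general."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize rewards within a
    group of sampled outputs for the same prompt (mean-center, divide by
    std), so no learned value function is needed as a baseline."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

uniform = [0.25, 0.25, 0.25, 0.25]        # high entropy
peaked = [0.97, 0.01, 0.01, 0.01]         # low entropy (near mode collapse)
high_h, low_h = entropy(uniform), entropy(peaked)
drift = kl_divergence(peaked, uniform)
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # e.g., correct/incorrect rewards
```

With four equiprobable outcomes, entropy hits its maximum of ln 4; the peaked distribution scores far lower, and its KL divergence from uniform is positive. For the group of rewards, correct outputs receive a positive advantage and incorrect ones a negative advantage of equal magnitude.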