SFT: Supervised Fine-Tuning—training a model to mimic a static dataset of correct examples
RL: Reinforcement Learning—training a model to maximize a reward signal by generating its own data (rollouts) and learning from feedback
behavior policy: The policy (or distribution) that generated the static offline dataset used for SFT (denoted as π_beta)
target policy: The policy currently being trained (denoted as π_theta)
importance sampling: A statistical technique used to estimate properties of one distribution while sampling from another, by reweighting samples based on their likelihood ratio (a minimal code sketch follows this list)
OPE: Off-Policy Evaluation—estimating the value or performance of a target policy using data collected by a different behavior policy
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes the policy by comparing a group of outputs for the same input and reinforcing the better ones (the group-relative advantage is sketched below)
Pass@K: A metric measuring the probability that at least one correct answer is generated out of K independent samples (a standard estimator is sketched below)
NLL: Negative Log-Likelihood—the standard loss function used in language modeling and SFT (this loss and the KL term below are sketched in code after this list)
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
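To make the importance sampling and OPE entries concrete, here is a minimal PyTorch sketch, assuming sequence-level log-probabilities under both policies are available; the function name and the ratio clamp are illustrative choices, not something prescribed by the text.

```python
import torch

def importance_weighted_reward(logp_theta, logp_beta, reward, clip=10.0):
    """Off-policy estimate of the target policy's expected reward.

    logp_theta: log-probability of each sampled sequence under pi_theta (target)
    logp_beta:  log-probability of the same sequence under pi_beta (behavior)
    reward:     reward assigned to each sequence
    All three are 1-D tensors of shape (num_samples,).
    """
    # Likelihood ratio w = pi_theta(x) / pi_beta(x), computed in log space for stability;
    # the clamp is a common variance-control heuristic, not part of the basic estimator.
    weights = torch.exp(logp_theta - logp_beta).clamp(max=clip)
    # Reweighted average approximates E_{x ~ pi_theta}[reward(x)]
    # even though x was sampled from pi_beta.
    return (weights * reward).mean()
```

The same reweighting is what OPE relies on: the data comes from π_beta, but the weighted average estimates how π_theta would perform.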
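The "group relative" part of GRPO can be sketched in a few lines: each rollout's reward is normalized against the other rollouts for the same prompt, so only better-than-average outputs receive a positive advantage. This is a simplified sketch of the advantage computation only; full GRPO implementations additionally use a clipped surrogate objective and a KL penalty.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for a group of rollouts on the same prompt.

    rewards: tensor of shape (group_size,), one reward per sampled completion.
    Completions that beat the group mean get positive advantages; worse ones
    get negative advantages, so the update reinforces the better completions.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```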
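Pass@K is commonly estimated with the unbiased combinatorial estimator: generate n ≥ K samples, count the c correct ones, and compute 1 − C(n−c, K)/C(n, K), the probability that a random size-K subset contains at least one correct sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n generated samples of which c are correct."""
    if n - c < k:
        # Every size-k subset of the n samples must contain at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```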
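Finally, the NLL loss used in SFT and the KL divergence can both be written compactly; this is a minimal PyTorch sketch in which the tensor shapes and the ignore_index convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_nll_loss(logits, targets, ignore_index=-100):
    """Token-level NLL, the standard SFT objective on a static dataset.

    logits:  (batch, seq_len, vocab_size) unnormalized model scores
    targets: (batch, seq_len) token ids; ignore_index marks prompt/padding tokens
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=ignore_index,
    )

def kl_divergence(logp_p, logp_q):
    """D_KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x)), over the last dimension."""
    return (torch.exp(logp_p) * (logp_p - logp_q)).sum(dim=-1)
```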