GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing each sampled output's reward against the mean (normalized by the standard deviation) of a group of outputs for the same prompt, avoiding a separate value model
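A minimal sketch of the group-relative advantage computation this entry describes; the function name and reward values are illustrative, not from the text:

```python
# Hypothetical sketch of GRPO-style group-relative advantages.
# Rewards for G sampled outputs of one prompt are normalized against
# the group mean and standard deviation -- no value model is needed.
def group_relative_advantages(rewards, eps=1e-8):
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# E.g. binary correctness rewards for 4 sampled outputs:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# correct outputs get positive advantage, incorrect ones negative
```

Advantages within a group sum to (approximately) zero, so the policy is pushed toward outputs that beat their own group's average.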
Replay Buffer: A storage mechanism that saves past experiences (prompts, outputs, probabilities) to be reused for off-policy training
On-policy: Learning updates computed using data generated by the current version of the model policy
Off-policy: Learning updates computed using data generated by previous versions of the model policy (retrieved from a buffer)
Importance Sampling: A technique to estimate properties of a target distribution using samples from a different distribution, reweighting them by the ratio of their probabilities
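A small sketch of the reweighting idea, using an illustrative discrete target/behavior pair (the distributions and function are assumptions, not from the text):

```python
import random

# Estimate E_p[f(x)] using samples drawn from a different distribution q,
# reweighting each sample by the probability ratio p(x)/q(x).
def importance_sampling_estimate(f, p, q, samples):
    return sum(f(x) * p[x] / q[x] for x in samples) / len(samples)

p = {0: 0.2, 1: 0.8}   # target distribution (e.g. current policy)
q = {0: 0.5, 1: 0.5}   # behavior distribution that generated the data

random.seed(0)
samples = [0 if random.random() < q[0] else 1 for _ in range(100_000)]
est = importance_sampling_estimate(lambda x: x, p, q, samples)
# est approximates E_p[x] = 0.8 despite sampling from q
```

In off-policy RL the same ratio appears as new_prob / old_prob, correcting updates computed from stale buffer data.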
KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from a reference distribution
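A minimal worked example of the definition, for two illustrative discrete distributions over the same support (values are assumptions for demonstration):

```python
import math

# KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions.
# Zero when P == Q; grows as P drifts away from the reference Q.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.3]
q = [0.5, 0.5]
d = kl_divergence(p, q)  # small positive number
```

Note the asymmetry: KL(P || Q) generally differs from KL(Q || P), which is why the direction matters when it is used as a regularizer against a reference policy.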
Pass@1: An evaluation metric measuring the percentage of problems where the model's first generated answer is correct
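The metric reduces to a simple fraction over problems; a sketch with a hypothetical list of per-problem correctness flags:

```python
# Pass@1: fraction (here, percentage) of problems where the single
# sampled answer is correct. `first_attempt_correct[i]` is True if
# the model's first answer to problem i was correct (illustrative data).
def pass_at_1(first_attempt_correct):
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)

score = pass_at_1([True, False, True, True])  # 3 of 4 correct -> 75.0
```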