GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes advantages within a group of outputs sampled from the same prompt, removing the need for a separate value function
sampling compute: The computational cost incurred by generating rollouts from the policy during RL training
rollouts: Complete sequences generated by the model in response to a prompt during the exploration phase of RL
pass@k (best@k): A metric measuring whether at least one of k generated responses is correct (indicates coverage)
worst@k: A metric measuring whether all k generated responses are correct (indicates robustness/sharpening)
IsoCompute: An analysis framework that compares performance across different hyperparameter allocations while keeping the total compute budget constant
KL divergence: A measure of how much one probability distribution differs from another (asymmetric, so not a true distance); used as a penalty to keep the RL policy from drifting too far from the initial reference model
H200-hours: A unit of compute measurement representing one hour of usage on an NVIDIA H200 GPU
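
As a rough illustration of three of the terms above (GRPO, pass@k, worst@k), the sketch below computes group-normalized advantages and the two per-prompt metrics for one group of rollouts. The function names, the `eps` constant, and the choice of population standard deviation are assumptions made for this example, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts from the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    learned value function is needed. eps (hypothetical value) avoids
    division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def pass_at_k(correct):
    """True if at least one of the k responses is correct (coverage)."""
    return any(correct)


def worst_at_k(correct):
    """True only if all k responses are correct (robustness)."""
    return all(correct)


# One prompt, k = 4 rollouts with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
correct = [r > 0.5 for r in rewards]
print(advantages, pass_at_k(correct), worst_at_k(correct))
```

In this toy group, correct rollouts get positive advantages and incorrect ones negative; if every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal.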