GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs sampled from the same input, eliminating the need for a separate value model
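A minimal sketch of the group-relative advantage idea: rewards for a group of outputs sampled from the same prompt are normalized by the group mean and standard deviation, so no learned value model is needed (function name and plain-list types are illustrative, not any library's API).

```python
import statistics

def group_relative_advantages(rewards):
    """Toy GRPO-style advantages: each output's advantage is its
    reward standardized against the other outputs in the same
    group, replacing a learned value-model baseline."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```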
KL loss: Kullback-Leibler divergence loss—a penalty used to keep the trained model's policy close to the reference model to prevent instability
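A hedged sketch of one common way the penalty is computed per token in RLHF codebases, using the nonnegative estimator exp(d) - d - 1 with d = log p_ref - log p_policy; this is one standard choice, not a claim about any specific implementation.

```python
import math

def kl_penalty(policy_logprobs, ref_logprobs):
    """Mean per-token KL penalty between the trained policy and the
    reference model, using the estimator exp(d) - d - 1. The value
    is zero when the two policies agree and grows as they diverge,
    which is what keeps training stable when added to the loss."""
    total = 0.0
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        d = lr - lp
        total += math.exp(d) - d - 1.0
    return total / len(policy_logprobs)
```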
truncation masking: A technique where the advantage score for a response is zeroed out if the response hits the maximum length limit, preventing the model from learning from incomplete outputs
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a variant of GRPO that removes the KL loss and raises the upper clipping ratio ("clip-higher") to encourage exploration and diverse outputs
LiveCodeBench: A benchmark for evaluating code generation models on problems continuously collected from contest platforms such as LeetCode and AtCoder, with recent problems used to reduce training-data contamination
pass@k: A metric measuring the probability that at least one of k generated solutions passes all test cases
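The metric is usually computed with the standard unbiased combinatorial estimator: given n samples of which c pass, pass@k is the probability that at least one of k draws without replacement is correct (a sketch of that formula, not any benchmark's exact code).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), i.e. one
    minus the probability that all k sampled solutions fail, given
    n generated solutions of which c pass all test cases."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```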
O(n^2) attention: The computational complexity of self-attention mechanisms in Transformers, which grows quadratically with sequence length
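A toy illustration of where the quadratic cost comes from: the attention score matrix pairs every query with every key, so a sequence of n tokens produces n*n scores (pure-Python lists here, purely for illustration).

```python
import math

def attention_scores(queries, keys):
    """Naive scaled dot-product attention scores. For n token
    vectors of dimension d, the result is an n-by-n matrix, which
    is the source of the O(n^2) growth with sequence length."""
    d = len(queries[0])
    return [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d)
             for k_row in keys]
            for q_row in queries]
```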
critic reward: The score predicted by a learned value model (the critic), estimating the expected quality of an output; used as a baseline to guide training in actor-critic methods such as PPO