GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs sampled for the same prompt, eliminating the need for a separate critic model to estimate baselines
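The group-relative normalization can be sketched in a few lines; this is a minimal illustration of the idea (function name and the zero-variance handling are my own, not from the source):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's reward against
    the group's mean and std instead of a learned critic baseline.
    `rewards` holds the scores of all rollouts for one prompt."""
    r = np.asarray(rewards, dtype=float)
    std = r.std()
    if std == 0.0:  # all responses equally good/bad -> no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

# e.g. four rollouts for one prompt with binary correctness rewards
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Note that a group whose rewards are all identical produces all-zero advantages, which is exactly the degenerate case the effective ratio below measures.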
Effective Ratio: The proportion of samples in a batch that yield non-zero advantages (i.e., not all responses were correct or all incorrect), contributing meaningful gradients
Rollout: The process of generating a full sequence of tokens (a solution) from the language model policy, which is computationally expensive
Value Model: A neural network that predicts the expected reward of a prompt (a proxy for its difficulty) without generating a full response
PCL: Prompt Curriculum Learning—the proposed method that filters prompts using a value model to focus on intermediate difficulty
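The filtering step can be sketched as follows; the band thresholds and function names here are illustrative assumptions, not values from the source:

```python
def pcl_filter(prompts, value_model, lo=0.25, hi=0.75):
    """Sketch of PCL-style prompt selection (lo/hi thresholds are
    hypothetical): keep prompts whose predicted pass rate falls in an
    intermediate band, so rollouts are likely to yield mixed rewards
    and hence non-zero advantages -- without spending expensive
    rollouts on prompts that are trivially easy or hopelessly hard."""
    return [p for p in prompts if lo <= value_model(p) <= hi]

# toy stand-in for the value model: a lookup of predicted pass rates
preds = {"easy": 0.95, "hard": 0.05, "medium": 0.50}
print(pcl_filter(preds, preds.get))  # ['medium']
```

The design point is that the value model is cheap to query relative to a rollout, so filtering first raises the effective ratio of the batch that is actually generated.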
Sublinear vs Linear scaling: The observation that generation time grows slowly (sublinearly) with batch size at first, because the accelerator's parallelism absorbs the extra work, then grows linearly once its compute is saturated