GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same prompt, removing the need for a separate value network
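A minimal sketch of the group-relative normalization described above: rewards from rollouts sharing a prompt are standardized against their own group's mean and standard deviation, so no learned value network supplies the baseline. The function name and the small epsilon are illustrative choices, not from any specific implementation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's reward by the
    mean and standard deviation of its own group (same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts from one prompt with binary rewards (e.g., pass/fail):
# the passing rollouts get positive advantages, the rest negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```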
advantage: A scalar measuring how much better an action was than a baseline; a positive advantage reinforces the action, a negative one suppresses it
rollout: A complete sequence of text generated by the model in response to a prompt during the training process
sign flip: When a noisy baseline causes the calculated advantage of a trajectory to change from positive to negative (or vice versa) compared to the 'true' advantage, reversing the learning signal
MAD: Median Absolute Deviation—a robust measure of variability used here to normalize advantages instead of standard deviation
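A sketch of how a median/MAD baseline guards against the sign flips defined above, under the assumption that normalization is applied per reward group; the function name and example rewards are illustrative. One outlier reward drags the group mean up so far that the best ordinary rollout gets a negative advantage under mean/std normalization, while the median/MAD baseline keeps its sign positive.

```python
import numpy as np

def mad_advantages(rewards, eps=1e-8):
    """Center on the median and scale by the median absolute deviation;
    a single outlier reward barely moves either statistic, unlike the
    mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    return (r - med) / (mad + eps)

# Three ordinary rewards plus one outlier. Under mean/std the 1.2 rollout
# is "below average" (negative advantage) even though it beat most of its
# group -- a sign flip. Under median/MAD its advantage stays positive.
rewards = np.array([1.0, 1.1, 1.2, 10.0])
mean_adv = (rewards - rewards.mean()) / rewards.std()
mad_adv = mad_advantages(rewards)
```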
clipping: Limiting the size of each policy update (e.g., via the clipped surrogate objective in PPO/GRPO) to prevent the new policy from deviating too far from the old one
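A sketch of the PPO-style clipped surrogate that this entry refers to: taking the minimum of the raw and clipped terms caps how much a single update can push the policy when the probability ratio strays outside [1 - eps, 1 + eps]. The function name is illustrative; eps=0.2 is a commonly cited default, assumed here.

```python
import numpy as np

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate: min of the raw and clipped terms,
    so a large probability ratio cannot amplify the update."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)

# A ratio of 3.0 would triple the update; the clip caps it at 1 + eps.
capped = clipped_objective(3.0, 1.0)
```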
KL regularization: A penalty term that prevents the trained model from drifting too far from its original pre-trained distribution
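One common way to apply this penalty is to fold it into the scalar reward: the summed log-probability ratio between the trained policy and a frozen reference model serves as a simple drift estimate (it equals the KL divergence only in expectation over policy samples), scaled by a weight beta. The function name and beta value are assumptions for illustration.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Subtract a KL penalty from the reward: the summed per-token
    log-ratio estimates drift from the reference model; beta (assumed
    here as 0.05) controls the penalty strength."""
    kl_est = np.sum(np.asarray(logp_policy) - np.asarray(logp_ref))
    return reward - beta * kl_est
```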