GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to estimate advantages without a value network.
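The group-relative advantage can be sketched in a few lines; this is a minimal illustration (function name and epsilon are mine, not from any specific implementation), assuming the standard GRPO normalization of rewards within one group of rollouts:

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    GRPO replaces a learned value network with this group baseline:
    advantage_i = (r_i - mean(group)) / (std(group) + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored by a verifiable reward (1 = correct):
print(group_advantages([1.0, 0.0, 1.0, 0.0]))
```

Rollouts above the group mean get positive advantages and are reinforced; those below are penalized, all without training a critic.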
RLVR: Reinforcement Learning with Verifiable Rewards—a setting where the environment provides a ground-truth signal (e.g., correct math answer) to score generations.
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.
Rollout: A single complete generation (trajectory) produced by the policy model given a prompt.
Replay Learning: A technique where past high-value training samples are stored in a buffer and re-sampled later to reinforce learning.
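A buffer like the one described above can be sketched as follows; the class name, capacity, and reward threshold are illustrative assumptions, not a reference to any particular trainer:

```python
import random

class ReplayBuffer:
    """Minimal sketch of a replay buffer for high-value training samples."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []

    def add(self, sample, reward, threshold=0.5):
        # Store only high-value samples; evict the oldest when full.
        if reward >= threshold:
            if len(self.items) >= self.capacity:
                self.items.pop(0)
            self.items.append(sample)

    def draw(self, k):
        # Re-sample stored trajectories to mix into later training batches.
        return random.sample(self.items, min(k, len(self.items)))
```

Re-mixing stored successes this way reinforces behaviors that verifiable rewards have already confirmed, at the cost of some off-policy drift.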
GSPO: Group Sequence Policy Optimization—a variant of GRPO that computes importance ratios at the sequence level rather than per token.
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a variant of GRPO that decouples the lower and upper clipping ranges and dynamically resamples prompts to avoid groups whose rollouts are all correct or all incorrect.
Test-Time Scaling: The phenomenon where generating more tokens or samples at inference time (e.g., longer reasoning chains) leads to better performance.