GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled outputs rather than using a learned value function
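The group-wise advantage estimation described above can be sketched in a few lines. This is a minimal illustration, not the source's implementation; the function name is invented, and the choice of population standard deviation (with a zero-guard) is an assumption that varies across implementations.

```python
import statistics

def group_relative_advantages(rewards):
    """Estimate advantages by normalizing rewards within one group.

    Instead of a learned value function, each sampled output's
    advantage is its reward standardized against the group:
    A_i = (r_i - mean(r)) / std(r).
    """
    mean = statistics.mean(rewards)
    # population std here; guard against a zero-variance group
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# rewards for four outputs sampled for the same prompt
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Outputs that beat the group average get positive advantages; the rest get negative ones, with no critic network involved.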
iGRPO: Iterative Group Relative Policy Optimization—the proposed method that adds a draft-generation and self-conditioning stage to GRPO
AIME: American Invitational Mathematics Examination—a challenging high-school-level mathematics competition used as a benchmark
bootstrapping: A process where a system improves itself by using its own outputs (e.g., best drafts) as training signals
dynamic self-conditioning: Conditioning the model's generation on its own previous best outputs, which evolve as the model learns
PPO: Proximal Policy Optimization—a standard RL algorithm that limits how much the policy can change in one step to ensure stability
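The "limits how much the policy can change" part of PPO is its clipped surrogate objective, sketched below for a single token. This is a textbook illustration under assumed names (`ratio` is the new-to-old policy probability ratio, `eps` the clip range), not code from the source.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's per-sample surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    When the probability ratio moves outside [1-eps, 1+eps], the
    clipped term caps the objective, so further policy movement in
    that direction earns no extra gradient in this update.
    """
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# a ratio of 1.5 with positive advantage is capped at (1 + eps) * A
print(ppo_clipped_objective(1.5, 2.0))
```

GRPO keeps this clipped-objective machinery but swaps the value-function baseline for the group normalization above.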
rollout: A single complete generation (completion) produced by the model during the RL training process
KL divergence: A statistical measure of how one probability distribution differs from another, used here to prevent the model from drifting too far from its original behavior
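For discrete distributions, the measure defined above is KL(P‖Q) = Σᵢ pᵢ·log(pᵢ/qᵢ). A minimal sketch (the function name is illustrative; terms with pᵢ = 0 are dropped by convention):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions given as lists.

    Zero when the distributions match, and it grows as P drifts
    away from Q -- which is why RL fine-tuning often adds a KL
    penalty against the initial policy to keep the trained model
    close to its original behavior.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # identical: 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # drifted: positive
```

Note that KL is asymmetric: KL(P‖Q) and KL(Q‖P) generally differ, so which distribution plays the role of the reference policy matters.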