GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of sampled outputs for the same input, removing the need for a separate value function critic
GRAE: Group Relative Advantage Estimation—the specific method within GRPO for calculating advantages by normalizing rewards within a group (typically zero-mean)
RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs on tasks where the final answer can be automatically checked (e.g., math, code)
Pass@k: A metric measuring the probability that at least one correct answer is generated among k samples
SFT: Supervised Fine-Tuning—training on labeled examples (demonstrations) before RL
CoT: Chain-of-Thought—intermediate reasoning steps generated by the model before the final answer
A-GRAE: Asymmetric Group Relative Advantage Estimation—the proposed method that modifies GRAE to weight negative samples more heavily and dynamically adjusts difficulty focus
entropy collapse: A reduction in the diversity of the model's outputs, leading to deterministic but potentially suboptimal behavior
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—a GRPO variant that decouples the upper and lower clipping ranges and dynamically filters out prompts whose sampled groups are all-correct or all-incorrect, effectively adjusting the difficulty of training samples
Dr.GRPO: GRPO Done Right—a variant that removes the response-length and reward-standard-deviation normalization terms of standard GRPO to reduce optimization bias
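Two of the quantities defined above (GRAE's within-group normalization and the Pass@k metric) can be made concrete with a short sketch. This is an illustrative implementation, not code from the paper: the function names are hypothetical, and `pass_at_k` uses the standard unbiased combinatorial estimator, 1 − C(n−c, k)/C(n, k), for n samples of which c are correct.

```python
import math
import statistics


def group_relative_advantages(rewards):
    """GRAE sketch: normalize rewards within one group of sampled outputs
    to zero mean (and unit variance, as in standard GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards identical: no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator: probability that at least one of k
    samples is correct, given c correct out of n total samples."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With binary verifiable rewards (as in RLVR), the group rewards are 0s and 1s, so correct outputs receive a positive advantage and incorrect ones a negative advantage of equal total magnitude; A-GRAE's asymmetric weighting breaks exactly this symmetry.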