GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding a separate critic model
RLVR: Reinforcement Learning with Verifiable Rewards—using ground-truth correctness (like math answers) as rewards instead of a learned reward model
DGAE: Difficulty-Balanced Group Advantage Estimation—a component of DGPO that normalizes advantages using Mean Absolute Deviation to ensure constant update magnitude regardless of question difficulty
DQW: Difficulty-Aware Question-Level Weighting—a mechanism in DGPO that assigns higher loss weights to questions with lower average accuracy (harder questions)
MQR: Multi-Aspect Question Reformulation—a data augmentation strategy that rewrites questions to be harder (e.g., more abstract) while preserving the original answer
MAD: Mean Absolute Deviation—average distance between each data point and the mean
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to prevent training instability
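The GRPO, DGAE, and MAD entries above can be combined into a short computation: for a group of sampled outputs to the same question, subtract the group-mean reward and divide by the group's Mean Absolute Deviation. This is a minimal illustrative sketch of that normalization idea only; the function name and structure are assumptions, not any paper's implementation.

```python
def mad_normalized_advantages(rewards):
    """For one question, turn a group's rewards into advantages:
    subtract the group mean, then divide by the group's MAD."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Mean Absolute Deviation: average distance from the mean
    mad = sum(abs(r - mean) for r in rewards) / n
    if mad == 0:  # all rewards identical -> no learning signal for this group
        return [0.0] * n
    return [(r - mean) / mad for r in rewards]

# Example: verifiable 0/1 rewards (as in RLVR) for 4 sampled answers
print(mad_normalized_advantages([1, 0, 0, 1]))  # [1.0, -1.0, -1.0, 1.0]
```

Because MAD is computed per group, the resulting advantages have the same scale whether the question was mostly solved or mostly failed, which is the property the DGAE entry describes.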