On the Hidden Objective Biases of Group-based Reinforcement Learning

Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori
Scuola Superiore Sant’Anna, Pisa; Institute of Informatics and Telematics, National Research Council of Italy, Pisa; Sapienza Università di Roma
arXiv (2026)
RL Reasoning

📝 Paper Summary

Group Relative Policy Optimization (GRPO) · Large Language Model Post-training · Optimizer Dynamics (AdamW)
A unified theoretical analysis of group-based reinforcement learning reveals that surrogate objectives introduce structural biases on shared tokens and interact with AdamW to bypass reward scaling and clipping constraints.
Core Problem
GRPO-style methods achieve empirical success but rely on heuristic surrogate objectives that theoretically diverge from the true reward maximization goal, leading to unexplained biases and instabilities.
Why it matters:
  • Current understanding of GRPO dynamics is fragmented (e.g., unexplained length biases, reward hacking), leading to 'voodoo' hyperparameter tuning
  • Standard reinforcement learning intuitions, such as scaling rewards to stabilize training, fail unexpectedly when combined with AdamW and group-relative advantages
  • The trust region mechanism (clipping) intended to stabilize training is structurally undermined by optimizer momentum, causing silent optimization drift
Concrete Example: When a weighting scheme inversely proportional to length is used (to penalize verbosity), the method implicitly biases the gradients of the *shared prefix* (the prompt and initial tokens) based on the length of the *future* completion, even though the prefix is identical for all outputs.
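The shared-prefix bias in the example above can be illustrated with a toy calculation (hypothetical numbers; `prefix_coefficient` is an illustrative helper, not a function from the paper). Under a 1/|o_i| per-token weighting, a token that is identical across all sampled completions accumulates a gradient coefficient that depends on each completion's *future* length:

```python
# Toy sketch (assumed setup): each completion i contributes weight
# A_i / |o_i| to every one of its tokens, including tokens shared
# verbatim with the other completions in the group.

def prefix_coefficient(advantages, lengths):
    """Aggregated weight a single shared token receives across the group."""
    return sum(a / n for a, n in zip(advantages, lengths))

adv = [1.0, -1.0]                            # group-relative advantages
equal = prefix_coefficient(adv, [10, 10])    # equal lengths: contributions cancel
unequal = prefix_coefficient(adv, [10, 100]) # unequal lengths: residual bias

print(equal)    # 0.0
print(unequal)  # ~0.09: the shared token is pushed toward the shorter completion
```

Even though the shared token is the same string in both completions, its update direction is determined by how long the rest of each completion happens to be.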
Key Novelty
Unified Theoretical Framework for GRPO-style Objectives
  • Formalizes a single surrogate objective equation that encompasses over 10 recent methods (including GRPO, GSPO, and Dr. GRPO) as special cases of weighting and regularization choices
  • Analytically proves that AdamW's adaptive moments effectively cancel out global reward scaling (making scalar tuning mechanisms futile) and drive parameter updates beyond intended clipping boundaries due to momentum overshoot
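The scale-cancellation claim can be checked numerically with a minimal Adam step (a toy sketch, assuming no weight decay and ε = 0 in the denominator; in practice ε is tiny, so the cancellation is near-exact). Because Adam's first moment scales linearly in the gradient and the square root of the second moment scales by the same factor, a global positive rescaling of all gradients (i.e., of the reward) divides out:

```python
# Minimal Adam update (no weight decay, eps = 0) illustrating the claimed
# invariance: multiplying every gradient by a positive constant leaves the
# parameter trajectory unchanged.
import math

def adam_trajectory(grads, lr=1e-3, b1=0.9, b2=0.999, eps=0.0):
    theta, m, v = 0.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # first moment: scales with g
        v = b2 * v + (1 - b2) * g * g      # second moment: scales with g**2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # scale cancels here
    return theta

grads = [0.5, -1.2, 0.3, 2.0]
print(adam_trajectory(grads))                       # baseline
print(adam_trajectory([100.0 * g for g in grads]))  # identical trajectory
```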
Evaluation Highlights
  • Analytically proved that 10 recent group-based methods (e.g., R1, GSPO, GTPO) share a unified form susceptible to systematic gradient biases on shared prefix tokens
  • Established theoretically that under AdamW without regularization (β = 0), multiplying rewards by any positive scalar factor has strictly zero effect on the optimization trajectory
  • Demonstrated that optimizer momentum forces parameters to drift outside the intended clipping region [1 − ε, 1 + ε] during multi-step updates, violating trust region guarantees
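The momentum-overshoot effect in the last highlight can be sketched with a one-parameter toy model (hypothetical dynamics, not the paper's derivation: the importance ratio is treated directly as the optimized parameter, with a PPO-style clipped gradient and a plain first-moment accumulator):

```python
# Toy sketch: once the ratio leaves the clip interval the surrogate
# gradient is zero, yet the accumulated first moment keeps pushing the
# parameter further outside the intended trust region.

def clipped_grad(ratio, advantage, eps=0.2):
    """PPO-style clipped gradient w.r.t. the ratio (zero outside the region)."""
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return advantage

ratio, m, lr, b1 = 1.0, 0.0, 0.1, 0.9
for step in range(10):
    g = clipped_grad(ratio, advantage=1.0)
    m = b1 * m + (1 - b1) * g   # momentum accumulates while g != 0 ...
    ratio += lr * m             # ... and keeps moving ratio after g == 0
    print(step, round(ratio, 4), g)
```

In this run the ratio crosses the 1.2 boundary around step 6; the clipped gradient then drops to zero, but the decaying momentum carries the ratio well past 1.3 over the remaining steps.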
Breakthrough Assessment
8/10
Provides a crucial theoretical foundation for a widely used but poorly understood family of methods (GRPO). The identification of scale invariance and momentum overshoot challenges standard practices.