
Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Qwen Team, Alibaba Inc
arXiv.org (2025)
RL Reasoning

📝 Paper Summary

Reinforcement-learning policy optimization for large language models
GSPO stabilizes RL training for large language models by applying importance sampling and clipping at the sequence level rather than the token level, preventing the accumulation of high-variance noise.
Core Problem
Current RL algorithms such as GRPO exhibit severe instability when training very large models (especially MoE models) on long responses, often causing irreversible model collapse due to an ill-posed token-level objective.
Why it matters:
  • Instability prevents scaling RL to larger models and longer reasoning chains necessary for advanced math/coding tasks
  • Existing methods like GRPO require complex, memory-intensive workarounds like 'Routing Replay' to train MoE models
  • Token-level importance weights in GRPO misapply importance sampling: each weight is estimated from a single sampled token, so it cannot correct the policy mismatch and instead injects high-variance noise that accumulates over long sequences
Concrete Example: In a 48-layer MoE model, ~10% of activated experts change between the old and new policy after a single update. GRPO's token-level weighting fluctuates drastically due to this shift, causing gradients to explode and the model to collapse, whereas GSPO remains stable by looking at the whole sequence.
Key Novelty
Sequence-Level Importance Sampling for RL
  • Shifts the unit of optimization from individual tokens to entire sequences, matching the granularity of the reward signal
  • Calculates importance ratios based on the likelihood of the whole response, ensuring the 'off-policy' correction is mathematically valid
  • Eliminates the need for 'Routing Replay' in MoE training because sequence-level likelihoods are robust to internal expert routing volatility
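The core idea can be sketched in a few lines of plain Python (an illustrative sketch, not the paper's code; function names are mine). GSPO's importance ratio is the length-normalized sequence likelihood ratio, computed in log space as the exponential of the mean per-token log-ratio, so opposite-signed token-level fluctuations cancel instead of compounding:

```python
import math

def token_ratios(logp_new, logp_old):
    """GRPO-style: one importance weight per token (high variance)."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """GSPO-style: one length-normalized ratio for the whole response,
    (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|),
    computed as exp(mean per-token log-ratio)."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

# Two tokens whose log-prob shifts are equal and opposite:
new, old = [-1.0, -2.0], [-1.5, -1.5]
print(token_ratios(new, old))    # two noisy per-token weights, one >1 and one <1
print(sequence_ratio(new, old))  # → 1.0: the fluctuations cancel at the sequence level
```

Under routing volatility in an MoE, per-token probabilities can swing sharply in both directions; the sequence-level ratio averages these swings out, which is why no Routing Replay is needed.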
Evaluation Highlights
  • Achieves superior training efficiency and benchmark performance compared to GRPO on AIME'24 and LiveCodeBench using Qwen3-30B-A3B-Base
  • Inherently stabilizes Mixture-of-Experts (MoE) training without requiring the memory-heavy 'Routing Replay' strategy needed by GRPO
  • Maintains stability even when clipping significantly larger fractions of tokens (two orders of magnitude more than GRPO), proving the noise-reduction capability
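With one ratio per sequence, clipping also operates on whole responses rather than individual tokens. A PPO-style clipped surrogate over a group of responses might look like the following sketch (variable names and the clipping range `eps` are illustrative, not the paper's settings; the paper reports that sequence-level ratios tolerate much more aggressive clipping):

```python
def clipped_sequence_objective(seq_ratios, advantages, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence and
    averaged over a group of responses. seq_ratios are the
    length-normalized sequence-level importance ratios; advantages
    are per-response (e.g. group-normalized) advantages."""
    total = 0.0
    for r, adv in zip(seq_ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)  # clamp the whole-sequence ratio
        total += min(r * adv, clipped * adv)          # pessimistic (PPO-clip) choice
    return total / len(seq_ratios)

# One over-confident response (ratio 1.5, positive advantage) gets its
# contribution clipped to 1.2; the other is inside the trust region.
print(clipped_sequence_objective([1.5, 0.9], [1.0, -1.0]))
```

Because each clipping decision covers an entire response, excluding a noisy sample removes all of its tokens at once instead of leaving a partially weighted sequence in the gradient.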
Breakthrough Assessment
9/10
Identifies a fundamental mathematical flaw in state-of-the-art RL (GRPO) and fixes it with a theoretically grounded sequence-level approach. It solves a critical instability issue for MoE models and simplifies infrastructure.