
Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
Qwen Team, Alibaba Inc
arXiv.org (2025)
RL Reasoning

📝 Paper Summary

Reinforcement-learning policy optimization for large language models
GSPO stabilizes RL training for large language models by applying importance sampling and clipping at the sequence level rather than the token level, preventing the accumulation of high-variance noise.
Core Problem
Current RL algorithms such as GRPO exhibit severe instability when training very large models (especially MoE models) on long responses, often causing irreversible model collapse due to an ill-posed token-level objective.
Why it matters:
  • Instability prevents scaling RL to larger models and longer reasoning chains necessary for advanced math/coding tasks
  • Existing methods like GRPO require complex, memory-intensive workarounds like 'Routing Replay' to train MoE models
  • Token-level importance weights in GRPO misapply importance sampling: each weight is estimated from a single sampled token, so it cannot correct the policy mismatch and instead injects high-variance noise that accumulates over long sequences
Concrete Example: In a 48-layer MoE model, ~10% of activated experts change between the old and new policy after a single update. GRPO's token-level weighting fluctuates drastically due to this shift, causing gradients to explode and the model to collapse, whereas GSPO remains stable by looking at the whole sequence.
Key Novelty
Sequence-Level Importance Sampling for RL
  • Shifts the unit of optimization from individual tokens to entire sequences, matching the granularity of the reward signal
  • Calculates importance ratios based on the likelihood of the whole response, ensuring the 'off-policy' correction is mathematically valid
  • Eliminates the need for 'Routing Replay' in MoE training because sequence-level likelihoods are robust to internal expert routing volatility
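The core idea can be sketched in a few lines of plain Python (an illustrative sketch, not the paper's code; function names are mine). GSPO's importance ratio is the length-normalized sequence likelihood ratio, computed in log space as the exponential of the mean per-token log-ratio, so opposite-signed token-level fluctuations cancel instead of compounding:

```python
import math

def token_ratios(logp_new, logp_old):
    """GRPO-style: one importance weight per token (high variance)."""
    return [math.exp(a - b) for a, b in zip(logp_new, logp_old)]

def sequence_ratio(logp_new, logp_old):
    """GSPO-style: one length-normalized ratio for the whole response,
    (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|),
    computed as exp(mean per-token log-ratio)."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

# Two tokens whose log-prob shifts are equal and opposite:
new, old = [-1.0, -2.0], [-1.5, -1.5]
print(token_ratios(new, old))    # two noisy per-token weights, one >1 and one <1
print(sequence_ratio(new, old))  # → 1.0: the fluctuations cancel at the sequence level
```

Under routing volatility in an MoE, per-token probabilities can swing sharply in both directions; the sequence-level ratio averages these swings out, which is why no Routing Replay is needed.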
Evaluation Highlights
  • Achieves superior training efficiency and benchmark performance compared to GRPO on AIME'24 and LiveCodeBench using Qwen3-30B-A3B-Base
  • Inherently stabilizes Mixture-of-Experts (MoE) training without requiring the memory-heavy 'Routing Replay' strategy needed by GRPO
  • Maintains stability even when clipping significantly larger fractions of tokens (two orders of magnitude more than GRPO), proving the noise-reduction capability
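With one ratio per sequence, clipping also operates on whole responses rather than individual tokens. A PPO-style clipped surrogate over a group of responses might look like the following sketch (variable names and the clipping range `eps` are illustrative, not the paper's settings; the paper reports that sequence-level ratios tolerate much more aggressive clipping):

```python
def clipped_sequence_objective(seq_ratios, advantages, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence and
    averaged over a group of responses. seq_ratios are the
    length-normalized sequence-level importance ratios; advantages
    are per-response (e.g. group-normalized) advantages."""
    total = 0.0
    for r, adv in zip(seq_ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)  # clamp the whole-sequence ratio
        total += min(r * adv, clipped * adv)          # pessimistic (PPO-clip) choice
    return total / len(seq_ratios)

# One over-confident response (ratio 1.5, positive advantage) gets its
# contribution clipped to 1.2; the other is inside the trust region.
print(clipped_sequence_objective([1.5, 0.9], [1.0, -1.0]))
```

Because each clipping decision covers an entire response, excluding a noisy sample removes all of its tokens at once instead of leaving a partially weighted sequence in the gradient.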
Breakthrough Assessment
9/10
Identifies a fundamental mathematical flaw in state-of-the-art RL (GRPO) and fixes it with a theoretically grounded sequence-level approach. It solves a critical instability issue for MoE models and simplifies infrastructure.