GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines by averaging rewards within a group of sampled outputs rather than using a separate value model
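A minimal sketch of the group-relative baseline idea, assuming the common normalization of subtracting the group's mean reward and dividing by its standard deviation (function name and details are illustrative, not GRPO's exact implementation):

```python
def group_relative_advantages(rewards):
    """rewards: scalar rewards for one group of outputs sampled from the same prompt.

    Returns per-output advantages using the group mean (and std) as the
    baseline, instead of a separately trained value model.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard: if all rewards are equal, avoid division by zero
    return [(r - mean) / std for r in rewards]

print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # → [1.0, -1.0, 1.0, -1.0]
```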
Importance Sampling Ratio: The ratio of a token's probability under the current policy to its probability under the old policy; used to correct for off-policy data
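In practice the ratio is computed from log-probabilities to avoid underflow; a small illustrative sketch:

```python
import math

def importance_ratio(logp_new, logp_old):
    """Per-token ratio pi_new(token) / pi_old(token), from log-probabilities."""
    return math.exp(logp_new - logp_old)

# A token twice as likely under the current policy yields a ratio near 2:
print(importance_ratio(math.log(0.4), math.log(0.2)))  # ≈ 2.0
```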
Geometric Mean: A type of mean calculated by multiplying N positive numbers and taking the Nth root; less sensitive to large outliers than the arithmetic mean
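A direct translation of the definition (multiply the N numbers, take the Nth root):

```python
import math

def geometric_mean(xs):
    # Product of the N numbers, then the Nth root.
    return math.prod(xs) ** (1 / len(xs))

print(geometric_mean([1, 10, 100]))  # ≈ 10.0, versus an arithmetic mean of 37.0
```

Note how the outlier 100 pulls the arithmetic mean far more than the geometric mean.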
Pass@1: The percentage of problems for which the model generates a correct answer on its first attempt
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
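For discrete distributions the definition unrolls to a simple sum; a sketch (note the asymmetry: D_KL(p‖q) generally differs from D_KL(q‖p)):

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive; zero only when p == q
```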
PPO: Proximal Policy Optimization—an RL algorithm that clips the importance sampling ratio so that each policy update stays within a small range, ensuring stable training
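The clipping mechanism can be sketched per token, assuming the standard clipped surrogate objective with the conventional epsilon of 0.2:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min of unclipped and clipped terms."""
    unclipped = ratio * advantage
    # Clamp the importance sampling ratio to [1 - eps, 1 + eps].
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Taking the min removes the incentive to move the ratio far from 1.
    return min(unclipped, clipped)

print(ppo_clipped_objective(1.5, 1.0))   # clipped to 1.2 * 1.0 = 1.2
print(ppo_clipped_objective(0.5, -1.0))  # clipped to 0.8 * -1.0 = -0.8
```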
Token Entropy: A measure of the randomness or uncertainty in the model's token predictions; higher entropy generally indicates more exploration
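Entropy over a token distribution is the standard Shannon entropy; a sketch showing that a uniform distribution (maximal exploration) scores higher than a peaked one:

```python
import math

def token_entropy(probs):
    # Shannon entropy in nats: H = -sum_x p(x) * log(p(x)).
    return -sum(p * math.log(p) for p in probs if p > 0)

print(token_entropy([0.25] * 4))              # uniform: ln(4) ≈ 1.386
print(token_entropy([0.97, 0.01, 0.01, 0.01]))  # peaked: much lower
```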
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer