GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning

📝 Paper Summary

Reinforcement Learning for Reasoning; Reinforcement Fine-Tuning (RFT)

GPG simplifies reinforcement learning for reasoning by removing critic models and reference policies, instead using group-level reward normalization and a corrected gradient estimator to achieve state-of-the-art performance.

Core Problem

Existing RL methods like PPO and GRPO are complex and resource-intensive, relying on critic models, reference models, or biased advantage estimators that complicate training and limit scalability.

Why it matters:

PPO requires training separate critic models and maintaining reference models, doubling memory usage and computational cost
GRPO introduces reward bias through its specific normalization strategy and still relies on KL divergence constraints
Simplifying RL is crucial for scaling reasoning capabilities to larger models without incurring prohibitive infrastructure costs

Concrete Example: In GRPO, if a group of sampled outputs contains only correct answers (reward=1) or only wrong answers (reward=0), the standard deviation of the rewards is 0, so the normalization divides by zero unless a heuristic fix is applied. GPG handles this case via a thresholding mechanism and Accurate Gradient Estimation.
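The degenerate case can be reproduced in a few lines. This is a minimal sketch of standard-deviation normalization within one group, not GRPO's actual implementation; the function name `grpo_style_advantages` and the `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev

def grpo_style_advantages(rewards, eps=0.0):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

mixed = [1, 0, 1, 1, 0, 0, 1, 0]   # some correct, some wrong
all_correct = [1] * 8              # degenerate group: identical rewards

# Mixed group: advantages are well defined and nonzero.
print(grpo_style_advantages(mixed))

# Degenerate group: the standard deviation is 0, so the division fails.
try:
    grpo_style_advantages(all_correct)
except ZeroDivisionError:
    print("std is 0: normalization is undefined without a fix")

# A common heuristic (an assumption here) is a small epsilon in the
# denominator, which makes every advantage in the group exactly zero.
print(grpo_style_advantages(all_correct, eps=1e-6))
```

With the epsilon fix the group no longer crashes, but it also contributes no learning signal, which is the situation the scaling correction in GPG is meant to address.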

Key Novelty

Group Policy Gradient (GPG)

Directly optimizes the RL objective without surrogate losses (like PPO's clip) or KL divergence constraints
Eliminates the need for a critic model by using group-based reward normalization as the baseline
Introduces 'Accurate Gradient Estimation' (AGE) to correct bias when samples in a group have identical rewards (all right/wrong), ensuring valid gradient updates

Architecture

Performance comparison bar charts (Top: Math, Bottom: Multimodal) showing GPG's superiority over baselines.

Evaluation Highlights

+14.0 percentage points over the Qwen2.5-Math-7B base model on MATH-500 accuracy (43.7% → 57.7%)
+9.9 percentage points over the GRPO baseline on average across 5 math benchmarks using 7B models (57.7% vs 47.8%)
+16.68 points over GRPO on CV-Bench visual reasoning tasks using Qwen2-VL-2B

Breakthrough Assessment

8/10

Significant simplification of RL pipelines (removing critic/ref models) while outperforming complex baselines like GRPO and PPO across diverse modalities. High practical value for efficient training.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Fine-Tuning (RFT) where a policy generates reasoning chains and answers given a prompt

Inputs: Question/Instruction s

Outputs: Action a (reasoning steps + final answer), receiving reward r

Pipeline Flow

Sampling: Generate G outputs for a single prompt using current policy
Reward Calculation: Score each output (e.g., 1 for correct, 0 for incorrect)
Advantage Estimation: Calculate advantage using group mean as baseline
Gradient Update: Optimize policy using Accurate Gradient Estimation (AGE)
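The first three pipeline steps can be sketched as follows. This is an illustrative toy, not the authors' code: sampling is replaced by a hard-coded list of hypothetical final answers, the exact-match checker `rule_based_reward` is an assumed stand-in for the rule-based reward, and any further normalization of the advantage used in the paper is omitted.

```python
from statistics import mean

def rule_based_reward(predicted: str, gold: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def group_advantages(rewards):
    """Advantage of each output = its reward minus the group mean (the baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Step 1 (stubbed): hypothetical final answers from G = 8 sampled outputs
# for a single prompt. A real pipeline would sample these from the policy.
sampled_answers = ["42", "41", "42", "7", "42", "40", "42 ", "13"]
gold_answer = "42"

# Step 2: score each output with the rule-based checker.
rewards = [rule_based_reward(a, gold_answer) for a in sampled_answers]

# Step 3: subtract the group mean so that no critic model is needed.
advantages = group_advantages(rewards)

print(rewards)
print(advantages)
```

Because the baseline is the group mean, the advantages in a group always sum to zero: outputs that beat the group average are reinforced and the rest are discouraged.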

System Modules

Policy Model

Generate reasoning steps and answers; updated via RL

Model or implementation: DeepSeek-R1-Distill-Qwen-1.5B or Qwen2.5-Math-7B (depending on experiment)

Reward Engine

Assign binary rewards based on answer correctness

Model or implementation: Rule-based checker

Gradient Estimator (AGE)

Compute policy gradients with scaling factor alpha to correct for valid sample count

Model or implementation: Mathematical formula (Equation 7)

Novel Architectural Elements

Removal of Critic and Reference Models: The pipeline relies solely on the active policy and group statistics
Accurate Gradient Estimation (AGE) logic: Dynamic scaling factor alpha = B / (B - M), where B is the batch size and M is the number of samples contributing zero advantage, applied to the loss to compensate for those samples
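As a rough illustration of the scaling factor, the sketch below counts zero-advantage entries in a toy batch. It assumes that each batch entry is one prompt's group of rewards and that an entry counts toward M when all of its rewards are identical; the function name `age_scale` is hypothetical.

```python
def age_scale(group_rewards):
    """Return alpha = B / (B - M) for a batch of reward groups.

    B is the number of groups in the batch; M counts groups whose rewards
    are all identical and therefore contribute zero advantage.
    """
    B = len(group_rewards)
    M = sum(1 for rewards in group_rewards if len(set(rewards)) == 1)
    if B == M:
        return None  # no group carries a learning signal; nothing to rescale
    return B / (B - M)

batch = [
    [1, 0, 1, 0],  # mixed rewards: valid
    [1, 1, 1, 1],  # all correct: zero advantage
    [0, 0, 0, 0],  # all wrong: zero advantage
    [0, 1, 1, 1],  # mixed rewards: valid
]

alpha = age_scale(batch)
print(alpha)  # B = 4, M = 2, so alpha = 4 / (4 - 2) = 2.0
```

Multiplying a batch-averaged loss by alpha is equivalent to averaging over the B - M valid entries only, so the zero-advantage entries no longer dilute the gradient.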

Modeling

Base Model: Qwen2.5-Math-7B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen2-VL-2B (multimodal)

Training Method: Group Policy Gradient (GPG)

Objective Functions:

Purpose: Maximize expected return directly via policy gradient.

Formally: J_GPG = E [ Sum( -log pi(o|q) * A_hat ) ]

Purpose: Correct gradient bias from samples with zero advantage.

Formally: Loss = alpha * Loss_standard, where alpha = B / (B - M)
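Written out, the two objectives fit together as below. This is a sketch under assumed notation (q is the prompt, o_i the i-th of G sampled outputs, r_i its reward), with the advantage taken as the reward minus the group mean and treated as a constant with respect to the policy parameters; any additional normalization used in the paper is omitted. Because of the leading minus sign, the expression is a loss to be minimized, which corresponds to maximizing expected return.

```latex
\begin{align}
\hat{A}_i &= r_i - \frac{1}{G}\sum_{j=1}^{G} r_j \\
\mathcal{J}_{\mathrm{GPG}}(\theta) &= \mathbb{E}\Big[\sum_{i=1}^{G} -\log \pi_\theta(o_i \mid q)\,\hat{A}_i\Big] \\
\nabla_\theta \mathcal{J}_{\mathrm{GPG}}(\theta) &= -\,\mathbb{E}\Big[\sum_{i=1}^{G} \hat{A}_i\,\nabla_\theta \log \pi_\theta(o_i \mid q)\Big] \\
\mathcal{L}(\theta) &= \alpha\,\mathcal{J}_{\mathrm{GPG}}(\theta), \qquad \alpha = \frac{B}{B - M}
\end{align}
```

A gradient descent step on this loss raises the log-probability of outputs with positive advantage and lowers it for outputs with negative advantage; alpha rescales the step when M of the B batch entries contribute nothing.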

Key Hyperparameters:

group_size: 8
learning_rate: Not explicitly reported in the paper
beta_th: 1.67 (threshold for valid sample partition, calculated as 1/0.6)
kl_coefficient: 0 (No KL constraint used)
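The relation between beta_th and the valid-sample fraction can be checked numerically. The sketch below assumes the threshold is an upper bound on alpha = B / (B - M), which is the same as requiring at least about 60% of the batch to be valid; the comparison direction and the helper names are assumptions, and what the training loop does when the check fails is not specified here.

```python
BETA_TH = 1.67  # reported threshold, roughly 1 / 0.6

def valid_fraction(B: int, M: int) -> float:
    """Share of batch entries that carry a nonzero advantage."""
    return (B - M) / B

def passes_threshold(B: int, M: int, beta_th: float = BETA_TH) -> bool:
    """True when alpha = B / (B - M) does not exceed beta_th (assumed rule)."""
    if M >= B:
        return False  # no valid entries at all
    return B / (B - M) <= beta_th

# 6 of 8 entries valid: alpha is about 1.33, within the threshold.
print(valid_fraction(8, 2), passes_threshold(8, 2))

# 4 of 8 entries valid: alpha is 2.0, above the threshold.
print(valid_fraction(8, 4), passes_threshold(8, 4))
```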

Compute: Trained on NVIDIA H20 GPUs. GPG reduces memory usage significantly (e.g., 24G vs 28G+ for DAPO-7B) and training cost (0.45x vs DAPO).

Comparison to Prior Work

vs. GRPO: GPG removes KL constraint, removes 'clip' term, and corrects gradient bias via AGE [cited in paper]
vs. PPO: GPG removes critic and reference models entirely, reducing memory and complexity [cited in paper]
vs. DAPO: GPG achieves higher performance (57.7% vs 56.0%) with 0.45x training cost and simpler implementation [cited in paper]
vs. Dr. GRPO: GPG outperforms Dr. GRPO's reported results (47.8% vs 43.7% on Math-7B avg) despite Dr. GRPO's focus on reward bias [cited in paper]

Limitations

Not evaluated on extremely large models (e.g., 70B+) due to computational constraints
Relies on tasks where rewards are well-defined (math/reasoning); applicability to open-ended creative generation is less clear
Requires a minimum share of valid samples (groups with mixed rewards); this is handled by thresholding but may be inefficient when the model answers nearly every prompt entirely correctly or entirely incorrectly

Reproducibility

Code: https://github.com/AMAP-ML/GPG

Uses open datasets (MATH-lighteval, Open-S1, GEOQA, etc.). Implementation details such as the AGE formula and thresholding are given explicitly in Algorithm 1.

📊 Experiments & Results

Evaluation Setup

Zero-shot pass@1 evaluation on mathematical and multimodal reasoning benchmarks

Benchmarks:

AIME24 (Mathematical Reasoning)
MATH-500 (Mathematical Reasoning)
Minerva (Mathematical Reasoning)
CV-Bench (Visual Reasoning)
GEOQA (Geometry Reasoning)

Metrics:

Accuracy (Pass@1)
Average Score
Statistical methodology: Not explicitly reported in the paper
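For reference, pass@1 accuracy and an average score can be computed as in the sketch below. It assumes exact-match grading and an unweighted mean across benchmarks; the benchmark names and answers are placeholders, not data from the paper.

```python
def pass_at_1(predictions, references):
    """Percentage of problems whose single sampled answer matches the reference."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def average_score(per_benchmark):
    """Unweighted mean of per-benchmark accuracies (an assumed convention)."""
    return sum(per_benchmark.values()) / len(per_benchmark)

# Placeholder data: one answer per problem, as in zero-shot pass@1.
scores = {
    "bench_a": pass_at_1(["3", "5", "8", "1"], ["3", "5", "9", "1"]),
    "bench_b": pass_at_1(["x", "y"], ["x", "z"]),
}

print(scores)
print(average_score(scores))
```

An unweighted mean gives every benchmark equal influence regardless of size, so a small benchmark such as a 30-problem contest set can move the average as much as a 500-problem one.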

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Unimodal mathematical reasoning results using 7B models show GPG consistently outperforming baselines.
Average (5 Math Datasets)	Average Accuracy	30.9	57.7	+26.8
Average (5 Math Datasets)	Average Accuracy	46.6	57.7	+11.1
Average (5 Math Datasets)	Average Accuracy	51.4	57.7	+6.3
Multimodal reasoning results demonstrate generalization to visual domains.
CV-Bench	Total Score	59.47	76.15	+16.68
GEOQA	Accuracy	47.48	51.33	+3.85
Ablation study on GPG components validates the need for Accurate Gradient Estimation (AGE).
Average (Math Datasets)	Average Accuracy	43.9	47.8	+3.9
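The Δ column is the absolute difference in points between the two scores, not a relative change. A quick consistency check over the rows listed above, with the helper name `check_deltas` being illustrative:

```python
# (row label, baseline, this paper, reported delta), copied from the table.
rows = [
    ("Average (5 Math Datasets)", 30.9, 57.7, 26.8),
    ("Average (5 Math Datasets)", 46.6, 57.7, 11.1),
    ("Average (5 Math Datasets)", 51.4, 57.7, 6.3),
    ("CV-Bench", 59.47, 76.15, 16.68),
    ("GEOQA", 47.48, 51.33, 3.85),
    ("Average (Math Datasets)", 43.9, 47.8, 3.9),
]

def check_deltas(table, tol=1e-9):
    """Return the rows whose reported delta differs from (paper - baseline)."""
    return [
        row for row in table
        if abs((row[2] - row[1]) - row[3]) > tol
    ]

mismatches = check_deltas(rows)
print(mismatches)  # an empty list means every delta is an absolute difference
```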

Experiment Figures

Analysis of sample difficulty and reward variance over training steps.

Main Takeaways

GPG consistently outperforms GRPO and PPO across both unimodal (math) and multimodal (vision) tasks without needing critic models.
Accurate Gradient Estimation (AGE) is critical; simply removing the critic without correcting for gradient bias yields suboptimal results.
The method is highly resource-efficient, achieving SOTA results with ~45% of the training cost of comparable methods like DAPO.
Removing KL divergence constraints did not harm performance, contrary to common RLHF practices; in fact, adding constraints hurt performance in ablations.

📚 Prerequisite Knowledge

Prerequisites

Policy Gradient Theorem
Reinforcement Learning (PPO, TRPO)
Large Language Models (LLMs)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing outputs within a group of samples for the same prompt, removing the critic

PPO: Proximal Policy Optimization—an RL algorithm using a clipped surrogate objective and a critic model to stabilize training

Critic Model: A separate neural network in RL that estimates the value (expected return) of a state, used to reduce variance in gradient estimation

Reference Model: A frozen copy of the initial policy used in RL (via KL divergence penalty) to prevent the trained model from drifting too far from its original behavior

KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from another

SFT: Supervised Fine-Tuning—training on labeled input-output pairs

Advantage Function: A function quantifying how much better a specific action is compared to the average action in that state

AGE: Accurate Gradient Estimation—a technique proposed here to correct gradient scaling when a subset of samples in a group yields zero advantage (all same reward)