NGRPO: Negative-enhanced Group Relative Policy Optimization

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR)
Mathematical Reasoning

NGRPO enables models to learn from homogeneously incorrect groups by introducing a virtual maximum-reward sample to generate negative advantages, stabilized by asymmetric clipping of the objective.

Core Problem

GRPO fails to learn from response groups where all answers are incorrect (homogeneous errors) because the zero variance in rewards leads to zero advantages and null gradients.

Why it matters:

Models miss valuable learning signals from collective failures, causing them to abandon difficult problems rather than exploring new solutions
Standard GRPO wastes training data for tasks with high difficulty (many all-wrong groups) or low difficulty (many all-correct groups)
Fixed-penalty alternatives like PSR-NSR can lead to training collapse due to aggressive, un-normalized negative advantages

Concrete Example: In a group where a model generates 8 responses and all are incorrect (0% accuracy), standard GRPO calculates a mean reward equal to each individual reward. The advantage for every sample becomes zero, resulting in no policy update despite the complete failure.
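The zero-advantage effect can be checked numerically. The sketch below is a minimal illustration rather than the paper's code: it assumes binary rewards (0 for incorrect, 1 for correct), the sample standard deviation, and a small stabilizing constant `eps` in the denominator. None of these choices is specified in this summary.

```python
from statistics import fmean, stdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    `eps` is an assumed stabilizer that avoids division by zero
    when every reward in the group is identical.
    """
    mu = fmean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


all_wrong = [0.0] * 8    # eight responses, all incorrect
all_correct = [1.0] * 8  # eight responses, all correct

# Both homogeneous groups yield a zero advantage for every response,
# so neither contributes a policy-gradient signal.
print(grpo_advantages(all_wrong))
print(grpo_advantages(all_correct))
```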

Key Novelty

Negative-enhanced Group Relative Policy Optimization (NGRPO)

Introduces 'Advantage Calibration' by adding a virtual maximum-reward sample to the group statistics. This ensures the mean reward is higher than any incorrect response in an all-wrong group, forcing a negative advantage.
Employs 'Asymmetric Clipping' in the PPO objective, applying stricter clipping to negative advantages and looser clipping to positive ones. This counteracts the strong exploration pressure created by the persistent negative bias of the virtual sample.
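As a rough illustration of Advantage Calibration, the sketch below appends one virtual reward to the group before computing the mean and standard deviation, then scores only the real responses. It assumes binary rewards with `r_max = 1.0`, the sample standard deviation, and a small `eps` stabilizer. The function name `calibrated_advantages` is hypothetical and not taken from the paper's repository.

```python
from statistics import fmean, stdev


def calibrated_advantages(rewards, r_max=1.0, eps=1e-6):
    """Advantages normalized against the group plus one virtual r_max sample.

    Assumptions (not stated in the summary): binary rewards,
    sample standard deviation, and an eps stabilizer.
    """
    augmented = list(rewards) + [r_max]  # virtual sample affects statistics only
    mu = fmean(augmented)
    sigma = stdev(augmented)
    # Advantages are returned for the real responses, not the virtual one.
    return [(r - mu) / (sigma + eps) for r in rewards]


all_wrong = [0.0] * 8
advantages = calibrated_advantages(all_wrong)
print([round(a, 3) for a in advantages])  # each entry is about -0.333
```

Under these assumptions the augmented mean is 1/9 and the standard deviation is 1/3, so every incorrect response receives an advantage of roughly -1/3 instead of zero.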

Architecture

Overview of the NGRPO framework, illustrating the Advantage Calibration and Asymmetric Clipping modules within the RL pipeline.

Evaluation Highlights

Demonstrates superior Pass@k AUC on AIME2025, a highly challenging benchmark, indicating balanced improvement in accuracy and exploration
Prevents entropy collapse during training compared to PPO and GRPO, maintaining robust exploration without the instability seen in PSR-NSR
Significantly alters the advantage landscape: in low-accuracy groups, it dampens positive advantages (e.g., from 2.47 to 1.76) and strengthens penalties for errors (e.g., from -0.35 to -0.50) to drive exploration

Breakthrough Assessment

8/10

Addresses a fundamental flaw in GRPO (learning from failure) with a mathematically grounded, simple solution. State-of-the-art results on hard math benchmarks validate the approach.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning with Verifiable Rewards (RLVR) for mathematical reasoning

Inputs: Math problem prompt

Outputs: Reasoning trace and final answer

Pipeline Flow

Policy Model (generates group of G responses)
Reward Calculation (verifies answers)
Advantage Calibration (adds virtual r_max sample)
NGRPO Update (applies asymmetric clipping)
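The first two pipeline stages can be sketched with a toy verifier. Everything here is hypothetical: the summary does not describe how answers are extracted or compared, so the example assumes the final answer appears in the last `\boxed{...}` of the reasoning trace and that the reward is 1.0 on an exact string match and 0.0 otherwise.

```python
import re


def extract_final_answer(response):
    """Return the content of the last \\boxed{...}, or None if absent.

    The boxed-answer convention is an assumption made for this sketch.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def verify(response, ground_truth):
    """Binary verifiable reward: 1.0 on exact match, else 0.0 (assumed scheme)."""
    return 1.0 if extract_final_answer(response) == ground_truth.strip() else 0.0


# A made-up group of sampled responses for one prompt with answer "42".
group = ["... so \\boxed{41}", "... hence \\boxed{40}", "no final answer"]
rewards = [verify(r, "42") for r in group]
print(rewards)  # [0.0, 0.0, 0.0]

# An all-wrong group like this one is where Advantage Calibration applies.
is_all_wrong = max(rewards) == 0.0
print(is_all_wrong)  # True
```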

System Modules

Policy Model

Generate reasoning paths and answers for prompts

Model or implementation: Qwen2.5-Math-7B

Advantage Estimator

Calculate relative advantages for the group using the virtual sample

Model or implementation: Mathematical function (Eq 6)

Novel Architectural Elements

Injection of a virtual maximum-reward sample into the advantage normalization step (purely algorithmic change to the RL update rule)

Modeling

Base Model: Qwen2.5-Math-7B

Training Method: NGRPO (Negative-enhanced Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy while learning from failures and maintaining stability.

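Formally: E[min(r_t A'_t, clip(r_t, 1-ε_neg, 1+ε_pos) A'_t)] - β D_KL, where r_t is the token-level probability ratio between the current and old policy, A'_t is the calibrated advantage, and β weights the KL penalty.

Purpose: Calibrate advantage to allow learning from homogeneous errors.

Formally: A'_i = (r_i - mean(R ∪ {r_max})) / std(R ∪ {r_max}), where r_i is the reward of response i (not the ratio r_t above), R is the set of rewards in the group, and r_max is the maximum attainable reward.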
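The clipped term of the objective can be illustrated per token in a few lines. This is a sketch under assumptions, not the authors' implementation: it uses scalar inputs, omits the expectation and the KL penalty, and takes the default clipping values from the hyperparameters listed in this summary (0.24 for positive, 0.16 for negative). The function name `ngrpo_surrogate` is hypothetical.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def ngrpo_surrogate(ratio, advantage, eps_pos=0.24, eps_neg=0.16):
    """Per-token surrogate: min(r * A, clip(r, 1 - eps_neg, 1 + eps_pos) * A).

    The KL penalty and the averaging over tokens are left out of this sketch.
    """
    clipped_ratio = clip(ratio, 1.0 - eps_neg, 1.0 + eps_pos)
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the upper bound 1 + eps_pos caps the ratio at 1.24.
print(ngrpo_surrogate(1.5, 1.0))
# Negative advantage: the lower bound 1 - eps_neg caps the ratio at 0.84.
print(ngrpo_surrogate(0.5, -1.0))
# Negative advantage with a large ratio stays unclipped, as in standard PPO.
print(ngrpo_surrogate(1.5, -1.0))
```

Whenever the clipped branch is selected, the surrogate no longer depends on the ratio, so that token contributes no gradient. With a smaller `eps_neg`, penalized tokens reach this cutoff sooner than rewarded ones, which is the sense in which clipping is stricter for negative advantages.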

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 1024 (global)
epochs: 20
clip_epsilon_positive: 0.24
clip_epsilon_negative: 0.16
temperature: 0.6
top_p: 0.95

Compute: Cluster of 8 NVIDIA H100 GPUs

Comparison to Prior Work

vs. GRPO: NGRPO learns from all-wrong groups via virtual sample injection
vs. DAPO: NGRPO utilizes homogeneous groups for exploration rather than discarding them
vs. PSR-NSR: NGRPO uses adaptive normalized advantages derived from group stats rather than fixed hard-coded values
vs. PPO: NGRPO is critic-less and uses group-based relative advantages (a property inherited from GRPO)

Limitations

Requires verifiable rewards (ground truth answers), limiting applicability to open-ended tasks
Introduces persistent negative bias in advantages, requiring careful tuning of clipping parameters to prevent instability
Computational overhead of generating multiple samples per prompt (inherited from GRPO)

Reproducibility

Code: https://github.com/nangongrui-ngr/NGRPO

Code is publicly available. Hyperparameters for training and evaluation are explicitly detailed. Benchmark datasets (MATH, AMC23, AIME2025) are public. Exact Pass@k AUC scores are reported in the paper's figures and tables but are not reproduced in this summary, so exact performance deltas are not listed here.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on hard benchmarks

Benchmarks:

MATH500 (Mathematical problem solving)
AMC23 (Competition math, AMC 2023)
AIME2025 (High-difficulty competition math, AIME 2025)

Metrics:

Pass@k AUC (Area Under Curve for k=1 to 256)
Statistical methodology: Not explicitly reported in the paper

Key Results

Advantage Analysis: These results quantify how NGRPO alters the learning signal compared to GRPO in specific group scenarios, demonstrating the mechanism's adaptive behavior.

Benchmark	Metric	Baseline (GRPO)	This Paper (NGRPO)	Δ
N/A (Analytical Case Study)	Advantage (Correct Sample)	2.47	1.76	-0.71
N/A (Analytical Case Study)	Advantage (Incorrect Sample)	-0.35	-0.50	-0.15
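The case-study values above can be reproduced with a short calculation. The summary does not state the group composition or the standard-deviation estimator, so the setup below is an assumption: a group of 8 responses with exactly one correct, binary rewards, and the sample standard deviation. Under those assumptions the rounded results match the reported numbers.

```python
from statistics import fmean, stdev


def group_advantages(rewards, virtual_reward=None):
    """Normalize rewards by group mean and sample std.

    If `virtual_reward` is given, it is added to the statistics only
    (Advantage Calibration); otherwise this is plain GRPO normalization.
    """
    pool = list(rewards)
    if virtual_reward is not None:
        pool.append(virtual_reward)
    mu = fmean(pool)
    sigma = stdev(pool)  # sample standard deviation (assumption)
    return [(r - mu) / sigma for r in rewards]


rewards = [1.0] + [0.0] * 7  # assumed low-accuracy group: 1 correct out of 8

grpo = group_advantages(rewards)
ngrpo = group_advantages(rewards, virtual_reward=1.0)

print(round(grpo[0], 2), round(grpo[1], 2))    # 2.47 -0.35
print(round(ngrpo[0], 2), round(ngrpo[1], 2))  # 1.76 -0.5
```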

Experiment Figures

Comparison of advantage values assigned by GRPO vs. NGRPO across three scenarios: Homogeneous-Incorrect, Low-Accuracy Mixed, and High-Accuracy Mixed.

Dynamics of policy entropy during training for NGRPO and baselines.

Main Takeaways

NGRPO achieves state-of-the-art Pass@k AUC on AIME2025, outperforming GRPO and DAPO, which struggle with the frequent homogeneous failures on this high-difficulty benchmark.
Ablation studies show that the Virtual Maximum-Reward Sample provides a substantial standalone boost, while Asymmetric Clipping offers marginal gains alone but works synergistically with the virtual sample to maximize performance.
Entropy analysis reveals that NGRPO maintains stable, converging entropy levels, whereas PSR-NSR exhibits high, unstable entropy (aggressive exploration) and GRPO/DAPO show declining entropy (loss of exploration).
The method is effective specifically because it converts homogeneous errors into gradients; methods that filter these out (DAPO) hit a performance ceiling on hard tasks.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Proximal Policy Optimization (PPO)
Group Relative Policy Optimization (GRPO)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of sampled responses, eliminating the need for a critic model

RLVR: Reinforcement Learning with Verifiable Rewards—using objective ground-truth checks (like correct math answers) as reward signals

Homogeneous group: A group of responses sampled for the same prompt that are either all correct or all incorrect, resulting in zero reward variance within the group

Advantage Calibration: NGRPO's method of adding a virtual sample with maximum possible reward to the group statistics to prevent zero variance and ensure negative advantages for failures

Asymmetric Clipping: Modifying the PPO objective to use different clipping ranges (epsilon) for positive and negative advantages to balance exploration and stability

Pass@k: A metric estimating the probability that at least one correct answer exists within k generated samples

AUC: Area Under the Curve—here used to aggregate Pass@k performance across values of k from 1 to 256
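For readers who want to compute these two metrics, the sketch below uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper uses this estimator, and how it integrates the curve, is not stated in this summary. The AUC here is an assumed scheme: the trapezoidal rule over log2(k) on a power-of-two grid from 1 to 256, normalized to the range [0, 1].

```python
import math


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, giving Pass@k = 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_at_k_auc(n, c, ks=(1, 2, 4, 8, 16, 32, 64, 128, 256)):
    """Normalized area under the Pass@k curve over log2(k) (assumed scheme)."""
    xs = [math.log2(k) for k in ks]
    ys = [pass_at_k(n, c, k) for k in ks]
    area = sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(ks) - 1)
    )
    return area / (xs[-1] - xs[0])


print(pass_at_k(4, 1, 2))       # 0.5
print(pass_at_k_auc(256, 0))    # 0.0 (never solved)
print(pass_at_k_auc(256, 256))  # 1.0 (always solved)
```

These functions score a single problem. A benchmark-level figure would typically average the per-problem values.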