GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same prompt, removing the need for a separate value network
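A minimal sketch of the group-relative normalization described above: rewards from rollouts sharing a prompt are standardized against their own group's mean and standard deviation, so no learned value network supplies the baseline. The function name and the small epsilon are illustrative choices, not from any specific implementation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalize each rollout's reward by the
    mean and standard deviation of its own group (same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts from one prompt with binary rewards (e.g., pass/fail):
# the passing rollouts get positive advantages, the rest negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```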
advantage: A scalar measuring how much better an action was than a baseline; a positive advantage reinforces the action, a negative one suppresses it
rollout: A complete sequence of text generated by the model in response to a prompt during the training process
sign flip: When a noisy baseline causes the calculated advantage of a trajectory to change from positive to negative (or vice versa) compared to the 'true' advantage, reversing the learning signal
MAD: Median Absolute Deviation—a robust measure of variability used here to normalize advantages instead of standard deviation
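A sketch of how a median/MAD baseline guards against the sign flips defined above, under the assumption that normalization is applied per reward group; the function name and example rewards are illustrative. One outlier reward drags the group mean up so far that the best ordinary rollout gets a negative advantage under mean/std normalization, while the median/MAD baseline keeps its sign positive.

```python
import numpy as np

def mad_advantages(rewards, eps=1e-8):
    """Center on the median and scale by the median absolute deviation;
    a single outlier reward barely moves either statistic, unlike the
    mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    med = np.median(r)
    mad = np.median(np.abs(r - med))
    return (r - med) / (mad + eps)

# Three ordinary rewards plus one outlier. Under mean/std the 1.2 rollout
# is "below average" (negative advantage) even though it beat most of its
# group -- a sign flip. Under median/MAD its advantage stays positive.
rewards = np.array([1.0, 1.1, 1.2, 10.0])
mean_adv = (rewards - rewards.mean()) / rewards.std()
mad_adv = mad_advantages(rewards)
```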
clipping: Limiting the size of each policy update (e.g., via the clipped surrogate objective in PPO/GRPO) to prevent the new policy from deviating too far from the old one
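A sketch of the PPO-style clipped surrogate that this entry refers to: taking the minimum of the raw and clipped terms caps how much a single update can push the policy when the probability ratio strays outside [1 - eps, 1 + eps]. The function name is illustrative; eps=0.2 is a commonly cited default, assumed here.

```python
import numpy as np

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate: min of the raw and clipped terms,
    so a large probability ratio cannot amplify the update."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(ratio * advantage, clipped)

# A ratio of 3.0 would triple the update; the clip caps it at 1 + eps.
capped = clipped_objective(3.0, 1.0)
```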
KL regularization: A penalty term that prevents the trained model from drifting too far from its original pre-trained distribution
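One common way to apply this penalty is to fold it into the scalar reward: the summed log-probability ratio between the trained policy and a frozen reference model serves as a simple drift estimate (it equals the KL divergence only in expectation over policy samples), scaled by a weight beta. The function name and beta value are assumptions for illustration.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.05):
    """Subtract a KL penalty from the reward: the summed per-token
    log-ratio estimates drift from the reference model; beta (assumed
    here as 0.05) controls the penalty strength."""
    kl_est = np.sum(np.asarray(logp_policy) - np.asarray(logp_ref))
    return reward - beta * kl_est
```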