Understanding R1-Zero-Like Training: A Critical Perspective

📝 Paper Summary

LLM Post-training | Reinforcement Learning for Reasoning | Large Language Model Analysis

R1-Zero-like training relies on base models that already possess reasoning capabilities and requires correcting an optimization bias in GRPO that artificially promotes long, incorrect responses.

Core Problem

Replicating R1-Zero is challenging due to misconceptions about base model capabilities (e.g., 'Aha moments' are assumed to be purely RL-emergent) and optimization biases in the popular GRPO algorithm.

Why it matters:

The 'scaling of test-time compute' via reinforcement learning is a key frontier in LLM reasoning, but the mechanisms are poorly understood
The standard GRPO algorithm introduces a length bias that causes models to generate progressively longer incorrect responses (overthinking) without improving accuracy
Misinterpreting base model capabilities leads to incorrect attribution of performance gains to RL rather than pretraining

Concrete Example: When training with standard GRPO, a model's incorrect responses grow longer because the loss divides each response's summed token loss by its length, so every token of a long incorrect answer is penalized less than every token of a short one. Dr. GRPO removes this bias, stopping the 'wild' growth of incorrect reasoning chains.
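The arithmetic behind this bias can be checked with a small sketch. It assumes a sequence-level advantage of -1 for an incorrect response and compares the weight each token receives with and without the 1/|o| normalization. The function name, lengths, and advantage value are illustrative assumptions, not taken from the paper.

```python
def per_token_weight(advantage: float, length: int, length_normalized: bool) -> float:
    """Weight applied to each token's gradient term for a single response."""
    return advantage / length if length_normalized else advantage


short_len, long_len = 100, 1000   # illustrative response lengths (tokens)
adv_wrong = -1.0                  # illustrative advantage of an incorrect response

# With length normalization (GRPO-style), a longer wrong answer is
# penalized less per token than a shorter one.
grpo_short = per_token_weight(adv_wrong, short_len, length_normalized=True)
grpo_long = per_token_weight(adv_wrong, long_len, length_normalized=True)

# Without it (Dr. GRPO-style), every token carries the same weight
# regardless of how long the response is.
dr_short = per_token_weight(adv_wrong, short_len, length_normalized=False)
dr_long = per_token_weight(adv_wrong, long_len, length_normalized=False)

print(grpo_short, grpo_long, dr_short, dr_long)
```

Under these assumptions the normalized per-token penalty shrinks by a factor of ten when the wrong answer becomes ten times longer, which is the incentive toward verbose failures described above.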

Key Novelty

Dr. GRPO (GRPO Done Right) and Critical Base Model Analysis

Identifies that the standard GRPO objective effectively penalizes long incorrect responses less than short ones (due to length normalization), creating an artificial incentive for verbose failures
Proposes Dr. GRPO, which removes length and standard deviation normalization terms to recover an unbiased PPO-like objective while maintaining memory efficiency
Demonstrates that 'Aha moments' (self-correction) and strong reasoning are already present in base models like DeepSeek-V3-Base and Qwen2.5-Math, contradicting the belief they emerge solely from RL

Architecture

Figure: Illustration of the optimization bias in GRPO vs. the unbiased objective

Evaluation Highlights

Achieves 43.3% accuracy on AIME 2024 with a 7B model (Qwen2.5-Math-7B) using the proposed minimalist recipe, establishing a new state-of-the-art for this size
Removing prompt templates improves Qwen2.5-Math-7B performance by +30.5 points (average across 5 benchmarks) compared to 4-shot prompting, suggesting the model was pretrained in an SFT-like manner
Dr. GRPO significantly improves token efficiency compared to vanilla GRPO, preventing the explosion of response length for incorrect outputs while maintaining accuracy

Breakthrough Assessment

8/10

Provides critical, demystifying insights into R1-Zero replication. Identifies a fundamental flaw in a widely used algorithm (GRPO) and offers a simpler, unbiased fix (Dr. GRPO) yielding SOTA results.

⚙️ Technical Details

Problem Definition

Setting: Token-level Markov Decision Process (MDP) for reasoning tasks

Inputs: Math question q

Outputs: Reasoning chain and final answer o

Pipeline Flow

Input Question
Base Model Policy (Generation)
Reward Verification
Policy Update (Dr. GRPO)
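The four stages above can be sketched as a single rollout step. This is a minimal illustration, not the authors' implementation: `rollout_step`, `toy_policy`, and `toy_reward` are hypothetical names, the "policy" is a canned-answer stub rather than a language model, and the actual gradient update is left out.

```python
from itertools import cycle
from statistics import mean
from typing import Callable, List, Tuple


def rollout_step(
    question: str,
    gold: str,
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    group_size: int = 4,
) -> Tuple[List[str], List[float], List[float]]:
    """Sample a group of responses, score them, and mean-center the rewards."""
    responses = [sample_fn(question) for _ in range(group_size)]   # generation
    rewards = [reward_fn(resp, gold) for resp in responses]        # verification
    baseline = mean(rewards)                                       # group baseline
    advantages = [r - baseline for r in rewards]                   # no std division
    return responses, rewards, advantages


# Hypothetical stand-ins: a stub policy that alternates two canned answers,
# and a reward that checks whether the response ends with the gold answer.
_canned = cycle(["The answer is 4", "The answer is 5"])


def toy_policy(question: str) -> str:
    return next(_canned)


def toy_reward(response: str, gold: str) -> float:
    return 1.0 if response.endswith(gold) else 0.0


responses, rewards, advantages = rollout_step(
    "What is 2 + 2?", "4", toy_policy, toy_reward
)
print(rewards, advantages)
```

In a real training loop the advantages would feed a PPO-style clipped policy-gradient update on the sampled tokens; that step depends on the training framework and is deliberately omitted here.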

System Modules

Base Model

Generates the reasoning chain and final answer

Model or implementation: Qwen2.5-Math-7B (also tested 1.5B, Llama-3.2-3B, DeepSeek-V3-Base)

Reward Verifier

Determines correctness of the final answer

Model or implementation: Rule-based (Math-Verify)
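To make the idea of a rule-based outcome reward concrete, here is a deliberately simple stand-in. It is not Math-Verify and does not use its API: it only extracts the last `\boxed{...}` span and compares it to the reference after stripping whitespace, whereas a real verifier would need to handle many more answer formats. Both function names are hypothetical.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1                      # we are inside the opening brace
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:         # matching close of \boxed{
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None                    # unbalanced braces: treat as no answer


def rule_based_reward(response: str, gold: str) -> float:
    """1.0 if the boxed answer matches gold up to whitespace, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    normalize = lambda s: "".join(s.split())
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


print(rule_based_reward(r"Therefore the result is \boxed{ 42 }.", "42"))
```

Exact string matching would mark equivalent forms such as 0.5 and 1/2 as different, which is one reason a dedicated verification library is used in practice.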

Novel Architectural Elements

None (Novelty is in the optimization algorithm and analysis, not the inference architecture)

Modeling

Base Model: Qwen2.5-Math-7B (primary), Qwen2.5-Math-1.5B, Llama-3.2-3B-FineMath

Training Method: Reinforcement Learning (Dr. GRPO)

Objective Functions:

Purpose: Optimize policy without length bias.

Formally: Dr. GRPO removes the 1/|o_i| and 1/std(R) terms found in standard GRPO, using Monte Carlo returns with an unbiased baseline.
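The difference between the two objectives can be written out as follows. This is a reconstruction from the standard GRPO formulation rather than a quotation from the paper: G (group size), \rho_{i,t} (token-level probability ratio between the current and old policy), \epsilon (clip range), and R(q, o_i) (outcome reward) are notation chosen here, and any KL regularization term is omitted for brevity.

```latex
% Standard GRPO: per-response length normalization and std-normalized advantage
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\left[
      \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
      \min\left( \rho_{i,t}\, \hat{A}_i,\;
                 \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \right)
    \right],
\qquad
\hat{A}_i = \frac{R(q, o_i) - \operatorname{mean}\left(\{R(q, o_j)\}_{j=1}^{G}\right)}
                 {\operatorname{std}\left(\{R(q, o_j)\}_{j=1}^{G}\right)}

% Dr. GRPO: the 1/|o_i| factor and the std division are both removed
\mathcal{J}_{\mathrm{Dr.\,GRPO}}(\theta)
  = \mathbb{E}\left[
      \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|}
      \min\left( \rho_{i,t}\, \tilde{A}_i,\;
                 \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \tilde{A}_i \right)
    \right],
\qquad
\tilde{A}_i = R(q, o_i) - \operatorname{mean}\left(\{R(q, o_j)\}_{j=1}^{G}\right)
```

The only changes between the two expressions are the missing 1/|o_i| factor in front of the token sum and the missing std term in the advantage's denominator.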

Key Hyperparameters:

compute: 8x A100 GPUs
training_time: 27 hours (for 7B model)
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: 27 hours on 8x A100 GPUs for the 7B model recipe

Comparison to Prior Work

vs. DeepSeek-R1-Zero: Uses Qwen2.5 base instead of DeepSeek-V3-Base; uses Dr. GRPO instead of GRPO to fix length bias
vs. GRPO (Standard): Removes length normalization (1/|o|) and standard deviation division to prevent optimization bias towards long incorrect answers
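The effect of the standard deviation division can be illustrated with two toy reward groups. The sketch assumes binary outcome rewards, a population standard deviation, and a small epsilon in the denominator; these choices and the function names are assumptions for illustration, not details confirmed by the source.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Mean-centered rewards divided by the group's standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered rewards only, with no std division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


balanced = [1.0, 1.0, 0.0, 0.0]                 # medium-difficulty question
mostly_right = [1.0] * 7 + [0.0]                # easy question, one failure

for name, group in [("balanced", balanced), ("mostly_right", mostly_right)]:
    print(name, grpo_advantages(group)[-1], dr_grpo_advantages(group)[-1])
```

Under these assumptions, the single wrong answer in the low-variance group receives a much larger normalized advantage than a wrong answer in the balanced group, so questions with nearly uniform outcomes are weighted more heavily when the std division is kept.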

Limitations

Analysis primarily focuses on math reasoning; generalization to coding or general domains is less explored
Relies on rule-based verification (outcome reward), so applicable mainly to tasks with clear ground truth
Does not explicitly report statistical significance tests for the performance gaps
Exact hyperparameters (learning rate, batch size) for the best run are not listed in the main text

Reproducibility

Code: https://github.com/sail-sg/understand-r1-zero

Code and models are publicly available at https://github.com/sail-sg/understand-r1-zero. A minimalist recipe is provided (Qwen2.5-Math-7B + Dr. GRPO + Qwen-Math template). Training data is MATH levels 3-5. The hyperparameters for the specific run (learning rate, batch size) are not fully enumerated in the text but are implied to be in the code.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using greedy decoding and max 3000 tokens

Benchmarks:

AIME 2024 (Competition Math)
AMC (Competition Math)
MATH500 (General Math QA)
Minerva Math (Math QA)
OlympiadBench (Competition Math)

Metrics:

Accuracy (Pass@1)
Pass@8
Average Token Length
Statistical methodology: Not explicitly reported in the paper
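The source does not say how Pass@8 is estimated. As an assumption, the sketch below uses the commonly used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), where n is the number of sampled responses per problem and c is the number of correct ones. The function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n samples, c of which are correct, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 8 samples and 2 correct, pass@1 is the plain fraction correct.
print(pass_at_k(n=8, c=2, k=1))
print(pass_at_k(n=8, c=2, k=8))
```

With k = 1 the estimator reduces to the fraction of correct samples, which matches the usual reading of Pass@1 accuracy.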

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Analysis of base model capabilities reveals that Qwen2.5-Math models perform significantly better without any prompt templates, suggesting they were pretrained in an SFT-like manner.
Average (5 benchmarks)	Accuracy	50.8	81.3	+30.5
Average (5 benchmarks)	Accuracy	29.9	59.9	+30.0
RL Training results using the proposed Dr. GRPO algorithm.
AIME 2024	Accuracy	Not reported in the paper	43.3	Not reported in the paper

Experiment Figures

Response Length and Training Reward curves comparing GRPO and Dr. GRPO

Answering rate of different base models with different templates

Main Takeaways

Qwen2.5-Math models work best without templates (100% answering rate), suggesting they are already SFT-like, while Llama/DeepSeek require templates.
The 'Aha moment' (self-reflection) appears in DeepSeek-V3-Base before RL training, challenging the claim that it is purely an RL-emergent property.
Dr. GRPO prevents the 'overthinking' phenomenon where models generate excessively long incorrect responses, leading to better token efficiency than standard GRPO.
Domain-specific pretraining (e.g., FineMath) allows even weaker base models like Llama-3.2 to improve significantly via RL.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, Policy Gradients)
Language Model Pretraining and Post-training
Chain-of-Thought (CoT) Reasoning

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same input, avoiding the need for a separate value network

Dr. GRPO: GRPO Done Right—the authors' proposed unbiased variant of GRPO that removes response-length normalization and standard deviation division to recover the standard PPO objective

Aha moment: The phenomenon where a model self-corrects or reflects during generation (e.g., saying 'Wait, let me recheck'), typically associated with advanced reasoning

SFT: Supervised Fine-Tuning—training a model on labeled examples (question-answer pairs)

PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to ensure stability

Token efficiency: How much accuracy a model achieves relative to the number of tokens it generates; improving it here means avoiding unnecessarily long incorrect responses while maintaining accuracy

Overthinking: A failure mode where reasoning models generate excessively long chains of thought without reaching a correct answer, often exacerbated by optimization bias