PPO: Proximal Policy Optimization—an RL algorithm that uses a clipped surrogate objective to ensure stable policy updates
GRPO: Group Relative Policy Optimization—an RL algorithm used by DeepSeek-R1 that estimates advantages by averaging rewards within a group of samples, avoiding a learned value function
GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' (how good an action is) by balancing bias and variance
KL regularization: Kullback-Leibler divergence penalty—usually added to the RL reward to keep the trained policy close to the initial (reference) policy; removed in this paper
Reasoner-Zero: A training paradigm where a base LLM is trained via RL directly to reason, without prior Supervised Fine-Tuning (SFT)
Credit assignment: The problem of determining which specific past actions (tokens) contributed to the final reward
Value function: A learned network (Critic) that predicts the expected future reward from a given state
Discount factor: Gamma—a parameter in RL that determines how much future rewards are valued compared to immediate ones
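Several of these quantities can be made concrete with a short sketch. The snippet below is a minimal illustration in plain Python, not the paper's implementation: the function names, the per-step reward/value layout, and the standard-deviation normalization in the GRPO variant are assumptions for exposition.

```python
from statistics import mean, pstdev

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: list of T per-step rewards.
    values:  list of T+1 critic estimates (last entry is the value of
             the state after the final step).
    gamma (discount factor) weights future vs. immediate reward;
    lam trades off bias (low lam) against variance (high lam).
    """
    T = len(rewards)
    adv = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: one-step advantage estimate at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate for one token.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - eps, 1 + eps] caps how far a single update can move the
    policy, which is what stabilizes training.
    """
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sample's reward against
    its group's statistics, so no learned value function is needed."""
    m = mean(group_rewards)
    s = pstdev(group_rewards)
    return [(r - m) / (s + eps) for r in group_rewards]
```

Note how `gae_advantages` needs the critic's `values` (the learned value function), while `grpo_advantages` needs only the rewards of the sampled group; that trade is the point of GRPO.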