
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, H. Shum
Tsinghua University
arXiv.org (2025)
Tags: RL · Reasoning · Benchmark

📝 Paper Summary

Topic: Large-scale reinforcement learning for LLM reasoning capabilities (chain-of-thought)
Open-Reasoner-Zero demonstrates that vanilla PPO with specific GAE settings and no KL regularization enables stable, scalable reasoning reinforcement learning on base models, achieving superior efficiency over DeepSeek-R1-Zero.
Core Problem
Reproducing the scaling laws of reasoning-oriented RL (like DeepSeek-R1-Zero) is difficult due to training instability, lack of open implementation details, and the complexity of tuning algorithms like GRPO.
Why it matters:
  • Current proprietary models (o1, DeepSeek-R1) show reasoning scales with compute, but the methods are not fully democratized
  • Standard RLHF relies on complex KL regularization and SFT warm-ups, which may limit exploration potential on base models
  • GRPO lacks a value function for precise token-level credit assignment, leading to instability like infinite repetition loops
Concrete Example: Under GRPO, a model can fall into a loop of generating the same phrase; without a critic to assign low value to these redundant tokens, the policy collapses. PPO's critic identifies the repetition as a low-value state and steers the policy away from it.
Key Novelty
Minimalist PPO for Reasoner-Zero
  • Replaces GRPO with vanilla PPO using a learned critic to provide better advantage estimation and credit assignment for reasoning steps
  • Eliminates KL regularization completely to allow maximal exploration without 'alignment tax' or reference model overhead
  • Simplifies GAE to a bias-free configuration (gamma=1, lambda=1) that treats the entire reasoning chain as equally important for the final reward
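The GAE setting above has a clean interpretation: with gamma=1 and lambda=1, GAE collapses to the Monte Carlo return minus the critic's value estimate, so every token in the chain shares credit for the final outcome reward. A minimal sketch (illustrative, not the authors' actual implementation):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation over one rollout.

    rewards[t] and values[t] for t = 0..T-1; values carries one extra
    bootstrap entry values[T] (0.0 for a terminal state).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Standard backward recursion: A_t = delta_t + gamma*lam*A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Outcome-reward setup: a single reward of 1.0 at the last token.
# With gamma = lam = 1, each token's advantage is simply
# (final reward) - V(s_t), i.e. the unbiased return minus the baseline.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.6, 0.8, 0.0]  # last entry is the bootstrap value
adv = gae_advantages(rewards, values)
# adv[t] equals 1.0 - values[t] for every t
```

This is why the paper can drop the bias/variance tuning of gamma and lambda entirely: the critic's only job is to supply the per-token baseline V(s_t).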
Evaluation Highlights
  • Achieves superior performance on AIME 2024, MATH-500, and GPQA Diamond compared to DeepSeek-R1-Zero-Qwen-32B (using the same base model)
  • Requires only about one-tenth of the training steps of the DeepSeek-R1-Zero pipeline while maintaining scalability
  • Demonstrates that unaligned base models can self-learn formatting constraints purely through binary outcome rewards, without specific format-shaping rewards
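A binary outcome reward of this kind can be sketched as follows; the function name and the `\boxed{}` answer-extraction convention are illustrative assumptions, not taken from the paper's code, but they capture the key point that correctness alone drives the reward, with no format-shaping term:

```python
import re

def outcome_reward(response: str, reference_answer: str) -> float:
    """Binary reward: 1.0 iff the final boxed answer matches the reference.

    Assumes the common math-reasoning convention of writing the final
    answer inside \\boxed{...}; an unparseable response earns 0.0,
    which is how formatting gets learned implicitly.
    """
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0  # no extractable final answer -> zero reward
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0
```

Because a malformed response simply scores 0.0, the policy has an incentive to adopt the expected answer format even though no explicit format reward exists.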
Breakthrough Assessment
9/10
Significantly democratizes 'O1-like' training by providing the first open implementation that simplifies the recipe (PPO > GRPO, No KL) while outperforming the previous state-of-the-art open reproduction.