Zero RL training: Reinforcement learning applied directly to a pre-trained base model without an intermediate supervised fine-tuning (SFT) stage
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same prompt to reduce variance, eliminating the need for a separate value function critic
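The group-relative normalization at the heart of GRPO can be sketched in a few lines. This is a minimal illustration, not the full algorithm: `group_relative_advantages` is a hypothetical helper assuming per-sample scalar rewards for several completions of the same prompt, with advantages computed as the reward minus the group mean, divided by the group standard deviation.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within a group of outputs sampled from the
    same prompt: A_i = (r_i - mean(r)) / (std(r) + eps).

    The group mean serves as the baseline, so no learned value
    function (critic) is needed; eps guards against a zero std when
    all rewards in the group are identical.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    eps = 1e-8
    return [(r - mean) / (std + eps) for r in rewards]

# e.g., four sampled answers to one prompt, scored 1.0 if correct else 0.0;
# correct answers get positive advantages, incorrect ones negative
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```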
aha moment: The point during training where a model spontaneously exhibits advanced reasoning behaviors like self-verification or backtracking without being explicitly taught them
pass@k: A metric measuring the probability that at least one correct answer is found in k generated samples
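One common way to compute pass@k without high-variance direct sampling is the combinatorial estimator: generate n samples per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct sample. A sketch, assuming that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples of which c are correct:
    1 - C(n - c, k) / C(n, k), i.e. one minus the probability that a
    random subset of k samples contains no correct answer.
    """
    if n - c < k:
        # fewer than k incorrect samples exist, so every size-k
        # subset must contain at least one correct answer
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # for k=1 this reduces to c/n
print(pass_at_k(n=10, c=3, k=5))
```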
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training a model on demonstrated examples (input-output pairs) to teach it specific behaviors or formats
format reward: A reward signal given specifically for adhering to a structural constraint (e.g., enclosing the answer in \boxed{}) rather than correctness
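A format reward is typically implemented as a simple pattern check on the model's output. A minimal sketch, assuming the \boxed{} convention mentioned above (the function name and binary 0/1 reward values are illustrative choices, not a fixed standard):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion encloses an answer in \\boxed{...},
    else 0.0 -- this rewards adherence to the output format only and
    says nothing about whether the boxed answer is correct."""
    return 1.0 if re.search(r"\\boxed\{[^}]*\}", completion) else 0.0

print(format_reward("The answer is \\boxed{42}."))  # 1.0
print(format_reward("The answer is 42."))           # 0.0
```

In practice such a format term is often combined with a correctness reward, so the model is nudged toward parseable output even before it starts answering correctly.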
cold start: Initializing the RL training process with a model that has already undergone SFT, rather than the raw base model