DAPO: An Open-Source LLM Reinforcement Learning System at Scale

📝 Paper Summary

Large-Scale Reinforcement Learning for LLMs | Reasoning Models (Chain-of-Thought)

DAPO is an open-source reinforcement learning system that stabilizes large-scale reasoning training by decoupling clipping bounds, dynamically sampling informative prompts, and shaping length-based rewards.

Core Problem

Reproducing state-of-the-art reasoning models like DeepSeek-R1 is difficult because standard RL algorithms (PPO, GRPO) suffer from entropy collapse, training instability, and reward noise when scaled to long chain-of-thought tasks.

Why it matters:

Key technical details of top reasoning models (OpenAI o1, DeepSeek R1) are concealed, hindering community reproduction
Naive application of algorithms like GRPO leads to rapid entropy collapse, where the model stops exploring and outputs deterministic, suboptimal responses
Zero-gradient samples (prompts where every sampled response is correct, or every one is incorrect) waste up to 50% of computational resources during training

Concrete Example: When training Qwen2.5-32B with standard GRPO, the policy entropy collapses quickly, causing the model to generate nearly identical responses for a prompt. This prevents it from exploring complex reasoning paths, stalling accuracy at 30% on AIME 2024 compared to 50% with DAPO.

Key Novelty

Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

Decouples the PPO clipping range: uses a higher upper bound to allow low-probability 'exploration' tokens to increase their likelihood faster, preventing the model from getting stuck in local optima
Dynamically filters out 'zero-gradient' prompts (where all group responses have identical rewards) during rollout, ensuring every training batch contains only informative samples with variance
Shifts from sample-level to token-level loss weighting to ensure long Chain-of-Thought sequences influence updates proportionally to their length
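The decoupled clipping idea in the first bullet can be illustrated for a single token. This is a minimal sketch, not the authors' implementation: the function name `clipped_surrogate` is hypothetical, and the default bounds are the `clip_epsilon_low` and `clip_epsilon_high` values listed under Key Hyperparameters.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate for one token, with separate lower and upper bounds."""
    # Clamp the importance ratio into [1 - eps_low, 1 + eps_high].
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Pessimistic choice between the raw and the clipped term, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)


# A low-probability token whose likelihood grew by 27% (ratio 1.27), positive advantage.
decoupled = clipped_surrogate(1.27, 1.0)                # still inside the raised upper bound
symmetric = clipped_surrogate(1.27, 1.0, eps_high=0.2)  # capped at 1 + 0.2 by symmetric clipping
print(decoupled, symmetric)
```

With symmetric bounds the update for this token is already clipped, so it contributes no further gradient; with the raised upper bound the token can keep gaining probability, which is the exploration effect the bullet describes.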

Architecture

The DAPO training loop logic

Evaluation Highlights

Achieves 50.0% accuracy on AIME 2024 using Qwen2.5-32B, surpassing the 47% reported by DeepSeek-R1-Zero-Qwen-32B
Outperforms naive GRPO baseline (30% accuracy) by 20 percentage points on AIME 2024
Reaches state-of-the-art performance using only 50% of the training steps required by DeepSeek-R1-Zero

Breakthrough Assessment

9/10

Significantly democratizes 'o1-like' reasoning training by open-sourcing a stable, reproducible recipe that matches or beats proprietary baselines with greater efficiency.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning from verifiable rewards (RLVR) for mathematical reasoning

Inputs: Math problem prompt q

Outputs: Chain-of-thought reasoning trace and final integer answer a

Pipeline Flow

Input Processing (Prompt formatting)
Reasoning Generation (Qwen2.5-32B generating long CoT)
Output Verification (Rule-based integer extraction)

System Modules

Qwen2.5-32B (Base)

Generate long chain-of-thought reasoning and final answer

Model or implementation: Qwen2.5-32B base model (used as the initialization checkpoint)

Rule-based Verifier

Check correctness of final answer against ground truth

Model or implementation: Deterministic function
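A rule-based verifier of this kind might look like the sketch below. The summary only says that an integer is extracted by rule and compared with the ground truth, so the `Answer:` marker, the function names, and the +1/-1 reward values are all assumptions made for illustration.

```python
import re
from typing import Optional


def extract_integer_answer(response: str) -> Optional[int]:
    """Return the integer after the last 'Answer:' marker, or None if there is none."""
    # Assumed output convention: the final answer is written as 'Answer: <integer>'.
    matches = re.findall(r"Answer:\s*\$?(-?\d+)", response)
    return int(matches[-1]) if matches else None


def verify(response: str, ground_truth: int) -> float:
    """Deterministic reward: +1.0 for a matching integer, -1.0 otherwise (assumed scale)."""
    return 1.0 if extract_integer_answer(response) == ground_truth else -1.0


print(verify("We count the cases and sum them.\nAnswer: 204", 204))
print(verify("I could not finish the computation.", 204))
```

Because the check is a pure function of the response text, the same response always receives the same reward, with no learned reward model involved.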

Novel Architectural Elements

Dynamic Sampling Pipeline: The rollout stage actively discards prompts where all G responses yield identical rewards, continuing sampling until the batch is full of informative data
Token-Level Loss Aggregation: Loss is computed by summing token losses across the entire batch and normalizing by total tokens, rather than averaging per sample first
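The dynamic sampling behavior described above can be sketched as a filter inside the rollout loop. This is an illustrative outline only: `fill_batch`, `is_informative`, and the `FAKE_REWARDS` stand-in for "generate G responses and score them" are hypothetical names, not part of the released code.

```python
from typing import Callable, Iterable, List, Tuple

Group = Tuple[str, List[float]]  # (prompt, rewards of its G sampled responses)


def is_informative(rewards: List[float]) -> bool:
    """A group yields a nonzero advantage only if its rewards are not all identical."""
    return len(set(rewards)) > 1


def fill_batch(prompts: Iterable[str],
               sample_rewards: Callable[[str], List[float]],
               batch_size: int) -> List[Group]:
    """Keep sampling prompts until batch_size informative groups are collected."""
    batch: List[Group] = []
    for prompt in prompts:
        if len(batch) == batch_size:
            break
        rewards = sample_rewards(prompt)
        if is_informative(rewards):
            batch.append((prompt, rewards))  # discard all-correct / all-wrong groups
    return batch


# Hypothetical reward lookup standing in for real generation and verification.
FAKE_REWARDS = {
    "easy": [1.0, 1.0, 1.0, 1.0],
    "hard": [-1.0, -1.0, -1.0, -1.0],
    "mixed": [1.0, -1.0, -1.0, 1.0],
}

batch = fill_batch(["easy", "mixed", "hard", "mixed"], FAKE_REWARDS.__getitem__, batch_size=2)
print([prompt for prompt, _ in batch])
```

The trade-off is visible in the loop: four prompts had to be sampled to obtain two usable groups, which is the extra rollout cost noted under Limitations.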

Modeling

Base Model: Qwen2.5-32B

Training Method: DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)

Objective Functions:

Purpose: Optimize policy while preventing entropy collapse by allowing larger updates for low-probability tokens.

Formally: Uses asymmetric clipping with epsilon_high > epsilon_low.

Purpose: Eliminate wasted computation from uninformative samples.

Formally: Filters out groups where std(Reward) = 0.

Purpose: Ensure long reasoning chains are weighted proportionally to their length.

Formally: Token-Level Policy Gradient Loss.
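Read together, the three components above can be written as one objective. The notation below is introduced here for illustration and does not appear in this summary: G is the group size, o_i the i-th sampled response with reward R_i, r_{i,t} the importance sampling ratio of token t, and \hat{A}_{i,t} the group-normalized advantage. For binary rewards, the constraint on the number of correct responses is equivalent to requiring std(Reward) > 0.

```latex
\[
\mathcal{J}_{\mathrm{DAPO}}(\theta)
= \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
\frac{1}{\sum_{i=1}^{G}|o_i|}
\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
\min\!\Big(
r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\mathrm{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{\mathrm{low}},\,1+\epsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}
\Big)
\right]
\quad \text{s.t.}\quad
0 < \big|\{\,o_i : o_i \text{ is correct}\,\}\big| < G
\]
\[
r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{R_i - \mathrm{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_j\}_{j=1}^{G}\big)}
\]
```

The asymmetric clip range corresponds to the first item, the constraint to the filtering rule, and the single normalizer \sum_i |o_i| to the token-level loss.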

Training Data:

DAPO-Math-17K dataset
17K prompts with integer answers selected/transformed from web and competition sources

Key Hyperparameters:

learning_rate: 1e-6
optimizer: AdamW
rollout_batch_size: 512 prompts * 16 responses = 8192
mini_batch_size: 512
clip_epsilon_low: 0.2
clip_epsilon_high: 0.28
max_generation_length: 20,480 tokens (16,384 expected + 4,096 buffer)
warmup_steps: 20
inference_temperature: 1.0
inference_top_p: 0.7
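The split of `max_generation_length` into an expected length plus a buffer suggests a length-based reward term that only applies inside the buffer. A minimal sketch follows, assuming a linear ramp from 0 to -1 across the 4,096-token buffer and a flat -1 beyond the hard limit; the function name and the exact shape of the ramp are assumptions, not values stated in this summary.

```python
def length_penalty(num_tokens: int, max_len: int = 20_480, buffer: int = 4_096) -> float:
    """Length-based reward term: zero up to the expected length, linear ramp inside the buffer."""
    soft_start = max_len - buffer  # 16,384 tokens with the defaults above
    if num_tokens <= soft_start:
        return 0.0                 # within the expected length: no penalty
    if num_tokens <= max_len:
        return (soft_start - num_tokens) / buffer  # falls linearly from 0 to -1
    return -1.0                    # beyond the hard limit (assumed flat penalty)


for n in (12_000, 16_384, 18_432, 20_480):
    print(n, length_penalty(n))
```

A graded penalty like this gives the policy a signal to shorten responses before they are truncated, rather than treating a slightly overlong but otherwise sound response the same as a wrong one.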

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1-Zero: DAPO uses asymmetric clipping (Clip-Higher) and dynamic sampling to prevent entropy collapse and improve efficiency
vs. Naive GRPO: DAPO uses token-level loss normalization instead of sample-level to handle varying CoT lengths better
vs. Standard PPO: DAPO eliminates the critic model (like GRPO) but adds stability features specifically for long-context reasoning
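The difference between sample-level and token-level normalization in the GRPO comparison is easiest to see on a toy batch. The sketch below uses made-up per-token loss values and hypothetical function names; it only illustrates the two averaging orders.

```python
from typing import List


def sample_level_loss(token_losses: List[List[float]]) -> float:
    """Average within each response first, then across responses."""
    per_sample = [sum(seq) / len(seq) for seq in token_losses]
    return sum(per_sample) / len(per_sample)


def token_level_loss(token_losses: List[List[float]]) -> float:
    """Sum over every token in the batch and divide by the total token count."""
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count


short_response = [1.0, 1.0]    # 2 tokens, illustrative loss values
long_response = [0.25] * 8     # 8 tokens, illustrative loss values
batch_losses = [short_response, long_response]

print(sample_level_loss(batch_losses))  # each response counts equally
print(token_level_loss(batch_losses))   # each token counts equally
```

Under sample-level averaging each of the eight tokens in the long response carries a quarter of the weight of a token in the short one; under token-level averaging every token carries the same weight, so long chains of thought influence the update in proportion to their length.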

Limitations

Evaluation is limited to a single domain (mathematics) and benchmark (AIME 2024)
Requires verifiable rewards (ground truth answers), making it less applicable to open-ended creative tasks
Dynamic sampling may increase wall-clock time per step if many samples are filtered out (though total steps to convergence decrease)

Reproducibility

Code: https://dapo-sia.github.io/

Highly reproducible. The paper open-sources the training code (built on the verl framework), the DAPO-Math-17K dataset, and the algorithm details. Hyperparameters are fully listed.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning on competitive problems

Benchmarks:

AIME 2024 (Competition-level Mathematics)

Metrics:

Accuracy (avg@32)
Statistical methodology: The evaluation set is run 32 times and the mean accuracy (avg@32) is reported to reduce sampling variance
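The avg@k metric can be computed as the mean of per-repetition accuracies. The sketch below uses k = 4 and two problems with invented outcomes to stay readable; the function name `avg_at_k` and the data are illustrative only.

```python
from typing import List


def avg_at_k(correct: List[List[bool]]) -> float:
    """correct[r][p] is True if problem p was solved in repetition r."""
    per_run_accuracy = [sum(run) / len(run) for run in correct]
    return sum(per_run_accuracy) / len(per_run_accuracy)


# Four repetitions over two problems (made-up outcomes).
runs = [
    [True, False],
    [True, True],
    [False, False],
    [True, False],
]
print(avg_at_k(runs))
```

Since sampling at temperature 1.0 makes each run stochastic, averaging over repetitions gives a steadier estimate than a single pass over a small benchmark.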

Key Results

Benchmark	Metric	Baseline	Baseline (%)	DAPO (%)	Δ (pp)
AIME 2024	Accuracy (avg@32)	DeepSeek-R1-Zero-Qwen-32B	47	50	+3
AIME 2024	Accuracy (avg@32)	Naive GRPO	30	50	+20

Experiment Figures

AIME 2024 Accuracy curves during training for DAPO vs. DeepSeek-R1-Zero

Analysis of entropy and KL divergence during training

Main Takeaways

DAPO outperforms DeepSeek's RL method on the same base model (Qwen2.5-32B) while converging in half the steps.
Vanilla GRPO suffers significantly from entropy collapse and reward noise, capping performance at ~30% on AIME.
Filtering out zero-gradient samples (Dynamic Sampling) accelerates convergence by ensuring every batch provides useful learning signals.
Decoupled clipping (Clip-Higher) is critical for maintaining exploration and preventing the policy from becoming deterministic too early.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Chain-of-Thought (CoT) Prompting
Policy Gradients

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs for the same prompt, eliminating the need for a critic model

Entropy Collapse: A failure mode in RL where the policy becomes deterministic too early, stopping exploration and degrading performance

Importance Sampling Ratio: The ratio of the probability of an action under the new policy to its probability under the old policy, used in PPO to constrain updates

Zero-Gradient Data: Prompts where all generated responses in a group receive the exact same reward (e.g., all correct), resulting in zero advantage and zero gradient signal

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning capabilities
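The GRPO and Zero-Gradient Data entries above can be tied together with a short sketch of group-relative advantages. The function name is hypothetical, and the use of the population standard deviation with a small epsilon in the denominator is an assumption; implementations differ on both details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed_group = group_relative_advantages([1.0, -1.0, -1.0, 1.0])
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed_group)   # correct responses get positive advantage, incorrect ones negative
print(all_correct)   # every advantage is zero, so the group contributes no gradient
```

The second call is the zero-gradient case: when all responses share the same reward, every advantage vanishes and the prompt contributes nothing to the update.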