Evaluation Setup
Task: mathematical reasoning on competition-level problems
Benchmarks:
- AIME 2024 (Competition-level Mathematics)
Metrics:
- Accuracy (avg@32)
- Statistical methodology: each problem is evaluated 32 times and the average accuracy is reported to ensure stability
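The avg@32 protocol above can be sketched as follows; the function names are hypothetical, and per-attempt correctness is assumed to come from an exact-match grader:

```python
def avg_at_k(attempt_results, k=32):
    # attempt_results: k booleans, one per independently sampled solution
    # to a single problem (True if the final answer is correct).
    assert len(attempt_results) == k
    return sum(attempt_results) / k

def benchmark_avg_at_k(per_problem_attempts, k=32):
    # Benchmark score: mean of per-problem avg@k over all problems.
    return sum(avg_at_k(a, k) for a in per_problem_attempts) / len(per_problem_attempts)
```

Averaging over 32 samples per problem reduces the variance that a single greedy or sampled run would show on a 30-problem benchmark like AIME.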
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| AIME 2024 | Accuracy | 47 | 50 | +3 |
| AIME 2024 | Accuracy | 30 | 50 | +20 |
Main Takeaways
- DAPO outperforms DeepSeek's RL method on the same base model (Qwen2.5-32B) while converging in half the steps.
- Vanilla GRPO suffers significantly from entropy collapse and reward noise, capping performance at ~30% on AIME.
- Filtering out zero-gradient samples (Dynamic Sampling) accelerates convergence by ensuring every batch provides useful learning signals.
- Decoupled clipping (Clip-Higher) is critical for maintaining exploration and preventing the policy from becoming deterministic too early.
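The last two takeaways can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical and the clipping bounds shown are illustrative defaults, with the upper bound deliberately larger than the lower one:

```python
def clip_higher_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # Decoupled (Clip-Higher) clipped surrogate: unlike symmetric PPO-style
    # clipping, eps_high > eps_low leaves more headroom for up-weighting
    # low-probability tokens, which helps keep policy entropy from collapsing.
    clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped * advantage)

def dynamic_sampling_filter(groups):
    # Dynamic Sampling: drop prompt groups whose sampled rollouts all receive
    # the same reward (all-correct or all-wrong). Their group-normalized
    # advantages are zero, so they contribute no gradient signal; keeping
    # only mixed-outcome groups ensures every batch is informative.
    return [g for g in groups if len(set(g["rewards"])) > 1]
```

In this sketch, a `ratio` of 2.0 with positive advantage is capped at `1 + eps_high = 1.28` rather than the symmetric `1.2`, which is the mechanism the Clip-Higher takeaway refers to.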