Evaluation Setup
Task: mathematical reasoning requiring long Chain-of-Thought generation
Benchmarks:
- AIME 2024 (Mathematical Reasoning)
Metrics:
- Accuracy (Score)
- Training Stability (crash occurrence)
- Statistical methodology: Not explicitly reported in the paper
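The "Score" metric is plain answer accuracy: the percentage of benchmark problems whose final answer matches the reference. A minimal sketch, using placeholder predictions and answers (not data from the paper):

```python
# Hypothetical sketch of the accuracy ("Score") metric on an
# AIME-style benchmark. The answers below are placeholders.

def accuracy(predictions, references):
    """Percentage of problems whose final answer matches the reference."""
    assert len(predictions) == len(references) and references
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

preds = ["204", "113", "902", "33"]
refs  = ["204", "110", "902", "33"]
print(accuracy(preds, refs))  # → 75.0
```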
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| AIME 2024 | Score  | 5        | 60.4       | +55.4 |
| AIME 2024 | Score  | 50.4     | 60.4       | +10.0 |
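The Δ column is simple arithmetic (the paper's score minus the baseline score), which can be checked directly:

```python
# Recompute the Δ column: paper score minus baseline score.
rows = [
    ("AIME 2024", "Score", 5.0, 60.4),
    ("AIME 2024", "Score", 50.4, 60.4),
]
for bench, metric, baseline, paper in rows:
    print(f"{bench} {metric}: {paper - baseline:+.1f}")
```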
Main Takeaways
- VAPO significantly outperforms state-of-the-art value-model-free methods (DAPO, DeepSeek-R1-Zero-Qwen-32B) by over 10 points on AIME 2024.
- The method achieves high training stability, with no crashes reported across multiple independent runs, unlike typical value-based RL on complex tasks.
- Convergence is efficient, reaching state-of-the-art performance within 5,000 training steps.
- The results validate the hypothesis that value-model-based methods have a higher performance ceiling than value-model-free methods if the value model's bias and variance issues are addressed.
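The final takeaway rests on how a value model is used: value-model-based methods estimate per-token advantages from a learned critic, typically via Generalized Advantage Estimation (GAE). The sketch below shows only standard GAE under the usual definitions; VAPO's specific mitigations for value-model bias and variance are not reproduced here.

```python
# Minimal sketch of standard Generalized Advantage Estimation (GAE),
# the usual value-model-based advantage estimator in PPO-style RL.
# Illustrative only; it does NOT include VAPO's bias/variance fixes.

def gae(rewards, values, gamma=1.0, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = 0.0  # value after the terminal step is zero
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

# Sparse verifiable reward at the final token, as in long-CoT RL:
print(gae(rewards=[0.0, 0.0, 1.0], values=[0.4, 0.5, 0.7]))
```

With a sparse terminal reward, the credit assigned to earlier tokens depends entirely on the value estimates, which is why the value model's bias and variance dominate training quality in this setting.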