Evaluation Setup
The evaluation covers mathematical reasoning tasks, with rewards assigned by rule-based verification of each response's final answer.
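A minimal sketch of what rule-based reward verification typically looks like: extract the model's final answer (here assuming a `\boxed{...}` convention, which is an assumption, not confirmed by the paper) and compare it to the ground truth for a binary reward.

```python
import re

def rule_based_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the
    ground-truth answer string, else 0.0. The \\boxed{...} extraction
    is an illustrative assumption about the answer format."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    # Fall back to the last whitespace-separated token if no \boxed{} found
    pred = m.group(1).strip() if m else response.strip().split()[-1]
    return 1.0 if pred == gold.strip() else 0.0

print(rule_based_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
```

Real verifiers normalize answers further (e.g. simplifying fractions or stripping units), but the binary match-based reward is the core idea.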
Benchmarks:
- AIME 2024 (Mathematical Problem Solving)
- AIME 2025 (Mathematical Problem Solving)
- MATH 500 (Mathematical Problem Solving)
- AMC 2023 (Mathematical Problem Solving)
- Minerva (Scientific Reasoning)
- Olympiad Bench (Competition Math)
Metrics:
- pass@1 (averaged over k=16 responses)
- Statistical methodology: Not explicitly reported in the paper
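The pass@1 metric above can be sketched as follows: estimate pass@1 for each problem as the fraction of its k=16 sampled responses that are correct, then average across problems (function and variable names are illustrative, not from the paper).

```python
def mean_pass_at_1(per_problem_flags):
    """Estimate pass@1 per problem as the fraction of its k sampled
    responses that are correct, then average across all problems."""
    per_problem = [sum(flags) / len(flags) for flags in per_problem_flags]
    return sum(per_problem) / len(per_problem)

# Example: 2 problems, k=4 samples each (paper uses k=16)
flags = [[True, False, True, True], [False, False, True, False]]
print(mean_pass_at_1(flags))  # 0.5  -> (0.75 + 0.25) / 2
```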
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average across 6 math benchmarks | Relative Improvement (%) | 0 | 6 | +6 |

DisCO consistently outperforms baseline RL methods across aggregated mathematical benchmarks.
Main Takeaways
- DisCO significantly outperforms GRPO and its variants (DAPO, Dr. GRPO) across multiple model sizes (1.5B, 7B, 8B)
- Removing the question-level weighting factor (difficulty bias) accelerates learning, particularly for very hard or very easy questions
- The constrained optimization approach maintains stable entropy levels throughout training, avoiding the collapse seen in clipping-based methods
- DisCO is token-efficient, achieving better results with 8k context length than DeepScaleR baselines using 24k context length
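To illustrate the difficulty-bias point above: GRPO normalizes each question's group-relative advantages by the group's reward standard deviation, which shrinks gradients on very hard or very easy questions (where most rewards agree and variance is low). A minimal sketch of that weighting and its removal, under the assumption that dropping the std division is what "removing the question-level weighting factor" refers to (this is a simplified illustration, not DisCO's actual objective):

```python
def group_advantages(rewards, remove_difficulty_weight=True):
    """Group-relative advantages for one question's k sampled responses.
    Dividing by the group's reward std down-weights low-variance (very
    hard or very easy) questions; dropping it removes that bias."""
    k = len(rewards)
    mean = sum(rewards) / k
    centered = [r - mean for r in rewards]
    if remove_difficulty_weight:
        return centered
    std = (sum(c * c for c in centered) / k) ** 0.5 or 1.0  # guard std == 0
    return [c / std for c in centered]
```

For a hard question where only 1 of 4 samples is correct, `group_advantages([1.0, 0.0, 0.0, 0.0])` keeps the raw centered values rather than inflating or deflating them by the group's variance.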