Evaluation Setup
Evaluation on held-out math and general reasoning benchmarks
Benchmarks:
- GSM8K (Math Word Problems)
- MATH (Competition-Level Math Problems)
- MMLU-Pro (General Multi-task Reasoning)
- SuperGPQA (Graduate-Level Reasoning)
Metrics:
- Accuracy
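As a rough illustration of the metric, accuracy here can be read as exact-match between the model's final answer and the reference. The snippet below is a minimal sketch under that assumption; `extract_final_answer` and its normalization rules are hypothetical, not the paper's actual evaluation harness.

```python
import re

def extract_final_answer(completion: str) -> str:
    """Pull the last number-like token from a completion (assumed convention)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else ""

def accuracy(completions: list[str], references: list[str]) -> float:
    """Fraction of items whose extracted answer exactly matches the reference."""
    correct = sum(
        extract_final_answer(c) == r.strip() for c, r in zip(completions, references)
    )
    return correct / len(references)

# Example: two GSM8K-style items, one answered correctly -> 0.5
print(accuracy(["... so the answer is 42.", "The total is 17."], ["42", "18"]))
```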
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (Math Benchmarks) | Accuracy | 44.57 | 51.06 | +6.49 |
| Average (Math Benchmarks) | Accuracy | 49.18 | 54.69 | +5.51 |
| Average (General Reasoning) | Accuracy | 32.06 | 39.60 | +7.54 |
| Average (General Reasoning) | Accuracy | 36.25 | 41.38 | +5.13 |
| Average (Math) | Accuracy | 43.5 | 51.1 | +7.6 |
| Average (Math) | Accuracy | 56.51 | 58.86 | +2.35 |
Main Takeaways
- R-Zero improves reasoning capability from zero data, and the gains are consistent across model sizes (3B, 4B, 8B)
- Math-focused training transfers significantly to general-domain reasoning (MMLU-Pro, SuperGPQA), suggesting fundamental reasoning skills are learned
- Larger models are more resilient to the eventual performance collapse observed in iterative self-training
- Task filtering based on answer consistency is critical; without it, noisy pseudo-labels enter training and performance degrades significantly (a sketch of such a filter follows below)
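The sketch below illustrates one plausible form of consistency-based task filtering, assumed rather than taken from the paper: sample several solver answers per generated task, take the majority answer as the pseudo-label, and keep only tasks whose agreement rate falls inside an informative band. The `solver` callable, sample count, and thresholds are all hypothetical.

```python
from collections import Counter
from typing import Callable

def filter_tasks(
    tasks: list[str],
    solver: Callable[[str], str],  # returns one sampled answer per call
    samples: int = 8,
    low: float = 0.25,             # assumed: drop tasks the solver answers near-randomly
    high: float = 0.75,            # assumed: drop tasks the solver finds trivial
) -> list[tuple[str, str]]:
    """Return (task, pseudo_label) pairs whose answer consistency is informative."""
    kept = []
    for task in tasks:
        answers = [solver(task) for _ in range(samples)]
        label, count = Counter(answers).most_common(1)[0]
        consistency = count / samples
        if low <= consistency <= high:
            kept.append((task, label))
    return kept
```

Keeping only the intermediate-consistency band serves both purposes mentioned above: highly inconsistent answers signal unreliable pseudo-labels (noise), while near-unanimous answers signal tasks too easy to drive learning.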