Evaluation Setup
Mathematical reasoning tasks using verifiable rewards (correct final answer)
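A minimal sketch of what such a verifiable reward can look like, assuming a GSM8K-style `####` final-answer delimiter; the extraction convention and function names are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of a verifiable reward: 1 if the extracted final answer
# matches the reference, else 0. The '####' delimiter follows GSM8K-style
# formatting; this convention is an assumption, not the paper's exact parser.
def extract_final_answer(completion: str) -> str:
    # Take the text after the last '####' marker and normalize lightly.
    return completion.rsplit("####", 1)[-1].strip().rstrip(".").replace(",", "")

def verifiable_reward(completion: str, reference_answer: str) -> float:
    # reference_answer is assumed to already be in normalized form.
    return 1.0 if extract_final_answer(completion) == reference_answer else 0.0
```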
Benchmarks:
- GSM8K (grade-school math word problems)
- MATH (competition-level math problems)
- AIME24 (American Invitational Mathematics Examination, 2024)
- AMC (American Mathematics Competitions)
- Minerva Math (quantitative reasoning)
- OlympiadBench (olympiad-level math problems)
Metrics:
- Accuracy (Pass@1; see the sketch after this list)
- Rollout Time
- Total Training Time
- Statistical methodology: not explicitly reported in the paper
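For reference, a minimal sketch of how Pass@1 can be estimated from multiple sampled responses per prompt, assuming binary 0/1 rewards; this is illustrative, not the paper's evaluation code:

```python
from statistics import mean

def pass_at_1(per_prompt_rewards: list[list[float]]) -> float:
    """Estimate Pass@1: mean per-sample correctness, averaged over prompts.

    per_prompt_rewards[i] holds the 0/1 rewards of all responses
    sampled for prompt i.
    """
    return mean(mean(rewards) for rewards in per_prompt_rewards)

# Example: two prompts, four samples each.
print(pass_at_1([[1, 0, 1, 1], [0, 0, 1, 0]]))  # (0.75 + 0.25) / 2 = 0.5
```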
Key Results
Efficiency results: speedups of GRESO compared to GRPO with Dynamic Sampling (DS) on Qwen2.5-Math-7B.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Training Pipeline | Rollout Speedup | 1.0 | 2.4 | +1.4x (2.4x total) |
| Training Pipeline | Total Training Speedup | 1.0 | 2.0 | +1.0x (2.0x total) |

Accuracy results: GRESO maintains performance parity with the more expensive Dynamic Sampling baseline.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Average (6 Math Benchmarks) | Accuracy | 61.3 | 61.5 | +0.2 |
| Average (6 Math Benchmarks) | Accuracy | 59.3 | 61.5 | +2.2 |
Main Takeaways
- Zero-variance prompts (all sampled responses receive the same reward, i.e., all correct or all incorrect) provide no learning signal in GRPO and constitute a large portion of training data (up to 80% in late stages).
- Prompt difficulty exhibits strong temporal consistency: a prompt that is zero-variance in one epoch is >90% likely to remain zero-variance in the next epoch.
- GRESO successfully exploits this consistency to skip uninformative rollouts, achieving ~2x training speedups without degrading model accuracy.
- Adaptive exploration is crucial: a small fraction of zero-variance prompts do become informative later, so probabilistic rather than deterministic skipping is required (see the sketch after this list).
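A minimal sketch of these two mechanics, assuming standard GRPO group normalization; the `should_skip` schedule below is an illustrative stand-in, not GRESO's exact rule:

```python
import random

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # Group-normalized advantage: A_i = (r_i - mean(r)) / (std(r) + eps).
    # For a zero-variance group (all rewards equal), every numerator is 0,
    # so all advantages are exactly 0 and the prompt yields no gradient.
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]

def should_skip(consecutive_zero_variance_epochs: int,
                base_p: float = 0.5, max_p: float = 0.9) -> bool:
    # Probabilistic rather than deterministic skipping: the skip probability
    # grows with how long the prompt has stayed zero-variance, but is capped
    # below 1 so every prompt keeps some chance of re-exploration. This
    # schedule is an illustrative assumption, not GRESO's exact rule.
    p = min(max_p, base_p * consecutive_zero_variance_epochs)
    return random.random() < p

# A zero-variance group contributes nothing:
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

Capping the skip probability below 1 is what implements the adaptive exploration above: even a prompt that has been zero-variance for many epochs is still occasionally re-rolled.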