Evaluation Setup
Train with SFT, then with RL (GRPO), then evaluate on unseen math benchmarks
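The RL stage uses GRPO, which scores each sampled completion relative to the other completions drawn for the same prompt. A minimal sketch of that group-relative advantage computation (variable names are illustrative, not from the paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: z-score each
    completion's reward against the group of completions sampled
    for the same prompt (no learned value function needed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four completions for one prompt, binary correctness rewards.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the group's better samples.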
Benchmarks:
- AIME24 (Mathematical Competition)
- AIME25 (Mathematical Competition)
- AMC23 (Mathematical Competition)
- GSM8K (Grade School Math)
- MATH500 (Advanced Math)
- GAOKAO-en (College Entrance Exam Math)
- OlympiadBench (Olympiad Math)
- College-MATH (College Math)
Metrics:
- Pass rate (Average across benchmarks)
- Statistical methodology: Not explicitly reported in the paper
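Since the aggregation scheme is not stated, the headline numbers are presumably an unweighted mean of per-benchmark pass rates; a hypothetical helper under that assumption:

```python
def average_pass_rate(per_benchmark):
    """Unweighted mean of per-benchmark pass rates (percentages).

    `per_benchmark` maps benchmark name -> pass rate in [0, 100].
    An unweighted average is assumed, since the paper does not
    report its exact aggregation.
    """
    return sum(per_benchmark.values()) / len(per_benchmark)

# Illustrative (not the paper's) per-benchmark scores:
scores = {"AIME24": 10.0, "GSM8K": 92.0, "MATH500": 74.0}
avg = average_pass_rate(scores)  # (10 + 92 + 74) / 3
```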
Key Results
Performance comparison of SED-SFT against standard Cross-Entropy (CE) and other SFT baselines after the subsequent RL phase:

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (8 benchmarks) | Pass Rate | 47.90 | 49.96 | +2.06 |
| Average (8 benchmarks) | Pass Rate | 62.00 | 63.20 | +1.20 |
| Average (8 benchmarks) | Pass Rate | 47.28 | 49.96 | +2.68 |
| Average (8 benchmarks) | Pass Rate | 46.33 | 49.96 | +3.63 |
Main Takeaways
- SED-SFT consistently improves downstream RL performance compared to standard CE loss and other diversity-focused baselines (GEM, DFT)
- Baselines like DFT performed well during the SFT phase (accuracy-wise) but restricted exploration so much that RL could not recover, leading to worse final performance
- Analysis confirms SED-SFT increases sentence-level diversity (lower Self-BLEU) compared to CE and DFT
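Self-BLEU treats each generated sample as the hypothesis and the remaining samples as references, so lower scores mean less overlap and more diversity. A simplified sketch using clipped bigram precision only (the paper's exact BLEU configuration is an assumption, and `self_bleu` is a hypothetical helper):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    """Clipped n-gram precision of `hyp` against reference token lists."""
    hyp_counts = Counter(ngrams(hyp, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def self_bleu(samples, n=2):
    """Average each sample's n-gram precision against all the others.
    Lower Self-BLEU => samples share fewer n-grams => more diversity."""
    toks = [s.split() for s in samples]
    scores = [modified_precision(toks[i], toks[:i] + toks[i + 1:], n)
              for i in range(len(toks))]
    return sum(scores) / len(scores)

identical = ["the answer is 42"] * 3
diverse = ["the answer is 42",
           "we solve by induction",
           "factor the quadratic first"]
# Identical samples score 1.0; the diverse set scores lower.
```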