Evaluation Setup
Mathematical reasoning is evaluated on six benchmarks under the SFT-then-RLVR paradigm.
Benchmarks:
- GSM8K (Grade school math)
- MATH (Challenging competition math)
- AIME (High-school math competition)
- AMC (American Mathematics Competitions)
- OlympiadBench (Olympiad-level math)
- GaoKao (Chinese college entrance exam math)
Metrics:
- Pass@1
- Pass@k
- Policy Entropy
- Statistical methodology: Not explicitly reported in the paper
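The summary lists Pass@1 and Pass@k but does not give their definitions. A common way to report them is the unbiased estimator of Chen et al. (2021): sample n completions per problem, count c correct ones, and estimate the probability that at least one of k draws is correct. A minimal sketch, assuming this standard estimator (the paper's exact procedure is not stated here):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total completions sampled per problem
    c: number of correct completions
    k: evaluation budget (k <= n)
    Returns P(at least one of k sampled completions is correct).
    """
    if n - c < k:
        # Fewer incorrect samples than the budget: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples, 3 correct
print(pass_at_k(10, 3, 1))  # equals c/n = 0.3 for k=1
```

Pass@1 is the k=1 special case, which reduces to the fraction of correct samples; benchmark-level numbers are typically this estimate averaged over all problems.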
Key Results
Main results on Qwen2.5-1.5B-Math show OXA significantly outperforms conventional SFT across averaged benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Average (6 benchmarks) | Pass@1 | Not reported in the paper | Not reported in the paper | +6.6 |
| Average (6 benchmarks) | Pass@k | Not reported in the paper | Not reported in the paper | +5.5 |
Main Takeaways
- OXA consistently improves mathematical reasoning performance across diverse benchmarks (GSM8K, MATH, AIME, etc.) compared to standard SFT.
- The method successfully mitigates entropy collapse: OXA-trained models exhibit higher policy entropy than SFT models, indicating a broader exploration space.
- Performance gains from OXA are persistent: they are maintained throughout the subsequent, extensive RLVR training phase rather than being washed out.
- The approach is effective across different model scales (tested on 1.5B and 7B parameters).
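The entropy-collapse claim above refers to the policy's token-level Shannon entropy: a collapsed policy concentrates probability on few tokens, leaving little for RLVR exploration. A minimal sketch of how this diagnostic is commonly computed (a standard definition; the summary does not specify the paper's exact measurement):

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions: list[list[float]]) -> float:
    """Average token entropy over the generation steps of a rollout.
    Higher values indicate a broader exploration space."""
    return sum(token_entropy(p) for p in step_distributions) / len(step_distributions)

# A peaked (near-collapsed) distribution vs. a uniform one over 4 tokens:
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
print(token_entropy(peaked) < token_entropy(uniform))  # True
```

In practice the distributions come from the model's softmaxed logits at each decoding step; the comparison between SFT- and OXA-trained models is then a comparison of these averages over evaluation rollouts.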