Evaluation Setup
Evaluation covers mathematical and free-form reasoning tasks, with accuracy as the primary metric.
Benchmarks:
- Mathematical Benchmarks (Mathematical Reasoning)
- MMLU-Pro (Free-form Natural Reasoning)
Metrics:
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Mathematical Benchmarks | Accuracy | 30.7 | 48.1 | +17.4 |
| MMLU-Pro | Accuracy | 32.1 | 50.1 | +18.0 |

EMPO significantly boosts accuracy over base models on both math and general reasoning benchmarks without supervision.
Main Takeaways
- EMPO achieves competitive performance compared to supervised counterparts on both math and free-form reasoning.
- Semantic entropy serves as a potent intrinsic reward signal, showing a strong negative correlation with model accuracy.
- The method works by selecting and prioritizing strong, pre-existing reasoning pathways learned during pre-training rather than teaching new skills from scratch.
- Entropy thresholding helps stabilize unsupervised training by filtering out unreliable traces.
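The takeaways above can be illustrated with a minimal sketch of a semantic-entropy reward with entropy thresholding. Assumptions: semantic clustering is approximated here by exact match on the normalized final answer (the paper uses a stronger semantic equivalence check), and the threshold values `low`/`high` are illustrative placeholders, not the paper's settings.

```python
from collections import Counter
import math


def semantic_entropy(answers):
    # Entropy over semantic clusters of sampled answers. Exact-match
    # grouping of normalized answers stands in for a true semantic
    # equivalence check (an illustrative simplification).
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def cluster_rewards(answers, low=0.2, high=1.0):
    # Intrinsic reward sketch: each sampled answer scores the empirical
    # probability of its semantic cluster, so reinforcing high-reward
    # answers drives entropy down. Prompts whose entropy falls outside
    # [low, high] are filtered out as unreliable or uninformative
    # (thresholds here are assumed values for illustration).
    h = semantic_entropy(answers)
    if not (low <= h <= high):
        return None  # skip this prompt during training
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return [counts[a.strip().lower()] / n for a in answers]
```

For example, four samples `["42", "42", "42", "7"]` form two clusters with probabilities 0.75 and 0.25, yielding per-answer rewards `[0.75, 0.75, 0.75, 0.25]`; a fully uniform set of answers has zero entropy and is filtered out, matching the thresholding takeaway above.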