Evaluation Setup
Reasoning tasks across mathematical and general domains.
Benchmarks:
- MATH500 (Mathematical Reasoning)
- AMC23 (Mathematical Reasoning)
- AIME2024 (Mathematical Reasoning)
- AIME2025 (Mathematical Reasoning)
- SuperGPQA (General Domain Reasoning: Science/Academic)
- MMLU-Pro (General Domain Reasoning)
Metrics:
- pass@1
- Statistical methodology: results on the mathematical datasets are averaged over 16 evaluation runs.
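The reporting protocol above (pass@1 averaged over repeated runs) can be sketched as follows. This is an illustrative helper, not the paper's evaluation harness; the grading of individual answers and the run count are placeholders.

```python
# Hedged sketch: pass@1 averaged over multiple evaluation runs.
# Each run contributes the fraction of problems solved with a single
# sample; the final metric is the mean of those per-run scores.

def pass_at_1(per_run_correct: list[list[bool]]) -> float:
    """Mean pass@1 over runs of single-sample correctness flags."""
    run_scores = [sum(run) / len(run) for run in per_run_correct]
    return sum(run_scores) / len(run_scores)

# Example: 2 runs over 4 problems each (3/4 solved in each run).
runs = [[True, False, True, True], [True, True, False, True]]
print(pass_at_1(runs))  # 0.75
```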
Key Results
General Domain Results (trained on WebInstruct): CER consistently outperforms baselines on MMLU-Pro and SuperGPQA.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MMLU-Pro | pass@1 | 47.5 | 48.1 | +0.6 |
| SuperGPQA | pass@1 | 32.8 | 33.5 | +0.7 |

Mathematical Domain Results (trained on MATH-7.5K): CER is competitive with highly specific rule-based verifiers and outperforms model-based verifiers.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MATH500 | pass@1 | 59.2 | 58.6 | -0.6 |
| MATH500 | pass@1 | 59.2 | 60.1 | +0.9 |
Main Takeaways
- CER is domain-agnostic: It works well on both math and general reasoning without changing the formulation.
- CER provides denser signals than Exact Match: Soft rewards allow learning from partially correct or semantically equivalent answers that fail strict string matching.
- Efficiency via sample reuse: Tensorized computation allows CER to be computed using the same samples generated for exploration, adding negligible training overhead.
- Complementarity: CER combines effectively with rule-based verifiers (Rule+CER) to boost performance further, correcting the sparsity of rules with the softness of CER.
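The complementarity point above can be illustrated with a minimal sketch: a sparse rule-based reward (exact match) backed by a dense soft score when the rule fails. This is not the paper's CER computation; the soft score here is a stand-in (string similarity), and `alpha` is a hypothetical weighting parameter.

```python
# Hedged sketch of a Rule+soft-reward combination. The dense component
# is a placeholder (difflib string similarity); the paper's actual CER
# signal is computed differently.
from difflib import SequenceMatcher

def rule_reward(pred: str, gold: str) -> float:
    # Sparse: 1.0 only on an exact string match, else 0.0.
    return 1.0 if pred.strip() == gold.strip() else 0.0

def soft_reward(pred: str, gold: str) -> float:
    # Dense placeholder: similarity in [0, 1], so near-miss answers
    # still receive partial credit.
    return SequenceMatcher(None, pred.strip(), gold.strip()).ratio()

def combined_reward(pred: str, gold: str, alpha: float = 0.5) -> float:
    # The rule reward dominates when it fires; otherwise fall back to a
    # down-weighted soft signal, softening the rule's sparsity.
    r = rule_reward(pred, gold)
    return r if r > 0 else alpha * soft_reward(pred, gold)

print(combined_reward("42", "42"))    # 1.0 (rule fires)
print(combined_reward("0.50", "0.5"))  # partial credit despite EM failure
```

The design choice to let the rule reward short-circuit mirrors the takeaway: rules stay authoritative when they match, while the soft component rescues semantically close answers that strict matching would score as zero.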