Evaluation Setup
Settings: Best-of-N verification (selecting the best solution from N=64 candidates) and direct step evaluation (judging the correctness of individual steps)
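As a rough illustration of the Best-of-N setting, the sketch below scores each candidate with a step-level scorer and returns the highest-scoring one. It is a minimal sketch under stated assumptions: `toy_prm` is a hypothetical stand-in for a trained PRM, and aggregating step scores with `min` is an assumed choice, since this summary does not say how step scores are combined into a solution score.

```python
from typing import Callable, List, Sequence

Solution = List[str]  # one candidate solution = an ordered list of reasoning steps


def best_of_n(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], List[float]],
    aggregate: Callable[[List[float]], float] = min,  # assumed aggregation rule
) -> int:
    """Return the index of the candidate with the highest aggregated step score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    totals = [aggregate(score_steps(c)) for c in candidates]
    return max(range(len(candidates)), key=lambda i: totals[i])


def toy_prm(solution: Solution) -> List[float]:
    # Hypothetical scorer: a real PRM would return a learned
    # correctness probability for every step.
    return [0.1 if "wrong" in step else 0.9 for step in solution]


candidates = [
    ["set up equation", "wrong simplification", "answer 7"],
    ["set up equation", "simplify", "answer 5"],
]
best_index = best_of_n(candidates, toy_prm)
```

With `min` as the aggregator, a single low-scoring step is enough to reject a candidate. Passing a different `aggregate` function (for example a mean or a product) changes how tolerant the selection is of one weak step.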
Benchmarks:
- WeMath (Multimodal Math Reasoning)
- MathVista (Visual Math QA)
- VisualProcessBench (Step-level Correctness Judgment)
- MATH (Text-only Math Reasoning)
Metrics:
- Accuracy (Pass@1, Best-of-N)
- F1 Score (for step classification)
- Statistical methodology: Not explicitly reported in the paper
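For reference, the F1 score for step classification can be computed from per-step labels as sketched below. This is an illustrative sketch: the label encoding (1 = correct step, 0 = incorrect step), the choice of positive class, and the macro-averaging helper are assumptions, since this summary does not state which F1 variant the benchmark reports.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 for one class, treating `positive` as the class of interest."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)


# Assumed encoding: 1 = step judged correct, 0 = step judged incorrect.
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
f1_incorrect = f1_score(y_true, y_pred, positive=0)
f1_correct = f1_score(y_true, y_pred, positive=1)
f1_macro = macro_f1(y_true, y_pred)
```

Reporting F1 per class (or macro-averaged) matters here because correct steps usually outnumber incorrect ones, so plain accuracy can look high even when incorrect steps are rarely caught.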
Key Results
Test-time scaling results demonstrate Athena-PRM's ability to select correct solutions from generated candidates, significantly improving over the base policy model.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| WeMath | Accuracy (Best-of-64) | 40.5 | 50.7 | +10.2 |
| MathVista | Accuracy (Best-of-64) | 57.6 | 64.7 | +7.1 |
| MATH | Accuracy (Best-of-64) | 39.5 | 48.4 | +8.9 |

Direct evaluation of step correctness shows Athena-PRM outperforms existing judges and PRMs.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| VisualProcessBench | F1 Score | 79.2 | 83.1 | +3.9 |

Data efficiency comparison shows Athena-PRM (5K samples) outperforms vanilla MC methods (300K samples).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH | Accuracy | 47.2 | 48.4 | +1.2 |

Main Takeaways
- Data quality matters more than quantity for PRMs: 5K filtered samples outperform 300K noisy Monte Carlo samples (48.4 vs. 47.2 accuracy on MATH).
- Consistency between weak and strong completers effectively removes the bias inherent in Monte Carlo estimation.
- Initializing Process Reward Models from Outcome Reward Models (ORM) provides a strong 'pre-training' foundation, treating outcome supervision as coarse-grained process supervision.
- Up-sampling negative steps handles the inherent label imbalance in reasoning traces (where correct steps usually outnumber incorrect ones), improving discriminator performance.
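Two of the data-side ideas above, consistency filtering and negative up-sampling, can be sketched as follows. This is a hedged illustration, not the paper's implementation: the `StepLabel` record, the binary label encoding (1 = correct, 0 = incorrect), the exact-agreement rule, and the duplication factor are all assumptions introduced for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class StepLabel:
    step_id: str
    weak_label: int    # label estimated from rollouts of a weak completer
    strong_label: int  # label estimated from rollouts of a strong completer


def filter_consistent(steps: Sequence[StepLabel]) -> List[Tuple[str, int]]:
    """Keep only steps on which both completers agree (assumed agreement rule)."""
    return [(s.step_id, s.weak_label) for s in steps if s.weak_label == s.strong_label]


def upsample_negatives(
    labeled: Sequence[Tuple[str, int]], factor: int, seed: int = 0
) -> List[Tuple[str, int]]:
    """Repeat negative (label 0) steps so each appears `factor` times in total."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    negatives = [item for item in labeled if item[1] == 0]
    out = list(labeled) + negatives * (factor - 1)
    random.Random(seed).shuffle(out)  # seeded shuffle for reproducibility
    return out


raw = [
    StepLabel("s1", 1, 1),
    StepLabel("s2", 1, 0),  # completers disagree: dropped
    StepLabel("s3", 0, 0),
    StepLabel("s4", 1, 1),
    StepLabel("s5", 0, 1),  # completers disagree: dropped
]
clean = filter_consistent(raw)
train = upsample_negatives(clean, factor=3)
```

In this toy run, filtering keeps three of five steps, and up-sampling then repeats the single negative step until it appears three times. In practice the factor would be tuned against the observed ratio of correct to incorrect steps.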