Evaluation Setup
Models are evaluated on their ability to detect the first erroneous step in math reasoning trajectories.
Benchmarks:
- ProcessBench (step-level error detection, real-world errors)
- PRMBench (step-level error detection, synthetic/heuristic errors)
Metrics:
- F1 Score (identifying the first error step)
- Statistical methodology: Not explicitly reported in the paper
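Since the paper does not spell out the metric computation, here is a minimal sketch of one common convention for first-error-step F1: the harmonic mean of accuracy on trajectories that contain an error and accuracy on error-free trajectories (where a correct prediction on an error-free trajectory means predicting "no error", encoded below as -1). The encoding and function name are illustrative assumptions, not taken from the paper.

```python
def first_error_f1(preds, labels):
    """Harmonic mean of accuracy on erroneous vs. error-free trajectories.

    preds/labels: index of the first error step, or -1 if no error.
    (Assumed convention; the paper does not report its exact methodology.)
    """
    err = [(p, l) for p, l in zip(preds, labels) if l != -1]
    ok = [(p, l) for p, l in zip(preds, labels) if l == -1]
    acc_err = sum(p == l for p, l in err) / len(err)
    acc_ok = sum(p == l for p, l in ok) / len(ok)
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```

With this convention, a model that always predicts "no error" scores 0 on the erroneous subset and hence F1 = 0, so trivial strategies are penalized.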
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ProcessBench | F1 Score | 74.3 | 75.0 | +0.7 |
| ProcessBench | F1 Score | 73.5 | 75.0 | +1.5 |
| PRMBench | F1 Score | 65.3 | 65.5 | +0.2 |
| ProcessBench | F1 Score | 0.673 | 0.673 | 0.0 |
| ProcessBench | F1 Score | 0.640 | 0.673 | +0.033 |

ActPRM achieves state-of-the-art performance on ProcessBench while significantly reducing annotation costs compared to baselines.
Main Takeaways
- ActPRM matches full-dataset performance with only 50% of the annotations in pool-based settings, validating the efficiency of uncertainty-based filtering.
- Combining aleatoric (confidence) and epistemic (ensemble disagreement) uncertainty yields better selection than using either alone.
- Ensemble size matters: Performance of uncertainty estimation improves with more heads, converging around 32 heads.
- The method scales effectively to large datasets (1M+ samples), setting new SOTA results with a fraction of the compute used by prior leaders.
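The selection criterion described in the takeaways (combining per-sample confidence with disagreement across ensemble heads) can be sketched as follows. This is an illustrative reconstruction, not the paper's exact scoring rule: the entropy term, variance term, and their unweighted sum are assumptions, and `select_uncertain` is a hypothetical helper name.

```python
import numpy as np

def select_uncertain(head_probs, k):
    """Pick the k most uncertain samples for annotation.

    head_probs: array of shape (n_heads, n_samples), each head's
    predicted probability that a step is correct.
    """
    eps = 1e-12
    mean_p = head_probs.mean(axis=0)
    # Aleatoric proxy: entropy of the averaged prediction (low confidence).
    aleatoric = -(mean_p * np.log(mean_p + eps)
                  + (1.0 - mean_p) * np.log(1.0 - mean_p + eps))
    # Epistemic proxy: disagreement (variance) across ensemble heads.
    epistemic = head_probs.var(axis=0)
    # Combined score; the paper may weight the two terms differently.
    score = aleatoric + epistemic
    return np.argsort(-score)[:k]
```

In this sketch, a sample where all heads agree on 0.5 is flagged by the entropy term, while a sample where heads split between 0.1 and 0.9 is flagged by the variance term, so either uncertainty source alone would miss one of the two cases.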