Evaluation Setup
The paper evaluates downstream performance on math/science reasoning tasks when the PRM is used for data selection, as an RL reward signal, or for inference-time scaling.
Benchmarks:
- AIME (Math Competition Problems)
- MATH500 (Math Problem Solving)
- GPQA-Diamond (Graduate-Level Science QA)
Metrics:
- Accuracy (Pass@1)
- Best-of-N Accuracy
- Statistical methodology: Not explicitly reported in the paper
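The two metrics above can be contrasted with a minimal sketch. This is an illustrative stand-in, not the paper's code: `samples`, `reward`, and `is_correct` are hypothetical names, and the toy reward below simply prefers answers close to the gold value.

```python
def pass_at_1(samples, is_correct):
    """Pass@1: grade only the first sampled answer for each problem."""
    return sum(is_correct(cands[0]) for cands in samples) / len(samples)

def best_of_n(samples, reward, is_correct):
    """Best-of-N: rank each problem's N candidates with a PRM-style
    reward and grade only the top-scoring one."""
    picked = [max(cands, key=reward) for cands in samples]
    return sum(is_correct(a) for a in picked) / len(picked)

# Toy demo: integer answers, gold answer 10; the reward is a
# hypothetical stand-in that prefers answers close to 10.
gold = 10
samples = [[7, 10, 3], [10, 2, 5], [4, 6, 8]]
reward = lambda a: -abs(a - gold)
is_correct = lambda a: a == gold

print(pass_at_1(samples, is_correct))          # 1/3: only one first sample is correct
print(best_of_n(samples, reward, is_correct))  # 2/3: the reward recovers two correct answers
```

The gap between the two numbers is exactly what a good PRM buys at inference time: a better verifier widens Best-of-N accuracy over Pass@1 without changing the generator.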
Key Results
| Setting | Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|---|
| SFT data selection (ReasonFlux-PRM selects training data) | Average (AIME, MATH500, GPQA) | Accuracy Improvement | Not reported in the paper | Not reported in the paper | +12.1% |
| RL reward signal (ReasonFlux-PRM as reward in RL) | Average (AIME, MATH500, GPQA) | Accuracy Improvement | Not reported in the paper | Not reported in the paper | +4.5% |
| Test-Time Scaling (Best-of-N with ReasonFlux-PRM) | Average (AIME, MATH500, GPQA) | Accuracy Improvement | Not reported in the paper | Not reported in the paper | +6.3% |
Main Takeaways
- ReasonFlux-PRM-7B selects higher quality training data than the much larger Qwen2.5-Math-PRM-72B and human-curated sets, reversing the trend where PRMs typically degrade data quality for thinking trajectories.
- Existing PRMs struggle to score 'thinking trajectories' because of a structural mismatch (branching, exploratory reasoning vs. the linear solution chains they were trained on) and a lack of trajectory-level training data.
- Consistent improvements across SFT, RL, and Inference settings validate the robustness of the trajectory-aware reward formulation.
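The data-selection use case can be sketched as a simple top-fraction filter over PRM scores. This is a hedged illustration only: `step_reward`, mean aggregation, and the `keep_frac` cutoff are assumptions for the sketch, not the paper's exact formulation.

```python
def select_sft_data(trajectories, step_reward, keep_frac=0.5):
    """Rank trajectories by mean per-step reward and keep the top
    fraction. step_reward stands in for a trained PRM's step scorer."""
    mean_score = lambda traj: sum(step_reward(s) for s in traj) / len(traj)
    ranked = sorted(trajectories, key=mean_score, reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    return ranked[:k]

# Toy demo: steps are numbers and the "PRM" just returns the step value,
# so trajectories with higher-valued steps are kept for SFT.
data = [[1, 2, 3], [9, 9], [0, 0, 0], [5, 5, 5]]
print(select_sft_data(data, step_reward=lambda s: s))  # [[9, 9], [5, 5, 5]]
```

The design choice to score per step rather than per final answer is what lets a trajectory-aware PRM reject solutions that reach the right answer through flawed intermediate reasoning.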