
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, E. Barsoum
Advanced Micro Devices Inc., The Hong Kong University of Science and Technology (Guangzhou)
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reasoning, Process Reward Models (PRMs), Test-Time Scaling
Athena trains effective Process Reward Models using only ~5,000 samples by filtering automated labels through consistency checks between weak and strong completer models, drastically reducing data requirements.
Core Problem
Training Process Reward Models (PRMs) requires step-level labels. These are expensive to annotate manually and, when estimated automatically via Monte Carlo sampling, either noisy or computationally prohibitive.
Why it matters:
  • High-quality step-level feedback is crucial for complex multi-step reasoning in math and visual tasks, where Outcome Reward Models (ORMs) provide insufficient signal
  • Existing automated labeling methods (like Math-Shepherd) require hundreds of thousands of samples and massive compute to estimate labels via thousands of rollouts
  • Noisy labels from automated methods degrade reward model performance, as weak models may fail on correct steps and strong models may recover from incorrect ones
Concrete Example: A weak completer (e.g., 7B model) might fail to solve a problem even starting from a correct intermediate step, falsely labeling that step as 'incorrect'. Conversely, a strong completer (e.g., 72B) might recover from a subtle error in an intermediate step and solve the problem, falsely labeling the error as 'correct'. Standard Monte Carlo methods average these biases, creating noisy training data.
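The completer bias in this example can be sketched in a few lines of Python. This is an illustration only, not the paper's code: `mc_step_label` and `mc_step_score` are hypothetical helpers, the rollout outcomes are made-up values, and the "hard" labeling rule (a step counts as correct if any rollout continued from it reaches the right final answer) is one common convention assumed here.

```python
from typing import Sequence


def mc_step_label(rollout_correct: Sequence[bool]) -> bool:
    """Hard Monte Carlo label (assumed convention): the step is labeled
    correct if at least one rollout from it reaches the correct answer."""
    return any(rollout_correct)


def mc_step_score(rollout_correct: Sequence[bool]) -> float:
    """Soft Monte Carlo estimate: fraction of rollouts that succeed."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


# Made-up outcomes for the SAME (actually correct) intermediate step,
# continued by two different completers.
weak_rollouts = [False, False, False, False]   # weak completer never finishes the problem
strong_rollouts = [True, True, False, True]    # strong completer usually does

weak_label = mc_step_label(weak_rollouts)      # step gets labeled "incorrect"
strong_label = mc_step_label(strong_rollouts)  # step gets labeled "correct"
```

The two completers assign opposite labels to the same step, so the label reflects the completer's ability as much as the step's quality.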
Key Novelty
Consistency-Filtered Process Labeling
  • Uses two distinct models for Monte Carlo estimation: a 'weak' completer and a 'strong' completer
  • Retains only those reasoning steps where both completers agree on the outcome (correct vs. incorrect), filtering out ambiguous or bias-prone labels
  • Initializes the fine-grained Process Reward Model (PRM) from a coarse-grained Outcome Reward Model (ORM) to leverage large-scale solution-level supervision before fine-tuning on steps
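The agreement rule in the first two bullets can be sketched as follows. This is a minimal sketch under stated assumptions: `StepEstimate` and `consistency_filter` are hypothetical names, the per-completer labels are assumed to be already-computed booleans, and the example steps are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepEstimate:
    """One reasoning step with the label each completer assigned to it
    (hypothetical container, not from the paper)."""
    step_text: str
    weak_label: bool    # label estimated with the weak completer
    strong_label: bool  # label estimated with the strong completer


def consistency_filter(steps: List[StepEstimate]) -> List[Tuple[str, bool]]:
    """Keep only steps on which both completers agree; return (text, label)."""
    kept = []
    for s in steps:
        if s.weak_label == s.strong_label:
            kept.append((s.step_text, s.weak_label))
    return kept


# Invented example: the second step is dropped because the completers disagree.
steps = [
    StepEstimate("Step 1: read the side lengths from the figure", True, True),
    StepEstimate("Step 2: apply the Pythagorean theorem", False, True),
    StepEstimate("Step 3: arithmetic slip in the square root", False, False),
]
training_labels = consistency_filter(steps)
```

Discarding disagreements shrinks the dataset, which is consistent with the summary's point that a small, clean set of step labels can be enough.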
Evaluation Highlights
  • +10.2 points accuracy improvement on WeMath benchmark using Qwen2.5-VL-7B as the policy model with Athena-PRM verification
  • Achieves state-of-the-art results on VisualProcessBench with an F1 score of 83.1, outperforming the previous best open-source model (VisualPRM-8B) by 3.9 points
  • Reduces computational costs significantly: requires only 1/45th of the GPU hours for data synthesis and 1/60th for training compared to vanilla Monte Carlo estimation baselines
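The first highlight uses the PRM as a verifier at test time. One common way to do this is best-of-N selection: sample several candidate solutions from the policy model, score every step of each with the PRM, and keep the candidate with the best aggregate score. The sketch below assumes that setup; the summary does not state the exact selection procedure, so `select_best`, the `min` aggregation, and the scores are all illustrative assumptions.

```python
import math
from typing import Callable, List, Sequence


def select_best(
    step_scores: List[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = min,
) -> int:
    """Return the index of the candidate with the highest aggregated
    step score. Each candidate must have at least one step score."""
    if not step_scores:
        raise ValueError("need at least one candidate")
    return max(range(len(step_scores)), key=lambda i: aggregate(step_scores[i]))


# Made-up per-step PRM scores for three sampled solutions.
candidate_scores = [
    [0.90, 0.20, 0.80],  # one weak step in the middle
    [0.70, 0.60, 0.70],  # uniformly plausible
    [0.95, 0.90, 0.10],  # collapses at the last step
]

best_by_min = select_best(candidate_scores)               # worst-step aggregation
best_by_prod = select_best(candidate_scores, math.prod)   # product aggregation
```

With `min`, a single low-scoring step is enough to reject a candidate, which matches the intuition that one wrong step invalidates a solution; the product is a softer alternative.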
Breakthrough Assessment
7/10
Significant for its extreme data efficiency (5K vs 300K samples) and practical methodology for training PRMs without human annotation. While the architecture is standard, the data curation strategy addresses a major bottleneck in reasoning research.