← Back to Paper List

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
Fudan University, Shanghai AI Laboratory, Shanghai Jiaotong University, Tsinghua University, Nanjing University, The Chinese University of Hong Kong, SenseTime Research
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reasoning Reward Modeling Test-Time Scaling
VisualPRM enhances multimodal reasoning by using an 8B-parameter critic model trained on 400K automatically annotated reasoning steps to select the best solution paths during inference.
Core Problem
Multimodal Large Language Models (MLLMs) struggle with complex reasoning, and Test-Time Scaling (TTS) is ineffective because existing open-source models make poor critics due to a lack of process supervision data.
Why it matters:
  • Current open-source MLLMs show only marginal improvements with Best-of-N strategies because they cannot accurately estimate solution quality
  • There is a lack of benchmarks for evaluating multimodal critic models, making it hard to assess progress in error detection
  • Proprietary models outperform open-source models significantly in reasoning, creating a capability gap
Concrete Example: When an MLLM generates a multi-step math solution based on an image, it may make a subtle error in step 3. Without a specialized Process Reward Model, standard scoring methods (like self-consistency) might fail to catch this intermediate error, accepting a wrong final answer.
Key Novelty
VisualPRM (Multimodal Process Reward Model)
  • Constructs a massive dataset (VisualPRM400K) by using Monte Carlo sampling to estimate the 'expected accuracy' of 2 million reasoning steps
  • Trains a critic model to predict the correctness of each step in a multi-turn chat format, enabling fine-grained quality estimation
  • Uses this critic to guide Best-of-N inference, selecting solutions with the most valid reasoning steps rather than just checking the final answer
Architecture
Architecture Figure Figure 2
Conceptual illustration of the VisualPRM400K data sample structure and the VisualProcessBench annotation format.
Evaluation Highlights
  • +5.9 points improvement on InternVL2.5-78B average accuracy across seven multimodal reasoning benchmarks (e.g., MMMU, MathVista) using VisualPRM
  • +8.4 points improvement on InternVL2.5-8B average accuracy across the same seven benchmarks
  • +8.0 points improvement on MiniCPM-V2.6 average accuracy, outperforming Outcome Reward Models and Self-Consistency methods
Breakthrough Assessment
8/10
Addresses a critical gap in multimodal reasoning (lack of effective process reward models) with a large-scale dataset, a new benchmark, and significant quantitative gains across multiple model scales.
×