← Back to Paper List

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, Yueting Zhuang
Zhejiang University, Nanyang Technological University, National University of Singapore
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reward Models Chain-of-Thought (CoT) Reasoning
SVIP automatically creates step-by-step multimodal reward models by generating Python code to solve visual tasks, executing the code to verify logic and facts, and translating valid execution traces into training data.
Core Problem
Training reward models for multimodal LLMs is difficult because manual step-by-step annotation is expensive, and existing methods rely on coarse, one-dimensional 'final answer' outcomes that ignore intermediate reasoning errors.
Why it matters:
  • Outcome-based supervision (only checking if the final answer is right) fails to correct flawed reasoning steps that accidentally lead to the right answer
  • Current reward models are one-dimensional (score 0-1) and cannot distinguish between different error types like logic failures vs. visual misperception
  • Scalable inference-time scaling (like OpenAI o1) requires high-quality, fine-grained verifiers, which are currently lacking for multimodal tasks
Concrete Example: A model might correctly count '5 dogs' but identify them as 'Labradors' when they are 'Poodles'. A standard outcome reward model might accept '5' as correct, reinforcing the hallucinated breed. SVIP detects the attribute error by checking the visual module's output against a verifier.
Key Novelty
Stepwise Visual Program (SVIP)
  • Uses visual programming (generating Python code) as a proxy for Chain-of-Thought, allowing rigorous automated verification of intermediate steps via code execution
  • Translates code execution traces (logic checks, compilation status, function returns) into natural language CoT steps with 3D labels: Relevance, Logic, and Attribute
  • Introduces TriAtt-CoT, a multi-head attention mechanism for reward models that specifically attends to these three distinct dimensions of reasoning quality
Architecture
Architecture Figure Figure 2
The SVIP framework pipeline: Code Generation -> Code Assessment -> CoT Conversion -> Alignment.
Evaluation Highlights
  • +6.3% improvement on the SVIP-Test benchmark for Qwen2-VL-7B when tuned with SVIP-generated data compared to the baseline
  • SVIP-Reward models achieve a +5.95% average improvement on SVIP-Test compared to standard fine-tuning, demonstrating the value of the specialized reward architecture
  • Reduces hallucinations and improves reasoning on general benchmarks like MME and MMMU by providing fine-grained, step-level supervision
Breakthrough Assessment
8/10
Cleverly bypasses the manual annotation bottleneck for CoT by leveraging code execution. The mapping of code errors to specific CoT error types (logic vs. attribute) is a significant methodological advance for reward modeling.
×