
AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi
University of Notre Dame
arXiv (2025)
MM, RL, Reasoning, Factuality

📝 Paper Summary

Multimodal Reasoning, Reinforcement Learning with Verifiable Rewards (RLVR), Process Supervision
AutoRubric-R1V stabilizes multimodal reasoning training by automatically distilling consistent reasoning steps from the model's own successful trajectories into rubrics that reward correct intermediate processes.
Core Problem
Reinforcement Learning with Verifiable Rewards (RLVR) typically rewards only final-answer correctness, which encourages models to learn shortcuts, or 'spurious reasoning', in which flawed logic happens to yield the right result.
Why it matters:
  • Models trained only on outcomes often fail to generalize because they learn to 'hack' the reward rather than reason correctly
  • Existing process supervision methods rely on human annotation or proprietary teacher models (e.g., GPT-4), which are costly and bounded by the teacher's capability
  • Spurious reasoning undermines reliability, as models may generate contradictory intermediate steps that confuse users even if the final answer is correct
Concrete Example: In a geometry problem, a model might define a side length incorrectly (e.g., conflating BC with CD) but still arrive at the correct numerical answer due to canceling errors. Standard RLVR rewards this trajectory fully, reinforcing the logical error.
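The blind spot can be illustrated with a minimal Python sketch. The `Trajectory` type, the `outcome_reward` function, and the side lengths below are hypothetical and invented for illustration; they are not taken from the paper. The sketch shows only that an outcome-only reward never inspects the intermediate steps, so a faithful and a spurious trajectory with the same final answer score identically.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    steps: list[str]    # intermediate reasoning steps (free text)
    final_answer: str   # extracted final answer


def outcome_reward(traj: Trajectory, gold_answer: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0.

    Note that traj.steps is never read, which is the weakness described above.
    """
    return 1.0 if traj.final_answer.strip() == gold_answer.strip() else 0.0


# Hypothetical rectangle problem: the true sides are BC = 4 and CD = 6.
faithful = Trajectory(
    steps=["BC = 4", "CD = 6", "area = BC * CD = 24"],
    final_answer="24",
)
# The side lengths are swapped, but the product is unchanged,
# so the error cancels and the answer is still right.
spurious = Trajectory(
    steps=["BC = 6 (conflated with CD)", "CD = 4", "area = BC * CD = 24"],
    final_answer="24",
)

rewards = [outcome_reward(t, "24") for t in (faithful, spurious)]
print(rewards)  # both trajectories receive the same reward
```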
Key Novelty
Self-Aggregated Rubric Generation for Generative Rewards
  • Instead of external supervision, the method samples multiple trajectories from the model itself and filters for correct answers
  • An LLM compares these successful trajectories to identify 'reasoning checkpoints', steps that appear consistently across the majority of correct solutions, and filters out random or spurious steps
  • These distilled checkpoints form a problem-specific rubric used by a judge model to reward intermediate steps during RL training
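The three steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the LLM that compares trajectories and the judge model are both replaced by exact matching on normalized step strings, and the names `build_rubric`, `rubric_score`, `total_reward`, the `min_fraction` threshold, and the `weight` used to combine the two reward terms are all hypothetical.

```python
from collections import Counter


def normalize(step: str) -> str:
    """Crude stand-in for semantic matching: lowercase, collapse whitespace."""
    return " ".join(step.lower().split())


def build_rubric(trajectories, gold_answer, min_fraction=0.5):
    """Keep steps that occur in more than `min_fraction` of correct trajectories.

    `trajectories` is a list of (steps, final_answer) pairs sampled from the model.
    """
    correct = [steps for steps, answer in trajectories if answer == gold_answer]
    if not correct:
        return []  # no successful trajectory, so no rubric can be distilled
    counts = Counter()
    for steps in correct:
        # Count each distinct step once per trajectory, preserving order.
        counts.update(list(dict.fromkeys(normalize(s) for s in steps)))
    return [step for step, n in counts.items() if n / len(correct) > min_fraction]


def rubric_score(steps, rubric):
    """Fraction of rubric checkpoints present in a trajectory (assumed judge)."""
    if not rubric:
        return 0.0
    present = {normalize(s) for s in steps}
    return sum(c in present for c in rubric) / len(rubric)


def total_reward(steps, answer, gold_answer, rubric, weight=0.5):
    """Outcome reward plus a weighted rubric term; the weighting is an assumption."""
    outcome = 1.0 if answer == gold_answer else 0.0
    return outcome + weight * rubric_score(steps, rubric)


# Hypothetical samples for one problem whose gold answer is "24".
samples = [
    (["identify BC = 4", "identify CD = 6", "area = BC * CD"], "24"),
    (["identify BC = 4", "identify CD = 6", "double-check units", "area = BC * CD"], "24"),
    (["identify BC = 6", "identify CD = 4", "area = BC * CD"], "24"),  # spurious but correct
    (["identify BC = 4", "area = BC + CD"], "10"),  # wrong answer, filtered out
]

rubric = build_rubric(samples, "24")
faithful_reward = total_reward(*samples[0], "24", rubric)
spurious_reward = total_reward(*samples[2], "24", rubric)
print(rubric)
print(faithful_reward, spurious_reward)
```

In this toy setup the swapped side lengths appear in only one of the three correct trajectories, so they do not enter the rubric, and the spurious trajectory earns a lower total reward than the faithful one even though both final answers are correct.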
Evaluation Highlights
  • +7.52% average accuracy improvement across 6 multimodal reasoning benchmarks compared to the Qwen-2.5-VL-7B base model
  • Achieves an average score of 54.81 on reasoning benchmarks, comparable to the much larger Qwen-2.5-VL-72B model (55.57)
  • Substantially reduces reasoning inconsistency (unfaithful reasoning steps) compared to standard RLVR training in MathVerse evaluations
Breakthrough Assessment
8/10
Offers a scalable, self-contained solution to the reward hacking problem in reasoning models without requiring human annotations or stronger teacher models. The performance gains bring the model close to the roughly 10x larger Qwen-2.5-VL-72B (54.81 vs. 55.57 average).