
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
arXiv (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reward Modeling; Reinforcement Learning for MLLMs
R1-Reward employs a stabilized reinforcement learning algorithm to train multimodal reward models, treating reward scoring as a reasoning task and ensuring consistency between the model's thought process and final judgment.
Core Problem
Directly applying standard RL algorithms (PPO, Reinforce++) to reward modeling leads to training collapse, driven by numerical instability under binary rewards, and leaves the model's reasoning disconnected from its final output.
Why it matters:
  • Standard reward models fail to utilize detailed reasoning, acting as opaque 'black boxes' with scalar outputs
  • Binary rewards (0 or 1) produce low-variance batches; advantage normalization then divides by a small standard deviation, yielding extreme values (e.g., -15.96) that destabilize training
  • Without supervision of the reasoning itself, models learn to output the correct score without coherent reasoning, a form of 'reward hacking' in which the result is right but the logic is wrong
Concrete Example: In a training batch of 256 samples with 255 correct predictions (reward 1) and 1 incorrect prediction (reward 0), standard advantage normalization, which subtracts the batch mean and divides by the batch standard deviation, turns the single 0 reward into a large negative advantage (e.g., -15.96). This outlier drives extreme gradient updates that can collapse training, a failure mode common in PPO/Reinforce++.
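The arithmetic behind this example can be checked with a few lines of Python. This is a minimal sketch, assuming the batch is standardized as (r - mean) / (std + eps) with the population standard deviation; the function name `normalize_advantages` and the `eps` value are illustrative and not taken from the paper.

```python
import math
import statistics

def normalize_advantages(rewards, eps=1e-8):
    """Standardize rewards within a batch: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# 255 correct predictions (reward 1) and one incorrect prediction (reward 0).
rewards = [1.0] * 255 + [0.0]
advantages = normalize_advantages(rewards)

outlier = advantages[-1]   # advantage of the single incorrect sample
typical = advantages[0]    # advantage of any correct sample

print(f"outlier advantage: {outlier:.2f}")
print(f"typical advantage: {typical:.4f}")
print(f"magnitude ratio:   {abs(outlier) / typical:.0f}x")
```

Under these assumptions the outlier equals -sqrt(255), roughly -15.97, in line with the approximate figure quoted above; using the sample standard deviation instead gives -255/16, roughly -15.94. Either way, the single incorrect sample carries an advantage about 255 times larger in magnitude than any correct sample.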
Key Novelty
StableReinforce Algorithm for Reasoning-Based Reward Modeling
  • Reformulates reward modeling as a rule-based RL task where the model generates a reasoning chain before outputting a preference, enabling long-term reasoning capabilities
  • Introduces 'StableReinforce', which modifies the clipping and normalization mechanisms of PPO/Reinforce++ to handle the numerical instabilities inherent in binary reward distributions
  • Uses an MLLM 'referee' during training to enforce consistency, penalizing the model if its generated reasoning argues for one answer but its final token selects the other
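The three ideas above can be illustrated with a short sketch. It is not the paper's implementation: the function names, the clamp threshold `max_log_ratio`, the outlier cutoff `max_abs`, and the 0.5 partial credit for inconsistent reasoning are all hypothetical choices made for illustration. The sketch shows one plausible way to bound the probability ratio before exponentiation, one way to suppress outlier advantages after normalization, and one way a referee's consistency verdict could gate a binary reward.

```python
import math
import statistics

def stable_ratio(logp_new, logp_old, max_log_ratio=3.0):
    """Clamp the log-probability ratio before exponentiating so exp() cannot overflow.

    max_log_ratio is an illustrative threshold, not a value from the paper.
    """
    log_ratio = max(-max_log_ratio, min(max_log_ratio, logp_new - logp_old))
    return math.exp(log_ratio)

def filtered_advantages(rewards, max_abs=3.0, eps=1e-8):
    """Standardize rewards, then zero out advantages whose magnitude exceeds max_abs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    advantages = [(r - mean) / (std + eps) for r in rewards]
    return [a if abs(a) <= max_abs else 0.0 for a in advantages]

def gated_reward(predicted, label, referee_says_consistent):
    """Binary result reward, reduced when the referee finds reasoning and answer disagree.

    The 0.5 partial credit is a hypothetical penalty chosen for illustration.
    """
    if predicted != label:
        return 0.0
    return 1.0 if referee_says_consistent else 0.5

# Usage: the 255-vs-1 batch no longer produces an extreme advantage.
rewards = [1.0] * 255 + [0.0]
advs = filtered_advantages(rewards)
print(max(abs(a) for a in advs))

# A log-ratio of 999 would overflow math.exp; the clamped version stays finite.
print(stable_ratio(-1.0, -1000.0))

# Correct answer with contradictory reasoning earns only partial credit.
print(gated_reward("A", "A", referee_says_consistent=False))
```

Zeroing an outlier advantage discards that sample's learning signal for the step, so a filter like this trades some information for stability; a real implementation would need to tune that trade-off.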
Evaluation Highlights
  • +13.5% on VL Reward-Bench over prior state-of-the-art models when R1-Reward uses inference-time scaling
  • +14.6% on Multimodal Reward Bench over the prior state of the art
  • +8.4% on VL Reward-Bench without inference-time scaling
Breakthrough Assessment
8/10
A significant methodological improvement for training reward models with RL that addresses core stability issues of PPO/Reinforce++ in this domain, backed by large empirical gains (>10%) on standard benchmarks.