
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

Xiaokun Wang, Peiyu Wang, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
Skywork AI, Kunlun Inc.
arXiv.org (2025)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reward Modeling; Vision-Language Alignment
Skywork-VL Reward is a multimodal reward model trained on a large-scale curated dataset including reasoning traces, achieving state-of-the-art performance in evaluating both standard and reasoning-heavy vision-language outputs.
Core Problem
Existing multimodal reward models lack generalizability across diverse tasks and fail to effectively evaluate advanced reasoners that produce complex inference steps.
Why it matters:
  • Aligning Vision-Language Models (VLMs) with human preference is crucial for safety and utility but remains challenging.
  • Current reward models struggle with the complex reasoning outputs from newer 'system 2' style VLMs.
  • High-quality, diverse preference data for multimodal reasoning is scarce, limiting the effectiveness of alignment training.
Concrete Example: When a VLM generates a complex step-by-step solution to a physics problem based on an image, standard reward models might fail to distinguish a subtle reasoning error from a correct derivation, whereas Skywork-VL Reward uses specific reasoning preference data to catch this.
Key Novelty
Dual-Source Data Curation & Two-Stage Training
  • Constructs a preference dataset of 190k pairs by integrating standard VLM outputs with advanced reasoning traces generated by models such as Skywork R1V and DeepSeek R1.
  • Employs a two-stage training strategy: first fine-tuning on multimodal data for vision-language alignment, then incorporating pure-text data to boost general reasoning and prevent catastrophic forgetting of text capabilities.
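The summary does not state the training objective. Reward models are commonly trained with a pairwise Bradley-Terry loss, so the following is a minimal sketch under that assumption, not the authors' implementation. The function names `bradley_terry_loss` and `batch_loss` are hypothetical. Under this reading, both training stages could share the same loss and differ only in the data mix (multimodal pairs first, then pure-text pairs added).

```python
import math
from typing import Iterable, Tuple


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(chosen_reward - rejected_reward)), computed in a
    numerically stable way for large positive or negative margins.
    """
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        # log(1 + exp(-margin)); exp(-margin) <= 1, so no overflow
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten as -margin + log(1 + exp(margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(reward_pairs: Iterable[Tuple[float, float]]) -> float:
    """Mean pairwise loss over (chosen_reward, rejected_reward) tuples."""
    losses = [bradley_terry_loss(c, r) for c, r in reward_pairs]
    if not losses:
        raise ValueError("reward_pairs must not be empty")
    return sum(losses) / len(losses)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and grows roughly linearly when the ordering is reversed.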
Architecture
Figure 1 (implied): Conceptual architecture of Skywork-VL Reward, based on Qwen2.5-VL.
Evaluation Highlights
  • Achieves state-of-the-art accuracy on VL-RewardBench among open-source models.
  • Preference data generated by this model improves downstream VLM reasoning significantly when used in Mixed Preference Optimization (MPO) training.
  • Maintains competitive performance on the text-only RewardBench, unlike many multimodal models that degrade on pure text tasks.
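Two of the highlights above rest on simple operations over reward scores: benchmark accuracy on preference pairs, and selecting preference pairs from sampled candidates for downstream preference optimization. The sketch below illustrates both under stated assumptions. It assumes accuracy is measured as the fraction of pairs where the preferred response scores higher, and that a training pair is formed from the highest- and lowest-scored candidates. The summary specifies neither detail. `toy_score` is a hypothetical stand-in for the reward model, and all function names are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer maps (prompt, response) to a scalar reward.
Scorer = Callable[[str, str], float]


def pairwise_accuracy(
    pairs: Sequence[Tuple[str, str, str]], score: Scorer
) -> float:
    """Fraction of (prompt, chosen, rejected) triples ranked correctly."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(
        1
        for prompt, chosen, rejected in pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(pairs)


def build_preference_pair(
    prompt: str, candidates: List[str], score: Scorer
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scored candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(candidates, key=lambda c: score(prompt, c))
    return ranked[-1], ranked[0]


def toy_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return float(len(response))


if __name__ == "__main__":
    eval_pairs = [("q", "long answer", "no"), ("q", "a", "bbbb")]
    print(pairwise_accuracy(eval_pairs, toy_score))
    print(build_preference_pair("q", ["aa", "a", "aaaa"], toy_score))
```

In practice the scorer would wrap a forward pass of the reward model over the image, prompt, and response; the length-based scorer here only keeps the example self-contained.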
Breakthrough Assessment
8/10
Strong performance on benchmarks and a robust data curation pipeline for reasoning tasks make it a significant contribution to open-source multimodal alignment tools.