Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning

📝 Paper Summary

Multimodal Reward Modeling, Vision-Language Alignment

Skywork-VL Reward is a multimodal reward model trained on a large-scale curated dataset including reasoning traces, achieving state-of-the-art performance in evaluating both standard and reasoning-heavy vision-language outputs.

Core Problem

Existing multimodal reward models lack generalizability across diverse tasks and fail to effectively evaluate advanced reasoners that produce complex inference steps.

Why it matters:

Aligning Vision-Language Models (VLMs) with human preference is crucial for safety and utility but remains challenging.
Current reward models struggle with the complex reasoning outputs from newer 'system 2' style VLMs.
High-quality, diverse preference data for multimodal reasoning is scarce, limiting the effectiveness of alignment training.

Concrete Example: When a VLM generates a complex step-by-step solution to a physics problem based on an image, standard reward models might fail to distinguish a subtle reasoning error from a correct derivation, whereas Skywork-VL Reward uses specific reasoning preference data to catch this.

Key Novelty

Dual-Source Data Curation & Two-Stage Training

Constructs a massive preference dataset (190k pairs) by integrating standard VLM outputs with advanced reasoning traces generated by models like Skywork R1V and Deepseek R1.
Employs a two-stage training strategy: first fine-tuning on multimodal data for vision-language alignment, then incorporating pure-text data to boost general reasoning and prevent catastrophic forgetting of text capabilities.

Architecture

Conceptual architecture of Skywork-VL Reward based on Qwen2.5-VL

Evaluation Highlights

Achieves state-of-the-art accuracy on VL-RewardBench among open-source models.
Preference data generated by this model improves downstream VLM reasoning significantly when used in Mixed Preference Optimization (MPO) training.
Maintains competitive performance on the text-only RewardBench, unlike many multimodal models that degrade on pure text tasks.

Breakthrough Assessment

8/10

Strong performance on benchmarks and a robust data curation pipeline for reasoning tasks make it a significant contribution to open-source multimodal alignment tools.

⚙️ Technical Details

Problem Definition

Setting: Pairwise preference ranking. Given an input x (image + text) and two responses (y_w, y_l), where y_w is the chosen response and y_l the rejected one, predict scalar rewards r(x, y) such that r(x, y_w) > r(x, y_l).

Inputs: Multimodal prompt x (optional image + text instruction) and candidate response y.

Outputs: Scalar reward score r(x, y).

Pipeline Flow

Visual Encoder (ViT) processes image
Projector maps visual tokens to text space
LLM Backbone processes text + visual tokens
Reward Head predicts scalar score

System Modules

Visual Encoder (Input Processing)

Encodes input images into patch features

Model or implementation: Vision Transformer (ViT) from Qwen2.5-VL-7B-Instruct

Projector (Input Processing)

Projects visual features into the language model's embedding space

Model or implementation: Adapter module

LLM Backbone

Processes multimodal context and response

Model or implementation: Qwen2.5-VL-7B-Instruct decoder

Reward Head

Maps the final hidden state to a scalar reward

Model or implementation: Fully-connected linear layer

Novel Architectural Elements

Replacement of causal LM head with a scalar reward head on top of Qwen2.5-VL architecture
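The reward head described above can be pictured with a minimal sketch. It assumes the scalar is read from the hidden state at the last token position, which this summary does not spell out, and it uses toy dimensions and hypothetical names (`reward_head`, `score_sequence`) instead of the actual Qwen2.5-VL modules.

```python
from typing import List


def reward_head(last_hidden: List[float], weight: List[float], bias: float = 0.0) -> float:
    """Linear map from one hidden-state vector to a scalar reward."""
    if len(last_hidden) != len(weight):
        raise ValueError("hidden state and weight must have the same length")
    return sum(h * w for h, w in zip(last_hidden, weight)) + bias


def score_sequence(hidden_states: List[List[float]], weight: List[float], bias: float = 0.0) -> float:
    """Score a response from the hidden state at the final token position.

    Assumption: the last position summarizes prompt + response; the summary
    only says "final hidden state".
    """
    return reward_head(hidden_states[-1], weight, bias)


# Toy example: 3 token positions, hidden size 4 (a real backbone is far wider).
hidden_states = [
    [0.1, 0.0, -0.2, 0.3],
    [0.4, 0.1, 0.0, -0.1],
    [1.0, 2.0, 0.5, -1.0],
]
weight = [0.5, 0.25, 2.0, 1.0]
reward = score_sequence(hidden_states, weight, bias=0.1)
```

Because the head outputs one number per (prompt, response) pair, two candidate responses to the same prompt can be compared directly by their scores.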

Modeling

Base Model: Qwen2.5-VL-7B-Instruct

Training Method: Supervised fine-tuning with Pairwise Ranking Loss

Objective Functions:

Purpose: Maximize the score difference between chosen and rejected responses.

Formally: L = -log(sigmoid(r(x, y_w) - r(x, y_l)))
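As an illustration of the objective above, the following sketch evaluates the pairwise ranking loss for a single pair of scalar rewards. The function name is hypothetical, and the numerically stable softplus form is an implementation choice for the sketch, not something stated in the source.

```python
import math


def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Compute -log(sigmoid(r_chosen - r_rejected)) for one preference pair."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response is scored further above the rejected one.
loss_tied = pairwise_ranking_loss(0.0, 0.0)     # equal scores
loss_correct = pairwise_ranking_loss(2.0, 0.0)  # chosen ranked higher
loss_wrong = pairwise_ranking_loss(0.0, 2.0)    # chosen ranked lower
```

With equal scores the loss is log 2, and it grows roughly linearly with the margin when the rejected response is scored higher, which is what pushes r(x, y_w) above r(x, y_l) during training.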

Training Data:

Total: ~190k pairs (70% multimodal)
Sources: LLaVA-Critic-113k, Skywork-Reward-Preference-80K-v0.2, RLAIF-V-Dataset
In-house Reasoning: ~50k pairs (Math, Physics, Bio, Chem) via Skywork R1V and InternVL+Deepseek pipeline

Key Hyperparameters:

learning_rate_stage_1: 1e-5
learning_rate_stage_2: 1e-6
epochs_per_stage: 2
optimizer: AdamW
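One way to write down the two-stage schedule is as a small config object. This is a sketch only: the class and field names are hypothetical, and the values are limited to what this summary reports (learning rates, epochs per stage, AdamW, a frozen visual encoder, and multimodal data first with text added in the second stage).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage."""
    name: str
    data_mix: Tuple[str, ...]
    learning_rate: float
    epochs: int
    optimizer: str = "AdamW"
    freeze_vision_encoder: bool = True  # listed under Limitations in this summary


STAGES = (
    # Stage 1: multimodal preference data only.
    StageConfig(name="stage_1", data_mix=("multimodal",), learning_rate=1e-5, epochs=2),
    # Stage 2: pure-text preference data added to the multimodal data, lower learning rate.
    StageConfig(name="stage_2", data_mix=("multimodal", "text"), learning_rate=1e-6, epochs=2),
)

total_epochs = sum(stage.epochs for stage in STAGES)
```

Batch size, warmup, and any other settings are not given here, so the sketch leaves them out instead of guessing.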

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLaVA-RLHF: Skywork-VL Reward uses a much larger, curated dataset including advanced reasoning traces [not cited in paper]
vs. IXC-2.5-Reward: Skywork-VL Reward explicitly incorporates in-house reasoning data and a two-stage training (multimodal then text+multimodal) to preserve text capability

Limitations

Examples with equal or near-equal preferences are excluded, potentially limiting fine-grained discrimination.
Visual encoder is frozen, potentially limiting adaptation to entirely new visual domains.
Dependence on GPT-4o for data filtering and regeneration implies reliance on proprietary model quality.

Reproducibility

Code: https://huggingface.co/Skywork/Skywork-VL-Reward-7B

Model weights released on HuggingFace. Training data composition detailed (sources and filtering/regeneration process described). Specific training compute/time not reported.

📊 Experiments & Results

Evaluation Setup

Evaluation on standard multimodal reward benchmarks and text-only reward benchmarks.

Benchmarks:

VL-RewardBench: multimodal reward modeling (General, Hallucination, Reasoning)
RewardBench: text-only reward modeling (Chat, Safety, Reasoning)

Metrics:

Overall Accuracy
Average Score
Statistical methodology: Not explicitly reported in the paper
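Accuracy on pairwise reward benchmarks is typically the fraction of pairs in which the chosen response receives the higher score. The sketch below illustrates that idea with made-up scores. Treating ties as errors is an assumption, and the benchmarks' official scoring scripts may differ in detail.

```python
from typing import Iterable, Tuple


def pairwise_accuracy(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    Assumption: a tie counts as incorrect.
    """
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Illustrative scores only, not taken from the paper.
toy_scores = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
accuracy = pairwise_accuracy(toy_scores)  # 2 of 4 pairs ranked correctly
```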

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
VL-RewardBench	Overall Accuracy	83.4	85.8	+2.4
VL-RewardBench	Reasoning Accuracy	73.2	83.6	+10.4
RewardBench	Overall Score	70.5	80.6	+10.1
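The Δ column is the difference between the "This Paper" and "Baseline" columns. A short check using only the values in the table above:

```python
# Rows copied from the Key Results table: (benchmark, metric, baseline, this_paper).
rows = [
    ("VL-RewardBench", "Overall Accuracy", 83.4, 85.8),
    ("VL-RewardBench", "Reasoning Accuracy", 73.2, 83.6),
    ("RewardBench", "Overall Score", 70.5, 80.6),
]

# Round to one decimal place to match the precision reported in the table.
deltas = {
    (benchmark, metric): round(this_paper - baseline, 1)
    for benchmark, metric, baseline, this_paper in rows
}
```

The computed differences match the reported +2.4, +10.4, and +10.1.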

Experiment Figures

Distribution of the 150,000 retained data samples after Stage 2 filtering.

Main Takeaways

Skywork-VL Reward achieves state-of-the-art performance on VL-RewardBench, particularly excelling in reasoning tasks.
The model maintains strong text-only performance on RewardBench, outperforming its base model and other multimodal RMs.
Inclusion of reasoning-specific preference data is highly effective for improving reasoning evaluation metrics.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Reward Modeling (RM)
Reinforcement Learning from Human Feedback (RLHF)

Key Terms

VLM: Vision-Language Model, an AI model that takes images and text as input and typically generates text in response.

RM: Reward Model—a model trained to predict human preference scores for generated outputs.

MPO: Mixed Preference Optimization, a training method that combines a preference loss, a quality loss, and a generation loss instead of relying on a single preference objective.

RLHF: Reinforcement Learning from Human Feedback—a technique to align AI models with human values using reward signals.

DPO: Direct Preference Optimization—an algorithm for aligning language models to preferences without an explicit reward model.

ViT: Vision Transformer—a model architecture that applies the Transformer mechanism directly to sequences of image patches.

Qwen2.5-VL: A specific open-source Vision-Language Model developed by Alibaba Cloud.

InternVL: A series of open-source Vision-Language Models.

Deepseek R1: A large language model known for strong reasoning capabilities.