RewardDance: Reward Scaling in Visual Generation

📝 Paper Summary

Visual Generation · Reward Modeling · Reinforcement Learning from Human Feedback (RLHF)

RewardDance reformulates visual reward modeling as a scalable generative task—predicting 'yes' tokens for preferred images—enabling effective scaling of model size and context to improve diffusion model alignment.

Core Problem

Existing visual reward models suffer from paradigm mismatch: regression heads on Vision-Language Models (VLMs) misalign with next-token prediction, while CLIP-based models struggle to scale.

Why it matters:

Regression-based reward models are highly susceptible to 'reward hacking', where generators exploit flaws in the reward signal without improving true quality.
Current approaches fail to leverage the full reasoning capabilities of large VLMs because they reduce complex preference judgments to a single scalar output.
Lack of scalability prevents visual generation from benefiting from the 'scaling laws' that have driven progress in LLMs.

Concrete Example: In Figure 2, a regression-based reward model's score increases during RLHF training, but the actual human preference win-rate collapses, showing the model is 'hacking' the reward metric rather than generating better images.

Key Novelty

Generative Reward Modeling with Dual Scaling

Redefines reward calculation as a generative next-token prediction task (predicting the probability of a 'yes' token given a comparison prompt), aligning natively with VLM architectures.
Scales reward modeling across two dimensions: Model Scaling (systematically increasing parameters from 1B to 26B) and Context Scaling (incorporating task instructions, reference images, and Chain-of-Thought reasoning).

Architecture

Comparison of traditional Pointwise Regressive Reward Models vs. the proposed Generative RewardDance framework.

Evaluation Highlights

Significant improvements in generation quality across text-to-image, text-to-video, and image-to-video tasks compared to state-of-the-art baselines.
Scaling the Reward Model to 26B parameters drastically reduces reward hacking, maintaining high reward variance and correlation with human preference even at high RL training steps.
Integration of Chain-of-Thought (CoT) reasoning data enables the reward model to provide interpretable feedback, further boosting accuracy over simple preference pairs.

Breakthrough Assessment

9/10

Establishes a new scaling law for visual reward models, showing that generative RMs scale effectively up to 26B parameters and substantially reduce the reward hacking that plagues RLHF in vision.

⚙️ Technical Details

Problem Definition

Setting: Visual Reward Modeling and Preference Alignment for Diffusion Models

Inputs: Prompt y, Image 1 x1, Image 2 x2 (for pairwise), Task Instruction i, CoT reasoning (optional)

Outputs: Reward score derived from the probability of generating the 'yes' token P(yes|...)

Pipeline Flow

Input: Prompt + Image Pair (Candidate vs Reference) + Instruction
RewardDance (VLM) processes inputs
Generates 'yes'/'no' token probability
Extracts Reward Score P(yes)
RL Optimization (ReFL) updates Diffusion Model
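The "Extracts Reward Score P(yes)" step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the VLM exposes next-token logits for the 'yes' and 'no' tokens at the judgment position, and that the reward is obtained by renormalizing over just those two tokens. Both the function name `yes_probability` and the two-token renormalization are assumptions made for this example.

```python
import math


def yes_probability(yes_logit: float, no_logit: float) -> float:
    """Return P('yes') from the two judgment-token logits.

    Assumption: the reward is a softmax restricted to the 'yes' and 'no'
    tokens. The paper may instead use the full-vocabulary probability.
    """
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


# Hypothetical logits for a comparison where the candidate looks better.
reward = yes_probability(yes_logit=2.0, no_logit=0.0)
```

Because the score is a probability in [0, 1], it can be passed to the RL stage as a bounded reward without further scaling, though how the paper post-processes it is not stated here.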

System Modules

RewardDance Model

Evaluate visual quality and alignment by comparing two images or scoring one image via instruction-following

Model or implementation: InternVL variants (1B to 26B parameters)

Diffusion Policy

Generate images/videos based on text prompts

Model or implementation: Stable Diffusion v1.5 / SDXL / PixelDance / Wan2.1

Novel Architectural Elements

Generative Reward Head: Replaces standard regression head with next-token prediction of 'yes/no' tokens to align with VLM pre-training.
Context-Aware Input Structure: Natively integrates reference images and Chain-of-Thought reasoning into the reward scoring forward pass.

Modeling

Base Model: InternVL (scaled from 1B, 2B, 8B, to 26B)

Training Method: Generative Reward Modeling (Next-Token Prediction on Preference Data)

Objective Functions:

Purpose: Maximize likelihood of 'yes' token for preferred image.

Formally: Standard Cross-Entropy on 'yes'/'no' tokens, equivalent to optimizing pairwise preference probability.

Purpose: Incorporate reasoning into training.

Formally: Autoregressive modeling of CoT rationale tokens before/after the judgment token.
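One way to write the first objective, using the input symbols from the Problem Definition (prompt y, images x1 and x2, instruction i), is sketched below. The label z (1 if x1 is preferred, 0 otherwise) and the parameter symbol θ are notation introduced here for illustration; the paper's exact formulation may differ.

```latex
\mathcal{L}_{\text{pref}}(\theta)
  = -\,\mathbb{E}_{(y,\,x_1,\,x_2,\,i,\,z)}
    \Big[\, z \,\log P_\theta(\text{yes} \mid x_1, x_2, y, i)
        + (1 - z)\,\log P_\theta(\text{no} \mid x_1, x_2, y, i) \Big]
```

If P_θ(yes | ...) is read as the probability that x1 is preferred over x2, minimizing this cross-entropy is the same as maximizing the likelihood of the observed pairwise preferences, which is the sense in which the two views coincide.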

Training Data:

Preference pairs with 'yes'/'no' labels
Task-aware instructions
CoT reasoning traces distilled from SEED-VL 1.5

Key Hyperparameters:

model_sizes: 1B, 2B, 8B, 26B

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. ImageReward/HPSv2: Uses Generative VLM backbone instead of CLIP regression; scales to 26B params vs ~300M-1B.
vs. InternVL-Reward: Uses generative 'yes/no' token probability instead of regression head; incorporates CoT and reference images.
vs. UnifiedReward [not cited in paper]: Both use generative paradigms, but RewardDance systematically scales model size to 26B and uses the 'yes' token probability specifically as the reward for diffusion alignment.

Limitations

Computational cost of inference for the 26B parameter reward model is high compared to CLIP-based models.
Requires high-quality CoT data for training, which must be distilled from larger teacher models.
Pairwise reward calculation requires Best-of-N sampling or reference images, which adds complexity compared to absolute scoring.

Reproducibility

Code: https://github.com/Bytedance/RewardDance

Code is publicly available at https://github.com/Bytedance/RewardDance. The paper specifies using InternVL as the backbone and SEED-VL 1.5 for distilling CoT data. Specific hyperparameters for ReFL (learning rates, batch sizes) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluation on Text-to-Image (T2I), Text-to-Video (T2V), and Image-to-Video (I2V) generation tasks.

Benchmarks:

ImageRewardDB (Preference Prediction)
HPSv2 (Preference Prediction / Generation Quality)
VBench (Video Generation Quality)
Pick-a-Pic (Preference Prediction)

Metrics:

Accuracy (Preference Prediction)
Win Rate vs. Base Model (Generation Quality)
Reward Score (Proxy for Quality)

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Preference prediction accuracy on standard benchmarks shows RewardDance scaling behavior.
ImageRewardDB	Accuracy	65.1	78.2	+13.1
HPSv2	Accuracy	75.7	86.3	+10.6
Generation quality improvements after RLHF alignment using RewardDance.
Internal User Study (T2I)	Win Rate vs SDXL Base	50.0	68.4	+18.4
VBench (T2V)	Total Score	80.45	82.12	+1.67

Experiment Figures

Win rate of generated images vs. Reference as a function of Reward Model size (Parameters).

Reward Hacking Analysis: Comparing Reward Score vs. True Win Rate over RL training steps.

Main Takeaways

Scaling Law for RMs: Larger reward models (up to 26B) consistently yield better preference accuracy and generation quality.
Resistance to Hacking: Unlike regression-based models (ImageReward, HPSv2), RewardDance maintains high correlation with human preference even after extensive RL optimization steps.
Context Matters: Adding CoT reasoning and reference images significantly boosts reward model accuracy compared to using images and prompts alone.
Unified Framework: Effectively handles image and video modalities within a single generative architecture.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Text-to-Image/Video)
Reinforcement Learning from Human Feedback (RLHF)
Vision-Language Models (VLMs)
Bradley-Terry Model

Key Terms

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using a reward model trained on human preferences.

VLM: Vision-Language Model—a model that processes both images and text to generate text or embeddings.

Reward Hacking: A phenomenon where a generative model learns to maximize the reward score without improving the underlying quality or human preference, typically by exploiting flaws in the reward model.

CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer.

Bradley-Terry Model: A statistical model for estimating the probability that one item is preferred over another in a pairwise comparison.

BoN: Best-of-N—a sampling strategy where N candidates are generated, and the one with the highest reward score is selected.

ReFL: Reward Feedback Learning, an algorithm that optimizes diffusion models using gradients from a frozen reward model without computing log-likelihoods.

ODE sampling: Ordinary Differential Equation sampling—a deterministic method for generating samples from diffusion models by solving the probability flow ODE.

Search over Paths: An inference-time scaling technique that prunes generation trajectories during sampling based on intermediate reward feedback.
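The Best-of-N (BoN) strategy defined above can be sketched in a few lines. This is a generic illustration rather than the paper's code: `best_of_n` and `reward_fn` are hypothetical names, and `reward_fn` stands in for any scorer that maps a candidate to a scalar, such as a reward model's P(yes).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward score.

    Ties are resolved in favor of the earliest candidate, which is the
    behavior of Python's built-in max().
    """
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=reward_fn)


# Toy usage: strings stand in for generated images, length for the reward.
chosen = best_of_n(["a", "bbb", "cc"], reward_fn=len)
```

In practice each call to `reward_fn` is a forward pass of the reward model, so the cost of selection grows linearly with N.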