
DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo
ByteDance Seed, The University of Hong Kong
arXiv.org (2025)
MM RL Benchmark

📝 Paper Summary

Visual Generation Reinforcement Learning from Human Feedback (RLHF)
DanceGRPO adapts Group Relative Policy Optimization to visual generation tasks by reformulating diffusion and flow matching sampling as Stochastic Differential Equations, achieving stable training across image and video domains.
Core Problem
Existing RL-based fine-tuning methods for visual generation, such as DDPO and DPOK, become unstable when scaled to large prompt sets. They also struggle with rectified flow models, whose deterministic sampling provides none of the stochastic exploration that policy-gradient methods need.
Why it matters:
  • Aligning generative models with human preferences is critical for aesthetic quality and safety, but current methods either require differentiable rewards (ReFL) or offer only marginal gains (DPO variants)
  • Video generation poses severe VRAM constraints for differentiable reward methods
  • Prior policy gradient methods fail to scale beyond small datasets (<100 prompts) due to optimization instability
Concrete Example: When training video generation models like HunyuanVideo, using different initialization noise for the same prompt causes 'reward hacking' where the model exploits noise rather than learning alignment. Additionally, standard methods produce unnatural 'oily' artifacts when optimizing against single aesthetic rewards.
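The role of shared initialization noise can be illustrated with a toy sketch. It is not the paper's sampler: `toy_drift`, `sigma`, and `sample_group` are hypothetical names, and the drift is a placeholder standing in for a learned denoiser or velocity network. The sketch only shows that when every sample in a group starts from the same initial noise, the differences between samples come from the stochastic term of the SDE rather than from the starting point.

```python
import numpy as np

def sample_group(x_init, drift, sigma, n_steps, group_size, rng):
    """Euler-Maruyama rollouts that all start from the same initial noise."""
    dt = 1.0 / n_steps
    # Every member of the group begins at the identical point x_init.
    x = np.repeat(x_init[None, :], group_size, axis=0)
    for k in range(n_steps):
        t = k * dt
        # Fresh Gaussian noise per member and per step drives exploration.
        noise = rng.standard_normal(x.shape)
        x = x + drift(x, t) * dt + sigma * np.sqrt(dt) * noise
    return x

def toy_drift(x, t):
    # Placeholder drift (assumption): pulls samples toward the origin.
    return -x

rng = np.random.default_rng(0)
x_init = rng.standard_normal(8)  # one shared initial noise vector per prompt
group = sample_group(x_init, toy_drift, sigma=0.5, n_steps=20,
                     group_size=4, rng=rng)
```

With `sigma=0.0` the rollouts collapse to a single deterministic trajectory, which is why a purely deterministic sampler leaves a group with nothing to compare.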
Key Novelty
DanceGRPO (Group Relative Policy Optimization for Visual Generation)
  • Reformulates both diffusion and rectified flow sampling as Stochastic Differential Equations (SDEs) to enable the stochastic exploration required for GRPO
  • Applies GRPO's group-based advantage estimation to visual tasks, using relative performance within a group of samples sharing the same prompt and initialization noise to stabilize training
  • Identifies critical stability factors for visual RL: shared initialization noise per prompt, specific timestep selection, and aggregating advantages rather than raw rewards from multiple models
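The group-based advantage and the "aggregate advantages, not raw rewards" point above can be sketched as follows. This is a minimal illustration that assumes the common GRPO form (reward minus group mean, divided by group standard deviation). The function names and reward values are hypothetical, and the summary does not specify how the paper weights the individual reward models.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (same prompt, same initial noise)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def aggregate_advantages(rewards_per_model, eps=1e-8):
    """Normalize each reward model's scores separately, then sum the advantages."""
    return sum(group_advantages(r, eps) for r in rewards_per_model)

# Hypothetical scores for a group of 4 samples from two reward models
# that live on very different numeric scales.
aesthetic = [0.20, 0.25, 0.30, 0.25]
clip_like = [20.0, 35.0, 25.0, 40.0]

adv = aggregate_advantages([aesthetic, clip_like])
raw_sum = np.asarray(aesthetic) + np.asarray(clip_like)
```

In this toy data the raw sum is dominated by the larger-scale reward and ranks sample 1 above sample 2. After per-model normalization, sample 2 (the best aesthetic score) ranks above sample 1, because each reward model contributes on a comparable scale.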
Evaluation Highlights
  • Outperforms baselines by up to 181% on VideoAlign motion quality benchmarks for text-to-video generation
  • Raises the HPS-v2.1 score of Stable Diffusion v1.4 from 0.239 to 0.365 over the base model, a gain of 0.126 points (about 53% relative)
  • Successfully scales to large-scale datasets (>10,000 prompts) where prior methods like DDPO and DPOK fail to converge stably
Breakthrough Assessment
9/10
First successful application of GRPO to visual generation that unifies diffusion and flow matching. Addresses major stability issues in RLHF for video, with reported gains of up to 181% over existing baselines on VideoAlign motion quality.