DanceGRPO: Unleashing GRPO on Visual Generation

📝 Paper Summary

Visual Generation Reinforcement Learning from Human Feedback (RLHF)

DanceGRPO adapts Group Relative Policy Optimization to visual generation tasks by reformulating diffusion and flow matching sampling as Stochastic Differential Equations, achieving stable training across image and video domains.

Core Problem

Existing RL-based fine-tuning methods for visual generation (like DDPO and DPOK) are unstable when scaling to large prompt sets and struggle with the deterministic sampling of rectified flow models.

Why it matters:

Aligning generative models with human preferences is critical for aesthetic quality and safety, but current methods either require differentiable rewards (ReFL) or offer only marginal gains (DPO variants)
Video generation poses severe VRAM constraints for differentiable reward methods
Prior policy gradient methods fail to scale beyond small datasets (<100 prompts) due to optimization instability

Concrete Example: When training video generation models like HunyuanVideo, using different initialization noise for the same prompt causes 'reward hacking' where the model exploits noise rather than learning alignment. Additionally, standard methods produce unnatural 'oily' artifacts when optimizing against single aesthetic rewards.

Key Novelty

DanceGRPO (Group Relative Policy Optimization for Visual Generation)

Reformulates both diffusion and rectified flow sampling as Stochastic Differential Equations (SDEs) to enable the stochastic exploration required for GRPO
Applies GRPO's group-based advantage estimation to visual tasks, using relative performance within a group of samples sharing the same prompt and initialization noise to stabilize training
Identifies critical stability factors for visual RL: shared initialization noise per prompt, specific timestep selection, and aggregating advantages rather than raw rewards from multiple models
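To make the first point concrete: a deterministic ODE step has no transition density, whereas a stochastic step has a Gaussian one whose log-probability can feed a policy ratio. The sketch below is a generic Euler-Maruyama step on a scalar state, not the paper's exact formulation; `toy_drift`, `sigma`, and the step size are illustrative assumptions.

```python
import math
import random

def toy_drift(x, t):
    # Hypothetical stand-in for the model-predicted drift term.
    return -x

def sde_step(x, t, dt, sigma, rng):
    """One Euler-Maruyama step: x_next = x + f(x, t) * dt + sigma * sqrt(dt) * eps."""
    mean = x + toy_drift(x, t) * dt
    std = sigma * math.sqrt(dt)
    x_next = mean + std * rng.gauss(0.0, 1.0)
    return x_next, mean, std

def gaussian_log_prob(x_next, mean, std):
    """Log-density of the step's Gaussian transition (requires std > 0)."""
    z = (x_next - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(0)
x_next, mean, std = sde_step(x=1.0, t=0.0, dt=0.1, sigma=0.5, rng=rng)
logp = gaussian_log_prob(x_next, mean, std)
```

With `sigma = 0` the step collapses to a plain Euler ODE update and the log-probability is undefined, which is the incompatibility the SDE reformulation is meant to remove.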

Evaluation Highlights

Outperforms baselines by up to 181% on VideoAlign motion quality benchmarks for text-to-video generation
Raises the HPS-v2.1 score on Stable Diffusion v1.4 from 0.239 to 0.365 (+0.126 absolute, roughly 53% relative to the base model)
Successfully scales to large-scale datasets (>10,000 prompts) where prior methods like DDPO and DPOK fail to converge stably

Breakthrough Assessment

9/10

First successful application of GRPO to visual generation that unifies diffusion and flow matching. Addresses major stability issues in RLHF for video, with reported gains of up to 181% over baselines on VideoAlign motion quality.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning generative models (diffusion or rectified flows) via Reinforcement Learning to maximize a non-differentiable reward signal

Inputs: Text prompt c (and optionally an input image for I2V tasks)

Outputs: Generated image or video sequence x_0

Pipeline Flow

Prompt Sampling
Group Generation (Policy Rollout)
Reward Evaluation
Advantage Computation
Policy Update

System Modules

Policy Model (Generator)

Generates a group of G outputs {o_1, ..., o_G} for a given prompt c using SDE-based sampling

Model or implementation: Various (Stable Diffusion v1.4, FLUX.1-dev, HunyuanVideo, SkyReels-I2V)

Reward Models

Assign scalar scores to each generated output

Model or implementation: Ensemble (HPS-v2.1, CLIP, VideoAlign)
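When several reward models score the same group, their scales can differ widely, which is one plausible reason the summary lists aggregating advantages rather than raw rewards as a stability factor. A minimal sketch, assuming equal weighting and made-up scores (`reward_a` and `reward_b` are hypothetical model names):

```python
from statistics import mean, pstdev

def normalize(scores, eps=1e-8):
    """Standardize one reward model's scores within a single group."""
    m, s = mean(scores), pstdev(scores)
    return [(r - m) / (s + eps) for r in scores]

def aggregate_advantages(scores_by_model):
    """Normalize per model first, then sum across models for each sample."""
    per_model = [normalize(scores) for scores in scores_by_model.values()]
    return [sum(column) for column in zip(*per_model)]

scores_by_model = {
    "reward_a": [0.20, 0.25, 0.30, 0.35],  # small numeric scale
    "reward_b": [40.0, 10.0, 30.0, 20.0],  # large numeric scale
}
adv = aggregate_advantages(scores_by_model)

# For contrast: summing raw scores simply reproduces reward_b's ranking.
raw_sum = [a + b for a, b in zip(*scores_by_model.values())]
```

In this toy group the raw sum ranks sample 0 first purely because of `reward_b`, while the aggregated advantage of sample 0 is close to zero because the two models disagree about it.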

GRPO Optimizer

Computes group relative advantages and updates the policy to maximize expected reward

Model or implementation: PPO-style clipped objective without a value network
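As a worked illustration of the group relative advantage, the usual GRPO form standardizes each reward against its own group. The symbols r_i, \mu, and \sigma are notation assumed here, as is the use of the population standard deviation; the reward values are made up.

```latex
\[
A_i = \frac{r_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{G}\sum_{j=1}^{G} r_j, \qquad
\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu)^2}
\]

\[
G = 4,\; r = (0.2,\, 0.4,\, 0.6,\, 0.8)
\;\Rightarrow\;
\mu = 0.5,\quad \sigma = \sqrt{0.05} \approx 0.224,\quad
A \approx (-1.34,\, -0.45,\, 0.45,\, 1.34)
\]
```

Because the baseline is the group mean, no learned value network is needed to estimate it.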

Novel Architectural Elements

Unified SDE sampling formulation for both diffusion and rectified flow models to support GRPO exploration
Shared noise initialization strategy across the group for identical prompts to prevent reward hacking
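The shared-noise strategy can be sketched as follows: the initial noise is drawn once per prompt and copied to every group member, so differences in reward within the group come from the stochastic sampling path rather than from the starting point. The function name, seed handling, and vector size are illustrative assumptions.

```python
import random

def make_group_initial_noise(prompt_seed, group_size, dim, shared=True):
    """Return one initial-noise vector per group member.

    shared=True:  every member starts from the same vector, drawn once per prompt.
    shared=False: each member draws its own vector.
    """
    rng = random.Random(prompt_seed)
    if shared:
        base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        return [list(base) for _ in range(group_size)]
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(group_size)]

shared_group = make_group_initial_noise(prompt_seed=123, group_size=4, dim=8)
independent_group = make_group_initial_noise(123, 4, 8, shared=False)
```

Each member receives its own copy of the vector, so later in-place updates to one sample do not leak into the others.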

Modeling

Base Model: Stable Diffusion v1.4, FLUX.1-dev, HunyuanVideo, SkyReels-I2V

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward while staying close to the old policy.

Formally: maximize E [ min( rho * A_i, clip(rho, 1-eps, 1+eps) * A_i ) ], where rho is the probability ratio between the new and old policies, A_i is the group relative advantage of sample i, and eps is the clipping range.
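The clipped objective can be sketched in a few lines. This is a simplified illustration with one log-probability per sample; a real implementation would typically work with per-timestep ratios. The default `eps=0.2` is an assumption, since the clip value is not reported in this summary.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sample."""
    rho = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(rho, 1.0 + eps))
    return min(rho * advantage, clipped * advantage)

def grpo_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative mean surrogate over the group, so minimizing it maximizes the objective."""
    terms = [
        clipped_surrogate(n, o, a, eps)
        for n, o, a in zip(logp_new, logp_old, advantages)
    ]
    return -sum(terms) / len(terms)

# Ratio of 2 with a positive advantage is clipped to 1 + eps.
example = clipped_surrogate(math.log(2.0), 0.0, 1.0)
```

The clip only limits updates that would increase the objective: with a negative advantage and a ratio of 2, the unclipped (more pessimistic) term is the one that is kept.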

Training Data:

Prompts from VidProM (video), ConsisID (I2V), and curated internal datasets
Scaling: >10,000 prompts used for optimization

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper (implied standard GRPO settings)
clip_epsilon: Not explicitly reported in the paper
kl_penalty: Omitted by default (empirically found unnecessary)

Compute: 32 H800 GPUs (Flow T2I/I2V), 8 H800 GPUs (SD), 64 H800 GPUs (T2V). Training time not reported.

Comparison to Prior Work

vs. DDPO/DPOK: DanceGRPO eliminates the need for a critic model (value network) and is stable on large prompt sets (>10k) where DDPO/DPOK fail.
vs. ReFL: DanceGRPO supports non-differentiable rewards (black-box) and is more memory efficient for video.
vs. DPO variants: DanceGRPO achieves significantly higher reward gains (e.g., +181% vs marginal DPO gains) by actively exploring via sampling.
vs. DeepSeek-R1 [not cited in paper]: Applies the GRPO algorithm (introduced for LLMs in DeepSeekMath and popularized by DeepSeek-R1) to the visual domain by resolving the mismatch between GRPO's need for stochastic sampling and deterministic ODE samplers.

Limitations

Requires reformulating ODE samplers (Rectified Flows) as SDEs, which adds complexity
Video generation training is computationally expensive (64 H800 GPUs)
Text-video alignment reward was found unstable and excluded from final video analysis
Best-of-N scaling relies on brute-force search, which is computationally inefficient compared to tree search

Reproducibility

Code: https://github.com/XueZeyue/DanceGRPO

Code is publicly released at the repository above. Specific hyperparameters such as learning rates and clip epsilon are not detailed in the main text, which defers to Appendix 6 (not covered in this summary). Benchmark prompts are standard and public.

📊 Experiments & Results

Evaluation Setup

RLHF fine-tuning on multiple foundation models across image and video tasks.

Benchmarks:

HPS-v2.1 Benchmark (Text-to-Image Generation)
GenEval (Text-to-Image Generation)
Pick-a-Pic (Text-to-Image Generation)
VideoAlign (Text-to-Video and Image-to-Video Generation)

Metrics:

HPS-v2.1 Score
CLIP Score
VideoAlign Score (Aesthetics, Motion)
Pick-a-Pic Win Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

DanceGRPO significantly improves aesthetic and alignment scores across Stable Diffusion and FLUX models.
Benchmark	Metric	Baseline	This Paper	Δ
HPS-v2.1 Benchmark	HPS Score	0.239	0.365	+0.126
HPS-v2.1 Benchmark	CLIP Score	0.363	0.395	+0.032
Internal Evaluation	VideoAlign Motion Score	Not reported in the paper	Not reported in the paper	Not reported in the paper
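The Δ column reports absolute differences. As a quick arithmetic check, the relative gains implied by the two reported rows can be computed directly; the helper name `deltas` is just for illustration.

```python
def deltas(baseline, ours):
    """Return (absolute difference, relative change in percent)."""
    absolute = ours - baseline
    relative_pct = 100.0 * absolute / baseline
    return absolute, relative_pct

hps_abs, hps_rel = deltas(0.239, 0.365)
clip_abs, clip_rel = deltas(0.363, 0.395)

print(f"HPS:  {hps_abs:+.3f} absolute, {hps_rel:+.1f}% relative")
print(f"CLIP: {clip_abs:+.3f} absolute, {clip_rel:+.1f}% relative")
```

So the HPS gain of +0.126 corresponds to roughly a 53% relative improvement, and the CLIP gain of +0.032 to roughly 9%.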

Experiment Figures

Impact of Best-of-N inference scaling on training convergence.

Main Takeaways

Unified Framework: Successfully applies one algorithm (DanceGRPO) to Diffusion (SD1.4) and Rectified Flows (FLUX, Hunyuan), and across Image/Video tasks.
Scalability: Unlike DDPO/DPOK, which fail beyond roughly 100 prompts, DanceGRPO scales stably to >10,000 prompts.
Video Capability: First RL method validated for video generation that avoids the VRAM cost of differentiable-reward methods.
Binary Rewards: Demonstrated ability to learn from sparse, thresholded binary rewards (simulated DeepSeek-R1 style feedback).
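A thresholded binary reward can be sketched as below. The threshold value and the strict `>` comparison are assumptions for illustration, not the paper's settings. The helper `has_learning_signal` highlights a practical consequence of sparse rewards under group-relative advantages: a group whose members all receive the same reward carries no gradient signal.

```python
def binarize(scores, threshold):
    """Map continuous reward scores to 0/1 by thresholding."""
    return [1.0 if s > threshold else 0.0 for s in scores]

def has_learning_signal(rewards):
    """A group is informative only if its rewards are not all identical."""
    return len(set(rewards)) > 1

scores = [0.24, 0.31, 0.27, 0.35]  # made-up continuous scores
rewards = binarize(scores, threshold=0.28)
informative = has_learning_signal(rewards)
```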

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models and Rectified Flows
Reinforcement Learning (Policy Gradients)
Stochastic Differential Equations (SDEs)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, removing the need for a separate value network critic

SDE: Stochastic Differential Equation—a differential equation where one or more terms are stochastic processes, used here to inject noise into the sampling process for exploration

DDPO: Denoising Diffusion Policy Optimization—a prior RL method for fine-tuning diffusion models using policy gradients

DPOK: Diffusion Policy Optimization with KL regularization—another prior RL method for diffusion models

Rectified Flow: A generative model framework that learns a transport map between noise and data distributions via Ordinary Differential Equations (ODEs)

CFG: Classifier-Free Guidance—a technique to improve sample quality by mixing conditional and unconditional score estimates

Best-of-N: An inference strategy where N samples are generated and the best one is selected based on a reward model; used here as a scaling strategy for training data

HPS-v2.1: Human Preference Score—a reward model trained to predict human aesthetic preferences for images

VideoAlign: A reward model for video generation assessing aesthetics, motion quality, and text alignment

ReFL: Reward Feedback Learning, a method that fine-tunes diffusion models by backpropagating gradients from a differentiable reward model through the generated sample

MDP: Markov Decision Process—a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker