← Back to Paper List

Video Diffusion Alignment via Reward Gradients

Mihir Prabhudesai, R. Mendonca, Zheyang Qin, Katerina Fragkiadaki, Deepak Pathak
Department of Computer Science, Carnegie Mellon University
arXiv.org (2024)
MM RL

📝 Paper Summary

Video Diffusion Models Generative Model Alignment
VADER fine-tunes video diffusion models by backpropagating gradients from differentiable reward models (like CLIP or VideoMAE) directly into the diffusion process, enabling efficient alignment without collecting target video datasets.
Core Problem
Adapting video diffusion models to specific tasks typically requires collecting expensive target video datasets or using inefficient reinforcement learning methods that rely on sparse scalar feedback.
Why it matters:
  • Collecting target datasets of videos for every new task is prohibitively expensive and tedious compared to image or text domains.
  • General-purpose web-scale models often produce content with dull colors, poor camera angles, or temporal inconsistencies unsuited for professional animation or robotics.
  • Current gradient-free alignment methods (like those using policy gradients) scale poorly to video because they reduce rich spatial-temporal feedback to a single scalar value.
Concrete Example: An animator needs a video that strictly adheres to a script and specific camera angles. A standard web-scale model might generate the correct object but with random, jerky camera motion. VADER uses a reward model to enforce smooth camera trajectories without needing a dataset of 'smooth camera' videos.
Key Novelty
VADER (Video Alignment via DifferEntiable Rewards)
  • Leverages pre-trained differentiable reward models (e.g., aesthetic predictors, object detectors) to compute gradients w.r.t. generated pixels.
  • Backpropagates these dense gradients directly through the diffusion denoising process to update model weights, rather than treating the reward as a black-box scalar.
  • Utilizes memory-saving tricks like truncated backpropagation (1 step), LoRA, and frame subsampling to make video gradient computation feasible on consumer hardware (16GB VRAM).
Architecture
Architecture Figure Algorithm 1 / Implicit System Diagram
The training loop where a video is generated, passed to a reward model, and gradients are backpropagated.
Evaluation Highlights
  • VADER outperforms gradient-free baselines (DDPO, DPO) in sample efficiency and alignment quality across text-to-video and image-to-video tasks.
  • Successfully aligns models using diverse rewards: image aesthetics, text-alignment (HPSv2), object removal (YOLOS), action classification (VideoMAE), and temporal consistency (V-JEPA).
  • Generalizes well to unseen prompts during inference, maintaining improvements in aesthetic quality and instruction following.
Breakthrough Assessment
8/10
Significant step in making video alignment tractable without massive datasets. By effectively using reward gradients, it bridges the gap between expensive supervised fine-tuning and inefficient RL, making custom video generation accessible.
×