Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
Google, University of Waterloo
Neural Information Processing Systems (2024)
MM, RL, Benchmark
📝 Paper Summary
Video Generation, Consistency Models
T2V-Turbo accelerates high-quality video generation by integrating mixed image and video reward feedback directly into consistency distillation. The rewards are applied to the single-step generations produced during distillation, so gradients never have to be backpropagated through an iterative sampling chain.
Core Problem
Diffusion-based video models suffer from slow iterative sampling, while faster Consistency Models (CMs) hit a quality ceiling set by their teacher models and by the reduced number of sampling steps.
Why it matters:
Slow inference prevents real-time applications of high-quality video generation models
Existing open-source models trained on web-scale data often fail to align with human aesthetic preferences and text prompts
Previous reward-finetuning methods like InstructVideo are memory-prohibitive due to backpropagating gradients through long sampling chains
Concrete Example: A standard diffusion model requires 50 iterative steps to generate a coherent video, taking seconds to minutes. Existing distillation methods reduce this to 4 steps but produce blurry or temporally inconsistent results because they only mimic the teacher model without incorporating direct human preference feedback.
Key Novelty
Mixed Reward Feedback in Consistency Distillation
Integrates feedback from both Image-Text (spatial) and Video-Text (temporal) reward models directly into the distillation loss
Optimizes the single-step generation produced during consistency distillation, avoiding the high memory cost of backpropagating through iterative sampling steps
Architecture
The training pipeline showing how reward feedback is integrated into Consistency Distillation
Evaluation Highlights
Achieves >10x inference acceleration (4 steps vs 50 steps) while improving quality over teacher models
4-step generations surpass proprietary systems Gen-2 and Pika on the VBench evaluation benchmark
Human evaluators prefer 4-step T2V-Turbo videos over 50-step generations from the original teacher models (VideoCrafter2 and ModelScopeT2V)
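The step counts quoted above can be checked with a line of arithmetic. The sketch below assumes every denoising step costs the same for teacher and student, which is a simplification (it ignores text encoding and VAE decoding); the function name `step_speedup` is hypothetical.

```python
def step_speedup(teacher_steps: int, student_steps: int) -> float:
    """Ratio of denoising steps, assuming every step costs the same (an assumption)."""
    if teacher_steps <= 0 or student_steps <= 0:
        raise ValueError("step counts must be positive")
    return teacher_steps / student_steps


speedup = step_speedup(teacher_steps=50, student_steps=4)
print(f"{speedup:.1f}x fewer denoising steps")  # prints "12.5x fewer denoising steps"
```

Under that assumption, 50 steps versus 4 steps gives a 12.5x reduction, consistent with the ">10x" figure.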
Breakthrough Assessment
8/10
Successfully combines speed (consistency models) with high quality (reward feedback) in a memory-efficient way, outperforming proprietary commercial baselines with significantly fewer inference steps.
⚙️ Technical Details
Problem Definition
Setting: Text-to-Video generation using Consistency Models distilled from pre-trained Latent Diffusion Models
Inputs: Text prompt c
Outputs: Generated video x
Pipeline Flow
Text Encoding
Latent Denoising (VCM)
Video Decoding
System Modules
Text Encoder
Converts text prompts into embeddings for conditioning
Model or implementation: Not explicitly specified (implied CLIP/T5 based on teacher models)
Video Consistency Model (VCM)
Predicts the clean video latent directly from noise in a single or few steps
Model or implementation: UNet-based architecture initialized from teacher T2V model (VideoCrafter2 or ModelScopeT2V) with LoRA adapters
VAE Decoder
Decodes the predicted latent representation into pixel-space video frames
Model or implementation: Pre-trained VAE Decoder (frozen)
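To make the few-step "Latent Denoising" stage more concrete, here is a minimal toy sketch of multistep consistency sampling: jump to a clean estimate, re-noise to a lower noise level, and jump again. It is not the paper's implementation. The UNet is replaced by a closed-form consistency function for a 1-D Gaussian "dataset", and the noising rule `x_t = x_0 + t * noise`, the noise levels, and all names are assumptions chosen for illustration.

```python
import math
import random

MU, SIGMA_DATA = 2.0, 0.5  # toy 1-D "data" distribution N(MU, SIGMA_DATA^2)


def consistency_fn(x_t: float, t: float) -> float:
    """Closed-form consistency function for the toy Gaussian.

    Under x_t = x_0 + t * noise, it maps a noisy point at noise level t
    to the start of its probability-flow ODE trajectory. In a real VCM
    this role is played by the distilled UNet.
    """
    scale = SIGMA_DATA / math.sqrt(SIGMA_DATA**2 + t**2)
    return MU + (x_t - MU) * scale


def multistep_sample(noise_levels, rng):
    """Few-step sampling with one model call per entry in noise_levels."""
    t_max = noise_levels[0]
    x0 = consistency_fn(t_max * rng.gauss(0.0, 1.0), t_max)  # step 1: pure noise -> estimate
    for t in noise_levels[1:]:
        x_t = x0 + t * rng.gauss(0.0, 1.0)  # re-noise the estimate to level t
        x0 = consistency_fn(x_t, t)         # jump back to a clean estimate
    return x0


rng = random.Random(0)
samples = [multistep_sample([80.0, 20.0, 5.0, 1.0], rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"mean={mean:.2f}, std={std:.2f}")  # should land close to MU and SIGMA_DATA
```

Four entries in `noise_levels` correspond to four model calls, mirroring the 4-step setting; in the real pipeline each call also takes the text embedding, and the final latent goes through the VAE decoder.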
Novel Architectural Elements
Integration of mixed reward feedback (Image and Video) directly into the single-step output of the consistency distillation process
Modeling
Base Model: VideoCrafter2 and ModelScopeT2V
Training Method: Consistency Distillation (CD) augmented with Differentiable Reward Feedback
Objective Functions:
Purpose: Enforce self-consistency of the model predictions across time steps.
Formally: L_CD is a distance between the z_0 predicted at the current timestep and the z_0 predicted at the adjacent timestep, where the adjacent-timestep input is estimated with an ODE solver
Purpose: Align individual frames with human aesthetic preferences.
Formally: J_img = log(sigmoid(Reward_img(x_decoded))) maximized for sampled frames
Purpose: Align video dynamics and transitions with text description.
Formally: J_vid = log(sigmoid(Reward_vid(x_decoded))) maximized for full video
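Written out in LaTeX, the three objectives above might be combined as follows. This is a sketch that follows this summary's wording rather than a verified transcription of the paper: the log-sigmoid form is taken from the list above, and the distance d, the target network f_{θ^-}, the solver estimate \hat{z}_{t_n}, the decoder \mathcal{D}, the frame count M, and the weights β_img and β_vid are notation assumed here.

```latex
\begin{aligned}
L_{\mathrm{CD}} &= \mathbb{E}\Big[ d\big( f_\theta(z_{t_{n+k}}, c, t_{n+k}),\; f_{\theta^-}(\hat{z}_{t_n}, c, t_n) \big) \Big] \\
\hat{z}_0 &= f_\theta(z_{t_{n+k}}, c, t_{n+k}), \qquad \hat{x}_0 = \mathcal{D}(\hat{z}_0) \\
J_{\mathrm{img}} &= \mathbb{E}\Big[ \sum_{m=1}^{M} \log \sigma\big( R_{\mathrm{img}}(\hat{x}_0^{m}, c) \big) \Big] \\
J_{\mathrm{vid}} &= \mathbb{E}\Big[ \log \sigma\big( R_{\mathrm{vid}}(\hat{x}_0, c) \big) \Big] \\
L &= L_{\mathrm{CD}} - \beta_{\mathrm{img}}\, J_{\mathrm{img}} - \beta_{\mathrm{vid}}\, J_{\mathrm{vid}}
\end{aligned}
```

Here \hat{z}_{t_n} is the ODE-solver estimate obtained from z_{t_{n+k}} with the teacher, and \hat{x}_0^{m} is the m-th sampled frame. Because \hat{z}_0 comes from a single forward pass of f_θ, the reward gradients flow through one network call instead of a full sampling chain.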
vs. InstructVideo: Optimizes single-step CD generations instead of backpropagating through the sampling chain, avoiding memory bottlenecks
vs. VideoCrafter2 (Teacher): Achieves comparable or better quality in 4 steps vs 50 steps
vs. Gen-2/Pika: Surpasses these proprietary models on VBench with a distilled open-source model; they serve as benchmark baselines rather than as architectural comparisons
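As a rough numerical illustration of the objectives listed under Modeling, the sketch below folds a consistency-distillation loss and the two reward terms into one scalar. The rewards are plain floats standing in for reward-model outputs, and the function names and the weights `beta_img` and `beta_vid` are hypothetical; a real training step would compute these on tensors with automatic differentiation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def mixed_reward_loss(cd_loss, frame_rewards, video_reward,
                      beta_img=1.0, beta_vid=1.0):
    """Scalar loss: CD loss minus weighted image and video reward objectives.

    frame_rewards: image-text reward scores for the sampled frames.
    video_reward:  video-text reward score for the whole clip.
    Minimizing the result maximizes both reward objectives.
    """
    j_img = sum(log_sigmoid(r) for r in frame_rewards)
    j_vid = log_sigmoid(video_reward)
    return cd_loss - beta_img * j_img - beta_vid * j_vid


# Same CD loss, different rewards: better-rewarded samples give a lower loss.
low = mixed_reward_loss(0.1, frame_rewards=[-2.0, -2.0], video_reward=-2.0)
high = mixed_reward_loss(0.1, frame_rewards=[2.0, 2.0], video_reward=2.0)
print(high < low)  # prints "True"
```

Setting both weights to zero recovers plain consistency distillation, which makes the reward terms easy to ablate in a sketch like this.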
Limitations
Reliance on the quality of the teacher model for initialization
Performance depends on the accuracy of the differentiable reward models used for feedback
Reproducibility
Training details (hyperparameters, GPUs) are provided. Teacher models (VideoCrafter2, ModelScopeT2V) and reward models (HPSv2.1, InternVideo2, ViCLIP) are public. Code availability is not explicitly confirmed in the provided text.
📊 Experiments & Results
Evaluation Setup
Text-to-Video generation evaluated on standardized prompts
Benchmarks:
VBench (comprehensive video evaluation, 16 dimensions)
EvalCrafter (human evaluation, 700 prompts)
Metrics:
VBench Total Score
Quality Score
Semantic Score
Human Preference (Win Rate)
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark           Metric   Baseline   This Paper   Δ
Inference Latency   Steps    50         4            -46
Main Takeaways
4-step generations from T2V-Turbo achieve the highest Total Score on VBench, surpassing both open-source baselines (VideoCrafter2, ModelScopeT2V) and proprietary models (Gen-2, Pika).
Human evaluation on EvalCrafter prompts confirms that 4-step T2V-Turbo videos are preferred over 50-step videos from the teacher models, validating the effectiveness of the mixed reward feedback.
The method successfully breaks the 'quality bottleneck' of consistency models, allowing the distilled student to outperform the teacher model in quality despite using significantly fewer inference steps.
📚 Prerequisite Knowledge
Prerequisites
Diffusion Models (DM) and Probability Flow ODEs
Consistency Models (CM) and Consistency Distillation (CD)
Classifier-Free Guidance (CFG)
Low-Rank Adaptation (LoRA)
Key Terms
VCM: Video Consistency Model, a model trained to map any point on a diffusion trajectory directly to that trajectory's origin (the clean sample), enabling fast few-step generation
CD: Consistency Distillation, a training technique that converts a slow diffusion teacher into a fast consistency student
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to them
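The LoRA entry can be illustrated with a tiny pure-Python sketch of the low-rank update W + (alpha / rank) * B A. The matrix sizes, the scaling convention, and the helper names are assumptions for illustration; they are not taken from the paper's training configuration.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*right)]
            for row in left]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be trained.
    """
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


d = 4
w = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen 4x4 weight
a = [[1.0, 2.0, 3.0, 4.0]]                                          # rank-1 factor A
b_zero = [[0.0] for _ in range(d)]                                  # B starts at zero

# With B = 0 the adapted layer is identical to the frozen one.
print(lora_effective_weight(w, a, b_zero, alpha=2.0, rank=1) == w)  # prints "True"

full_params = d * d
lora_params = len(a) * d + d * len(a)  # entries in A plus entries in B
print(full_params, lora_params)  # prints "16 8"
```

The saving grows with layer size: for rank r the adapter has r * (d_in + d_out) trainable entries instead of d_in * d_out.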