DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

📝 Paper Summary

Video Generation, Model Distillation, Reward Fine-tuning

DOLLAR enables high-quality 4-step video generation by combining variational score and consistency distillation with a memory-efficient latent reward model that bypasses pixel-space decoding.

Core Problem

Diffusion video models require computationally expensive sampling (50+ steps), and existing few-step distillation methods often sacrifice diversity or fail to align with aesthetic rewards due to memory constraints.

Why it matters:

Standard diffusion sampling is too slow for real-time video applications, often taking minutes per clip
Current distillation methods like Consistency Distillation (CD) often produce blurry or over-smoothed results in video
Fine-tuning with reward models (like HPSv2) is memory-prohibitive for videos because gradients must backpropagate through the decoder and large pixel-space reward networks

Concrete Example: When distilling a teacher model to 4 steps, standard Variational Score Distillation (VSD) often suffers from mode collapse (low diversity), while Consistency Distillation (CD) lacks fine detail. Furthermore, optimizing for 'aesthetic quality' usually causes Out-Of-Memory (OOM) errors on consumer GPUs because the entire video must be decoded to pixels to calculate the reward.

Key Novelty

DOLLAR (Distillation and Latent Reward Optimization)

Combines Variational Score Distillation (for quality) and Consistency Distillation (for diversity) to train a few-step student generator
Learns a lightweight 'Latent Reward Model' (LRM) that approximates complex pixel-space rewards (like aesthetics) directly in the latent space
Optimizes the diffusion model using gradients from the LRM, avoiding the need to decode videos or backpropagate through massive original reward models
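
The latent-reward idea in the list above can be illustrated with a toy sketch. Everything here is a stand-in chosen for illustration: `decode`, `pixel_reward`, and the linear `w*z + b` model are hypothetical, scalars replace video latents and pixels, and the paper's actual LRM architecture and training recipe are not reproduced. The point is only that the decoder is needed to build training labels, and not afterwards when the reward is differentiated.

```python
# Toy illustration only: scalars stand in for video latents and pixel frames.

def decode(z):
    """Hypothetical stand-in for the VAE decoder (latent -> pixels)."""
    return 2.0 * z + 1.0


def pixel_reward(x):
    """Hypothetical stand-in for a pixel-space reward such as HPSv2."""
    return 0.5 * x


def fit_latent_reward_model(latents, steps=2000, lr=0.05):
    """Fit LRM(z) = w*z + b to pixel_reward(decode(z)) by gradient descent on MSE."""
    # The decoder is used once here, only to produce regression targets.
    targets = [pixel_reward(decode(z)) for z in latents]
    w, b = 0.0, 0.0
    n = len(latents)
    for _ in range(steps):
        grad_w = sum(2.0 * (w * z + b - r) * z for z, r in zip(latents, targets)) / n
        grad_b = sum(2.0 * (w * z + b - r) for z, r in zip(latents, targets)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


latents = [-1.0, -0.5, 0.0, 0.5, 1.0]
w, b = fit_latent_reward_model(latents)

# During fine-tuning, the reward gradient with respect to a latent is just w:
# no call to decode() and no backpropagation through pixel_reward() is needed.
reward_grad_wrt_latent = w
```

In this toy setup the true reward is `z + 0.5`, so the fitted model recovers a slope of 1 and an offset of 0.5.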

Architecture

The distillation and fine-tuning framework (training pipeline) showing how VSD, CD, and LRM losses update the student generator.

Evaluation Highlights

Achieves 82.57 Total VBench Score (using HPSv2 reward), outperforming the 50-step Teacher model (80.25) and baselines like Gen-3 (82.32) and Kling (81.85)
One-step distillation accelerates diffusion sampling by 278.6x compared to the teacher model, reducing diffusion time to 0.33% of the original
Student model maintains high generation diversity (Vendi score 1.98) while achieving 4-step inference, surpassing pure VSD (1.91) which suffers from mode collapse

Breakthrough Assessment

8/10

Significantly reduces the computational cost of video generation while improving quality metrics over the teacher. The Latent Reward Model is a clever solution to the memory bottleneck in video RLHF.

⚙️ Technical Details

Problem Definition

Setting: Text-to-Video (T2V) generation using a distilled diffusion model with minimized sampling steps

Inputs: Text prompt c and initial noise epsilon

Outputs: Generated video x_0 (sequence of frames)

Pipeline Flow

Text Encoder (Processes prompt)
Student DiT Generator (Denoises latents in 4 steps)
VAE Decoder (Converts latents to pixel video)
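
A minimal sketch of how a 4-step generator could be sampled, assuming the standard multistep consistency-style loop (predict a clean latent, re-noise to the next level, repeat). Whether DOLLAR samples exactly this way is an assumption; `alpha_sigma`, `sample`, and `toy_student` are hypothetical names, the cosine-style noise schedule is illustrative, and a scalar stands in for the video latent. The VAE decoder would be applied once to the returned latent.

```python
import math
import random

# Student schedule from the summary, visited from high noise to low noise.
TIMESTEPS = [999, 749, 499, 249]


def alpha_sigma(t, num_train_steps=1000):
    """Assumed cosine-style schedule; the paper's actual schedule may differ."""
    angle = 0.5 * math.pi * (t + 1) / num_train_steps
    return math.cos(angle), math.sin(angle)


def sample(student, prompt_emb, rng):
    """Run the few-step loop and return (clean latent, number of network calls)."""
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    nfe = 0
    for i, t in enumerate(TIMESTEPS):
        x0_hat = student(x, t, prompt_emb)  # one network call predicts the clean latent
        nfe += 1
        if i + 1 < len(TIMESTEPS):
            a, s = alpha_sigma(TIMESTEPS[i + 1])
            x = a * x0_hat + s * rng.gauss(0.0, 1.0)  # re-noise to the next level
        else:
            x = x0_hat  # final step: keep the clean prediction
    return x, nfe


def toy_student(x_t, t, prompt_emb):
    """Hypothetical student that ignores its input and returns the prompt value."""
    return prompt_emb


latent, nfe = sample(toy_student, 0.25, random.Random(0))
```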

System Modules

Text Encoder

Encodes text prompt into embeddings

Model or implementation: T5 (implied via CogVideoX architecture)

Student Generator

Predicts clean video latents from noise in few steps

Model or implementation: Diffusion Transformer (DiT), 3D variant

VAE Decoder

Decodes latent representations into RGB video frames

Model or implementation: 3D Variational Autoencoder (VAE)

Modeling

Base Model: Modified CogVideoX (DiT architecture)

Training Method: Multi-objective Distillation (VSD + CD) with Latent Reward Fine-tuning

Objective Functions:

Purpose: Minimize distribution mismatch between student and teacher.

Formally (stated as a gradient with respect to the student parameters theta, with x = G_theta(epsilon)): grad_theta L_VSD = E[-(s_real(x) - s_fake(x)) * dG_theta/dtheta]

Purpose: Enforce consistency of predictions across timesteps to enable few-step sampling.

Formally: L_CD = E[distance(f_theta(x_{t+m}), f_theta_minus(x_hat_t))]

Purpose: Align generation with human preference rewards efficiently.

Formally: L_FT = -E[LRM_phi(G_theta(epsilon))]
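
To make the VSD expression above concrete, here is a toy sketch in one dimension. It assumes unit-variance Gaussians so that both scores are available in closed form; in the actual method the fake score is produced by a learned network, and the names `score_gaussian` and `vsd_grad` are hypothetical. The student is G_theta(epsilon) = theta + epsilon, so dG_theta/dtheta = 1.

```python
import random


def score_gaussian(x, mean):
    """Score (gradient of the log-density) of a unit-variance Gaussian."""
    return -(x - mean)


def vsd_grad(theta, eps, mu_real):
    """Per-sample VSD gradient: -(s_real(x) - s_fake(x)) * dG/dtheta."""
    x = theta + eps                      # student sample, G_theta(eps)
    s_real = score_gaussian(x, mu_real)  # score under the teacher ("real") distribution
    s_fake = score_gaussian(x, theta)    # score under the student's own distribution
    return -(s_real - s_fake) * 1.0      # dG/dtheta = 1 for this toy generator


rng = random.Random(0)
theta, mu_real = 0.0, 3.0
for _ in range(200):
    eps = rng.gauss(0.0, 1.0)
    theta -= 0.1 * vsd_grad(theta, eps, mu_real)  # plain gradient descent
```

In this toy case the gradient reduces to `theta - mu_real`, so the student mean is pulled toward the teacher mean regardless of the sampled noise.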

Adaptation: Full fine-tuning of student weights initialized from teacher

Training Data:

320K licensed single-shot videos
Internal image and video datasets with text captioning

Key Hyperparameters:

student_steps: 4 (timesteps [249, 499, 749, 999])
teacher_steps: 50 (DDIM)
learning_rate: 2e-5
batch_size: 1 (per GPU)
beta_CD: 0.5
beta_FT: 1.0
optimizer: AdamW
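
The listed hyperparameters can be collected into a small config sketch. Two things here are assumptions rather than facts from the summary: that the teacher was trained with 1000 timesteps (which would make the four student timesteps an evenly spaced grid), and that the three objectives are combined as a weighted sum with `beta_CD` and `beta_FT` as the weights. `DistillConfig`, `student_timesteps`, and `total_loss` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DistillConfig:
    student_steps: int = 4
    teacher_steps: int = 50          # DDIM
    learning_rate: float = 2e-5
    batch_size_per_gpu: int = 1
    beta_cd: float = 0.5
    beta_ft: float = 1.0
    num_train_timesteps: int = 1000  # assumption, not stated in the summary


def student_timesteps(cfg):
    """Evenly spaced timesteps ending at the highest noise level."""
    stride = cfg.num_train_timesteps // cfg.student_steps
    return [stride * (k + 1) - 1 for k in range(cfg.student_steps)]


def total_loss(l_vsd, l_cd, l_ft, cfg):
    """Assumed weighted sum of the VSD, CD, and reward fine-tuning losses."""
    return l_vsd + cfg.beta_cd * l_cd + cfg.beta_ft * l_ft


cfg = DistillConfig()
schedule = student_timesteps(cfg)
```

Under these assumptions `schedule` reproduces the timesteps listed above.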

Compute: 8 NVIDIA A100 GPUs, trained for 40k iterations

Comparison to Prior Work

vs. T2V-Turbo: DOLLAR uses Latent Reward Model (LRM) to bypass decoder backpropagation, saving significant memory
vs. VideoLCM: DOLLAR combines VSD with CD to balance mode collapse (VSD) and blurriness (CD), whereas VideoLCM relies primarily on CD
vs. Teacher (CogVideoX): Reduces inference steps from 50 to 4 while improving VBench scores
vs. DMD2 [not cited in paper]: DMD2 uses adversarial losses for image distribution matching; DOLLAR avoids GAN losses in favor of VSD+CD+LRM for stability

Limitations

Dynamic degree (motion amount) optimization can lead to 'noise flow' artifacts if over-optimized
VBench evaluation relies on 'long prompts' generated by GPT-4o; performance drops on short prompts without this augmentation
Memory usage is still high for 4-step full-sequence generation during training, necessitating batch size of 1 per GPU
LRM training requires an initial phase to approximate the pixel-space reward

Reproducibility

Project page available at https://quantumiracle.github.io/dollar/. Code URL not explicitly provided in text. Uses licensed internal video data for training, which limits data reproducibility. Built on CogVideoX architecture.

📊 Experiments & Results

Evaluation Setup

Large-scale text-to-video generation evaluation using automated benchmarks and human preference

Benchmarks:

VBench (Video Generation Quality & Alignment)

Metrics:

VBench Total Score
Quality Score
Semantic Score
Vendi Score (Diversity)

Statistical methodology: Not explicitly reported in the paper
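
For readers unfamiliar with the Vendi Score listed among the metrics, it is defined as the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n-by-n similarity matrix with ones on the diagonal. The sketch below covers only a special case chosen because its eigenvalues have a closed form: every pair of distinct samples has the same similarity `s`. The function name `vendi_uniform` is hypothetical, and the feature extractor and kernel used in the paper are not modeled.

```python
import math


def vendi_uniform(n, s):
    """Vendi score for n samples whose pairwise similarities all equal s (0 <= s <= 1)."""
    # Eigenvalues of K/n when K has ones on the diagonal and s everywhere else.
    eigenvalues = [(1.0 + (n - 1) * s) / n] + [(1.0 - s) / n] * (n - 1)
    # Shannon entropy, using the convention 0 * log(0) = 0.
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 0.0)
    return math.exp(entropy)


fully_distinct = vendi_uniform(4, 0.0)  # no similarity between samples
fully_collapsed = vendi_uniform(4, 1.0)  # all samples identical
```

The score behaves like an effective number of distinct samples: it equals n when the samples share no similarity and falls to 1 when they are all identical.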

Key Results

DOLLAR outperforms both the Teacher model and strong baselines on VBench Total Score.

Benchmark	Metric	Baseline	This Paper	Δ
VBench	Total Score	80.25	82.57	+2.32
VBench	Total Score	81.01	82.57	+1.56

Ablation studies show that combining VSD and CD improves diversity compared to VSD alone.

Benchmark	Metric	Baseline	This Paper	Δ
Diversity Analysis	Vendi Score (Inception)	1.91	1.98	+0.07
Inference Time	Diffusion Time (% of Teacher)	91.94	5.88	-86.06

Experiment Figures

Human evaluation results comparing DOLLAR against Teacher, DDPO, and Gen-3 on metrics like visual quality and motion.

Training curves showing the Latent Reward Model (LRM) values increasing during fine-tuning for HPSv2 and PickScore.

Main Takeaways

Distilled student models can outperform their teachers in quality (VBench) by leveraging reward signals (LRM) during the distillation process.
Combining VSD and CD creates a synergy: VSD provides high fidelity distribution matching, while CD prevents the mode collapse typical of pure VSD.
Latent Reward Models (LRM) effectively approximate pixel-space rewards, enabling fine-grained alignment without the memory cost of decoding videos during training.
One-step generation is possible (278.6x speedup) but 4-step generation offers the best balance of quality and efficiency.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Probabilistic Models (DDPM/DDIM)
Knowledge Distillation (Teacher-Student training)
Latent Diffusion Models (LDM)
Reward-based fine-tuning (RLHF/RLAIF)

Key Terms

VSD: Variational Score Distillation—a method where a student model minimizes the divergence from a teacher's distribution using a learned score function

CD: Consistency Distillation—a technique enforcing that model predictions at different timesteps map to the same initial data point

LRM: Latent Reward Model—a compact proxy network trained to predict reward values directly from latent representations, bypassing the decoder

DiT: Diffusion Transformer—a diffusion model architecture using Transformers instead of U-Nets, scalable for video

HPSv2: Human Preference Score v2—a reward model predicting human aesthetic preference for images/videos

VBench: A comprehensive benchmark for evaluating video generation across dimensions like temporal consistency and imaging quality

CFG: Classifier-Free Guidance—a technique to improve prompt alignment by extrapolating between conditional and unconditional model predictions
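
The CFG extrapolation can be written in one line. This is a generic sketch of the standard formula, not code from the paper; `cfg_combine` and the list-of-floats representation are illustrative.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 1.0]
guided = cfg_combine(uncond, cond, 7.5)
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger scales push further in the conditional direction. Each guided step typically needs two network evaluations, one conditional and one unconditional.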

NFE: Number of Function Evaluations—the number of times the neural network is called during inference (sampling steps)