Directly Fine-Tuning Diffusion Models on Differentiable Rewards

📝 Paper Summary

Diffusion Model Fine-tuning, Alignment with Human Preferences

DRaFT fine-tunes diffusion models by backpropagating the gradient of a differentiable reward directly through the sampling process, using truncated backpropagation and variance reduction to train far more efficiently than reinforcement learning.

Core Problem

Diffusion models trained to match data distributions often fail to generate aesthetically pleasing images. Existing alignment methods based on Reinforcement Learning (RL) are sample-inefficient, while naive backpropagation through the full sampling chain is memory-prohibitive.

Why it matters:

Generative models need to satisfy complex human preferences (e.g., aesthetics) that are not captured by simple likelihood maximization on web data
RL-based fine-tuning ignores analytic gradients of differentiable reward functions, discarding useful information and slowing down training
Optimizing the latent noise (as DOODL does) requires expensive optimization at inference time for every new prompt, whereas model fine-tuning amortizes this cost

Concrete Example: When a user prompts for an image, a standard model might generate a realistic but unappealing image. To fix this, methods like DOODL optimize the specific noise input for that one image (slow at inference), while RL methods treat the reward as a black box (slow to train). DRaFT updates the model weights directly using the reward's gradient.

Key Novelty

Direct Reward Fine-Tuning (DRaFT)

Treats the diffusion sampling chain like a recurrent neural network and backpropagates the gradient of the reward function through the denoising steps to update model parameters
Truncates backpropagation to the last K steps (DRaFT-K) to prevent exploding gradients and reduce compute, finding that optimizing just the end of the chain is sufficient
Averages gradients over multiple noise samples (DRaFT-LV) to reduce variance when using short backpropagation chains (K=1), improving learning efficiency
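The recurrent-network analogy can be made concrete with the chain rule. The notation here is an assumption for illustration, not taken from the paper: each sampler step is written x_{t-1} = f_θ(x_t, c, t), x_T is the initial noise, x_0 is the final sample, and ∂x_{t-1}/∂θ denotes the direct dependence of step t on θ with x_t held fixed.

```latex
\nabla_\theta\, r(x_0, c)
  = \sum_{t=1}^{T}
    \frac{\partial r}{\partial x_0}
    \left( \prod_{s=1}^{t-1} \frac{\partial x_{s-1}}{\partial x_s} \right)
    \frac{\partial x_{t-1}}{\partial \theta}
\qquad\text{(full chain)}

\nabla_\theta^{(K)}\, r(x_0, c)
  = \sum_{t=1}^{K}
    \frac{\partial r}{\partial x_0}
    \left( \prod_{s=1}^{t-1} \frac{\partial x_{s-1}}{\partial x_s} \right)
    \frac{\partial x_{t-1}}{\partial \theta}
\qquad\text{(truncated to the last } K \text{ steps)}
```

Under this notation, truncation keeps only the terms with short Jacobian products. The long products in the dropped terms are the usual source of exploding gradients, and with K=1 the product is empty, so the estimate reduces to (∂r/∂x_0)(∂x_0/∂θ).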

Architecture

Illustration of DRaFT-K (Truncated Backpropagation). It shows the sampling chain where gradients are only backpropagated through the last K steps.

Evaluation Highlights

Maximizes LAION Aesthetics scores >200x faster than RL algorithms (Black et al.) by leveraging analytic gradients
DRaFT-LV (Low Variance) learns roughly 2x faster than ReFL (Reward Feedback Learning) by averaging gradients over multiple noise samples
Improves Stable Diffusion 1.4 samples when fine-tuned on the PickScore and Human Preference Score v2 rewards (qualitative result, exact delta not in text)

Breakthrough Assessment

7/10

Significantly improves efficiency over RL baselines for differentiable rewards. The unifying perspective on gradient-based fine-tuning is valuable, though reliance on differentiable rewards limits applicability compared to generic RL.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a pre-trained diffusion model to maximize a differentiable reward function defined on the final generated image

Inputs: Text prompt c and initial noise x_T

Outputs: Fine-tuned model parameters theta maximizing reward r(sample(theta, c, x_T))

Pipeline Flow

Input: Sample noise x_T and prompt c
Sampling Loop: Iteratively denoise x_T using UNet to get x_0 (differentiable chain)
Reward Calculation: Compute reward r(x_0, c)
Backward Pass: Backpropagate gradient of r through sampling steps to UNet parameters
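The four pipeline steps above can be sketched on a scalar toy problem. Everything in this block is a stand-in chosen for illustration: the tanh update replaces the UNet denoising step, the quadratic replaces the reward model, and the names `sample_chain`, `reward`, `draft_k_grad`, and `fine_tune` are hypothetical. The backward pass is written by hand so that the truncation to the last K steps is visible; a real implementation would rely on automatic differentiation.

```python
import math

def sample_chain(theta, x_T, T):
    """Toy deterministic sampler standing in for the UNet denoising loop.
    Each step is x_{t-1} = x_t + theta * tanh(x_t); returns [x_T, ..., x_0]."""
    xs = [x_T]
    for _ in range(T):
        xs.append(xs[-1] + theta * math.tanh(xs[-1]))
    return xs

def reward(x0, target=1.0):
    """Toy differentiable reward: largest when x_0 equals the target."""
    return -(x0 - target) ** 2

def draft_k_grad(theta, x_T, T, K, target=1.0):
    """d reward / d theta, backpropagated through only the last K sampling steps."""
    xs = sample_chain(theta, x_T, T)
    g_x = -2.0 * (xs[-1] - target)          # d r / d x_0
    g_theta = 0.0
    # Walk the transitions xs[i] -> xs[i + 1] backwards, newest first.
    for i in range(T - 1, T - 1 - K, -1):
        th = math.tanh(xs[i])
        g_theta += g_x * th                   # this step's direct dependence on theta
        g_x *= 1.0 + theta * (1.0 - th * th)  # carry the gradient back to xs[i]
    return g_theta

def fine_tune(theta, x_T, T=10, K=1, lr=0.05, steps=200):
    """Plain gradient ascent on the reward using the truncated gradient."""
    for _ in range(steps):
        theta += lr * draft_k_grad(theta, x_T, T, K)
    return theta
```

Setting K=T recovers the full-chain gradient, which can be checked against a finite difference. Setting K=1 uses only the final step's dependence on theta, yet still moves the toy parameter in a direction that raises the reward.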

System Modules

UNet Denoiser

Predicts noise to iteratively refine the latent image

Model or implementation: Stable Diffusion 1.4 UNet (with LoRA adapters)

Reward Model

Evaluates the final generated image and provides a differentiable score

Model or implementation: Differentiable reward function (e.g., LAION Aesthetic predictor)

Novel Architectural Elements

Truncated Backpropagation through Sampling: Backpropagating only through the last K sampling steps and stopping gradients at all earlier steps (DRaFT-K)
Multi-sample Gradient Averaging: Computing gradients for n noise samples and averaging them (DRaFT-LV)
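The variance-reduction idea behind multi-sample averaging can be illustrated with a scalar toy model. This is a sketch under stated assumptions, not the paper's implementation: `eps_theta(x) = theta * x` stands in for the UNet, the reward is a quadratic, and re-noising a fixed x0 with fresh noise before a single denoising step is one assumed way to obtain several noise samples for a K=1 chain (the summary does not spell out that detail). All function names are hypothetical.

```python
import random

ALPHA = 0.9
SIGMA = (1.0 - ALPHA ** 2) ** 0.5  # keeps ALPHA^2 + SIGMA^2 = 1

def lv_grad_estimate(theta, x0, n, rng, target=1.0):
    """Average the one-step (K=1) reward gradient over n fresh noise draws."""
    total = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        x1 = ALPHA * x0 + SIGMA * eps                # re-noise the fixed sample
        x0_hat = (x1 - SIGMA * theta * x1) / ALPHA   # one denoising step
        d_r = -2.0 * (x0_hat - target)               # d r / d x0_hat
        d_x0_hat = -SIGMA * x1 / ALPHA               # d x0_hat / d theta
        total += d_r * d_x0_hat
    return total / n

def empirical_variance(n, trials=2000, seed=0):
    """Variance of the n-sample gradient estimate across repeated trials."""
    rng = random.Random(seed)
    grads = [lv_grad_estimate(0.2, 0.5, n, rng) for _ in range(trials)]
    mean = sum(grads) / trials
    return sum((g - mean) ** 2 for g in grads) / trials
```

Comparing `empirical_variance(1)` with `empirical_variance(8)` shows the expected drop in gradient variance as n grows, since the estimate is a mean of independent draws. The extra cost is n forward and backward passes through a single step rather than through the whole chain.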

Modeling

Base Model: Stable Diffusion 1.4

Training Method: Gradient Ascent on Reward

Objective Functions:

Purpose: Maximize reward of generated image.

Formally: Maximize r(sample(theta, c, x_T), c)

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA weights only
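For readers unfamiliar with LoRA, the following is a minimal sketch of a single adapted linear layer in plain Python. The shapes, the `scale` argument, and the function names are illustrative assumptions rather than details from the paper: W is the frozen pretrained weight, and only the low-rank factors A and B would receive gradient updates.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x).
    W: frozen (d_out x d_in); A: trainable (r x d_in); B: trainable (d_out x r)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def trainable_fraction(d_out, d_in, r):
    """Share of parameters that are trainable relative to the frozen weight."""
    return r * (d_in + d_out) / (d_out * d_in)
```

With B initialised to zeros the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained model. For a 1024 x 1024 weight and rank 4, `trainable_fraction` gives under 1% of the original parameter count.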

Key Hyperparameters:

K: Number of backpropagation steps (truncation length)
n: Number of noise samples to average over (for DRaFT-LV)
m: Number of steps used in ReFL baseline (set to 20)

Compute: Gradient checkpointing is used to fit training in memory; DRaFT-1/LV adds ~10% compute overhead per step compared to standard sampling
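The memory-for-compute trade made by gradient checkpointing can be shown on a toy chain. This is a hand-written sketch, not the mechanism used in the paper's code: the tanh update stands in for a sampling step, the quantity differentiated is x_0 itself, and `grad_checkpointed` is a hypothetical helper that keeps only every `every`-th state and recomputes the rest during the backward pass.

```python
import math

def step(x, theta):
    """Toy stand-in for one sampling step."""
    return x + theta * math.tanh(x)

def backward_through(states, theta, g_x, g_theta):
    """Backpropagate through the transitions whose inputs are `states`, newest first."""
    for x in reversed(states):
        th = math.tanh(x)
        g_theta += g_x * th
        g_x *= 1.0 + theta * (1.0 - th * th)
    return g_x, g_theta

def grad_full_storage(theta, x_T, T):
    """Reference d x_0 / d theta: stores all T step inputs."""
    xs = [x_T]
    for _ in range(T - 1):
        xs.append(step(xs[-1], theta))
    return backward_through(xs, theta, 1.0, 0.0)[1]

def grad_checkpointed(theta, x_T, T, every):
    """Same gradient, storing only every `every`-th state in the forward pass."""
    checkpoints = {0: x_T}
    x = x_T
    for i in range(T):
        x = step(x, theta)
        if (i + 1) % every == 0 and i + 1 < T:
            checkpoints[i + 1] = x
    g_x, g_theta, end = 1.0, 0.0, T
    for start in sorted(checkpoints, reverse=True):
        # Recompute the step inputs start .. end-1 from the stored checkpoint.
        seg = [checkpoints[start]]
        for _ in range(start, end - 1):
            seg.append(step(seg[-1], theta))
        g_x, g_theta = backward_through(seg, theta, g_x, g_theta)
        end = start
    return g_theta
```

Both functions return the same gradient. The checkpointed version holds roughly T/every checkpoints plus one segment of length `every` at a time, at the cost of one extra forward computation per segment.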

Comparison to Prior Work

vs. RL: DRaFT uses analytic gradients of the reward, making it >200x more sample efficient
vs. ReFL: DRaFT-LV averages gradients over multiple samples (n=2), reducing variance and training 2x faster
vs. DOODL: DRaFT optimizes model parameters (LoRA) once, while DOODL optimizes latents per image (slow inference)
vs. Fan & Lee (2023): DRaFT optimizes arbitrary differentiable rewards, while Fan & Lee focus on sampling speed (not cited in the paper as a reward-optimization baseline)

Limitations

Requires the reward function to be differentiable (unlike RL)
Full backpropagation (K=T) can lead to exploding gradients (mitigated by truncation)
Memory intensive without gradient checkpointing and LoRA

Reproducibility

The text does not state whether code is available. The method relies on standard JAX features (jax.checkpoint, jax.lax.stop_gradient).

📊 Experiments & Results

Evaluation Setup

Fine-tuning Stable Diffusion 1.4 on text prompts to maximize aesthetic/preference scores

Benchmarks:

LAION Aesthetics (Image Aesthetic Scoring)
PickScore (Human Preference Prediction)
Human Preference Score v2 (HPSv2) (Human Preference Prediction)

Metrics:

Reward Score (Maximization)
Training Speed (Efficiency)
Statistical methodology: Not explicitly reported in the paper

Key Results

Efficiency comparisons demonstrate substantial speedups over RL and previous gradient-based methods.

Benchmark	Metric	Baseline	This Paper	Δ
LAION Aesthetics	Training speedup vs. RL (×)	1.0	>200	>+199
Unknown (General)	Training speedup vs. ReFL (×)	1.0	~2.0	~+1.0

Experiment Figures

Linear combination of fine-tuned LoRA weights.

Main Takeaways

Gradient-based fine-tuning (DRaFT) is orders of magnitude more efficient (>200x) than RL for differentiable rewards.
Truncating backpropagation to the last few steps (DRaFT-K) prevents exploding gradients and improves performance per training step compared to differentiating through the full chain.
Averaging gradients over multiple noise samples (DRaFT-LV) effectively reduces variance for short backprop chains (K=1), outperforming ReFL.
The method is versatile, applicable to various rewards including compressibility, object detection, and adversarial attacks (qualitative finding).

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (DDPM/DDIM sampling)
Backpropagation through time
Gradient Checkpointing
Low-Rank Adaptation (LoRA)

Key Terms

DRaFT: Direct Reward Fine-Tuning—the proposed method of backpropagating reward gradients through the diffusion sampling chain

LoRA: Low-Rank Adaptation—a technique to fine-tune models by freezing main weights and training small, low-rank matrices added to them

Gradient Checkpointing: A technique to reduce memory usage during backpropagation by not storing all intermediate activations and re-computing them when needed

ReFL: Reward Feedback Learning—a baseline method that updates models using gradients from a predicted clean image at a random intermediate timestep

DOODL: Direct Optimization of Diffusion Latents—a method that optimizes the input noise latent rather than model parameters

CFG: Classifier-Free Guidance—a technique to improve image-text alignment by linearly combining conditional and unconditional noise predictions

UNet: The neural network architecture typically used in diffusion models to predict noise

RLHF: Reinforcement Learning from Human Feedback—training models using rewards derived from human preferences

PickScore: A reward model trained on human preferences to predict which of two images is preferred
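Two of the key terms above reduce to short formulas: the CFG combination of noise predictions, and the one-step "predicted clean image" that the ReFL entry refers to. The sketch below uses the standard DDPM parameterisation x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε as an assumption. The function names are hypothetical, and plain lists stand in for latent tensors.

```python
import math

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def predict_clean(x_t, eps_pred, alpha_bar_t):
    """One-step estimate of x_0 from a noisy x_t and a noise prediction."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise * e) / signal for x, e in zip(x_t, eps_pred)]
```

A guidance scale of 1 returns the conditional prediction unchanged, and larger values extrapolate away from the unconditional one. `predict_clean` recovers x_0 exactly when the noise prediction equals the true noise, which is why a reward evaluated on this estimate can stand in for a reward on the fully denoised sample.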