DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

📝 Paper Summary

Text-to-Image Generation RLHF (Reinforcement Learning from Human Feedback)

DPOK fine-tunes text-to-image diffusion models using online reinforcement learning with KL regularization to maximize human-aligned rewards while preserving image quality.

Core Problem

Supervised fine-tuning of diffusion models on fixed reward-weighted datasets often degrades image quality (e.g., oversaturation) and fails to fully optimize human alignment rewards.

Why it matters:

Current text-to-image models struggle with specific requirements like counting, attribute binding, and compositionality
Supervised fine-tuning on static datasets limits the model's ability to explore and maximize rewards beyond the pre-trained distribution
Without proper regularization, maximizing reward models often collapses image diversity or fidelity (reward hacking)

Concrete Example: When fine-tuning on the prompt 'A green colored rabbit', supervised methods often produce over-saturated, unnatural images. The original model also tends to render the prompt 'Four roses' as whiskey bottles rather than flowers.

Key Novelty

Diffusion Policy Optimization with KL regularization (DPOK)

Frames the diffusion denoising process as a multi-step MDP where the policy is the denoising network and the reward is given only at the final image step
Updates the model using Policy Gradient (REINFORCE) on online samples generated by the current model, allowing it to explore new high-reward regions
Incorporates KL divergence from the pre-trained model as an implicit reward to prevent mode collapse and preserve image fidelity
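
The MDP framing above can be illustrated with a toy rollout. This is a minimal sketch, not the paper's implementation: the latent is a single float, and `toy_policy`, `toy_reward`, and `rollout` are hypothetical names introduced here. It only shows the structure: the state is (prompt, timestep, current latent), the action is the next latent, and the reward is zero everywhere except the final step.

```python
import random

def toy_policy(prompt, t, x, rng):
    """Hypothetical stand-in for the denoising network: shrink x and add noise."""
    return 0.5 * x + rng.gauss(0.0, 0.1)

def toy_reward(x0, prompt):
    """Hypothetical stand-in for a reward model: prefers final samples near zero."""
    return -abs(x0)

def rollout(policy, reward_fn, prompt, T, rng):
    """Run one T-step denoising episode and return its (state, action, reward) list."""
    x = rng.gauss(0.0, 1.0)                  # initial state: pure noise x_T
    transitions = []
    for t in range(T, 0, -1):
        state = (prompt, t, x)
        action = policy(prompt, t, x, rng)   # action a_t is the next latent x_{t-1}
        # Sparse reward: only the last transition (which produces x_0) is scored.
        reward = reward_fn(action, prompt) if t == 1 else 0.0
        transitions.append((state, action, reward))
        x = action
    return transitions

if __name__ == "__main__":
    episode = rollout(toy_policy, toy_reward, "A green colored rabbit", 4, random.Random(0))
    for state, action, reward in episode:
        print(state[1], round(action, 3), round(reward, 3))
```

Because every intermediate reward is zero, the return of each step equals the final reward, which is why a REINFORCE-style update can weight all step-wise log-probabilities by the same terminal score.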

Architecture

Conceptual comparison between Supervised fine-tuning and RL fine-tuning workflows.

Evaluation Highlights

Outperforms Supervised Fine-Tuning (SFT) in human evaluation, with win rates of ~70% for alignment and ~60% for image quality on test prompts
Increases ImageReward score from 0.13 to 0.58 on Drawbench prompts while maintaining aesthetic quality
Corrects dataset bias: Fine-tuning on 'Four roses' shifts generation from whiskey bottles to flowers (ImageReward -0.52 → 1.12)

Breakthrough Assessment

8/10

Significant step in applying online RL to large-scale diffusion models. Demonstrates that online exploration outperforms supervised methods for alignment, addressing a key limitation in generative AI fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a conditional diffusion model p_theta(x_0|z) to maximize expected reward r(x_0, z) while staying close to a pre-trained model p_pre

Inputs: Text prompt z

Outputs: Generated image x_0

Pipeline Flow

Prompt Sampling (z ~ p(z))
Online Image Generation (x_0 ~ p_theta(.|z))
Reward Calculation (r(x_0, z) + KL penalty)
Policy Update (Gradient Ascent on theta)
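
The four pipeline steps can be sketched on a toy problem. The following is an illustrative assumption, not the paper's code: the "image" is one float, the policy has a single parameter `theta`, the prompt is fixed and represented only by a `target` value, the reward is a hand-written function, and a batch-mean baseline stands in for a learned value function. Each step is Gaussian with a shared variance, so the step-wise KL to the frozen pre-trained policy has a closed form.

```python
import random

def sample_trajectory(theta, T, sigma, rng):
    """Online generation: roll out x_T -> x_0 and accumulate the score function."""
    x = rng.gauss(0.0, 1.0)                       # x_T ~ N(0, 1)
    score_sum = 0.0                               # sum_t d/dtheta log p_theta(x_{t-1} | x_t)
    for _ in range(T):
        mean = 0.5 * x + theta                    # toy policy mean
        x_next = rng.gauss(mean, sigma)
        score_sum += (x_next - mean) / sigma**2   # Gaussian score w.r.t. theta
        x = x_next
    return x, score_sum

def train(steps=300, batch=64, T=5, sigma=0.5, beta=0.1, lr=0.01, target=2.0, seed=0):
    """KL-regularized REINFORCE on the toy chain; returns the learned theta."""
    rng = random.Random(seed)
    theta_pre, theta = 0.0, 0.0                   # pre-trained and current parameter
    for _ in range(steps):
        # Steps 1-2: (fixed) prompt, then online samples from the current policy.
        rollouts = [sample_trajectory(theta, T, sigma, rng) for _ in range(batch)]
        # Step 3: terminal reward for each sample.
        rewards = [-(x0 - target) ** 2 for x0, _ in rollouts]
        baseline = sum(rewards) / batch           # simple variance-reduction baseline
        pg = sum((r - baseline) * s for r, (_, s) in zip(rewards, rollouts)) / batch
        # Gradient of sum_t KL(N(m + theta, s^2) || N(m + theta_pre, s^2)).
        kl_grad = T * (theta - theta_pre) / sigma**2
        # Step 4: gradient ascent on reward minus beta * KL.
        theta += lr * (pg - beta * kl_grad)
    return theta

if __name__ == "__main__":
    print("with KL:", train(beta=0.1), "without KL:", train(beta=0.0))
```

In this toy setting a larger `beta` keeps `theta` closer to `theta_pre` at the cost of a lower final reward, which mirrors the trade-off the summary attributes to the KL term.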

System Modules

Diffusion Model (Policy)

Generate images from text prompts via iterative denoising

Model or implementation: Stable Diffusion v1.5 (UNet with LoRA)

Reward Model

Evaluate alignment of generated image with prompt

Model or implementation: ImageReward

Novel Architectural Elements

Framing the diffusion denoising chain as a T-step MDP where action a_t is the predicted next latent x_{t-1}
Upper-bound approximation of KL divergence between diffusion marginals using sum of step-wise conditional KL divergences
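
When each denoising step is a Gaussian whose variance is shared between the fine-tuned and pre-trained models (an assumption of this sketch, as in fixed-variance DDPM sampling), each conditional KL reduces to a scaled squared distance between the two predicted means. The helper names below are hypothetical, and the expectation over x_t is replaced by a single sampled trajectory.

```python
def gaussian_kl_same_var(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)) = ||mu_p - mu_q||^2 / (2 sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return sq_dist / (2.0 * sigma ** 2)

def kl_upper_bound(means_theta, means_pre, sigmas):
    """Sum of step-wise conditional KLs along one trajectory.

    means_theta[t] and means_pre[t] are the two models' predicted means for the
    same x_t; sigmas[t] is the shared standard deviation at that step.
    """
    return sum(
        gaussian_kl_same_var(m_theta, m_pre, s)
        for m_theta, m_pre, s in zip(means_theta, means_pre, sigmas)
    )

if __name__ == "__main__":
    # Two steps, one-dimensional latents, made-up numbers.
    print(kl_upper_bound([[1.0], [2.0]], [[0.0], [0.0]], [1.0, 2.0]))
```

The appeal of the step-wise form is that it needs only quantities the sampler already computes at every step, whereas the KL between the marginals over x_0 is not available in closed form.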

Modeling

Base Model: Stable Diffusion v1.5

Training Method: Online Reinforcement Learning (REINFORCE with value baseline)

Objective Functions:

Purpose: Maximize expected reward while minimizing divergence from pre-trained model.

Formally: E[r(x_0, z) - beta * KL(p_theta(x_0|z) || p_pre(x_0|z))]

Purpose: Update policy via gradient ascent.

Formally: grad J = E[ sum_t (r(x_0, z) - V(x_t)) * grad log p_theta(x_{t-1}|x_t, z) ]
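
For readers who prefer typeset notation, the two expressions can be written out as follows. This is a sketch under stated assumptions rather than a transcription of the paper: the marginal KL is replaced by the sum of step-wise conditional KLs listed under Novel Architectural Elements, the distribution of x_t is treated as fixed when differentiating the KL term, and the baseline V(x_t) is written inside the sum over t so that each step's score term is paired with its own baseline value.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{p(z)}\,\mathbb{E}_{p_\theta(x_0 \mid z)}\big[ r(x_0, z) \big]
  - \beta\, \mathbb{E}_{p(z)}\Big[ \mathrm{KL}\big(p_\theta(x_0 \mid z)\,\|\,p_{\mathrm{pre}}(x_0 \mid z)\big) \Big] \\[4pt]
\nabla_\theta J(\theta) &\approx
  \mathbb{E}\Big[ \sum_{t=1}^{T} \big(r(x_0, z) - V(x_t)\big)\,
  \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, z) \Big] \\
&\quad - \beta \sum_{t=1}^{T} \mathbb{E}\Big[ \nabla_\theta\,
  \mathrm{KL}\big(p_\theta(x_{t-1} \mid x_t, z)\,\|\,p_{\mathrm{pre}}(x_{t-1} \mid x_t, z)\big) \Big]
\end{aligned}
```

The first term is the REINFORCE estimator with a baseline; the second pulls each denoising step back toward the pre-trained model with strength beta.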

Adaptation: LoRA (applied to UNet module)

Trainable Parameters: Low-rank adapters only
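
To make "low-rank adapters only" concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer. It follows the standard LoRA formulation (frozen weight W plus a scaled product B·A) and is not taken from the DPOK code; the function names and the 768-dimensional example are illustrative assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x).

    W is d_out x d_in and stays frozen; A is r x d_in and B is d_out x r,
    and only A and B would receive gradient updates.
    """
    r = len(A)                                  # adapter rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, rank):
    """Number of trainable adapter parameters for one adapted weight matrix."""
    return rank * (d_in + d_out)

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]                            # rank 1
    B = [[1.0], [2.0]]
    print(lora_forward(W, A, B, [1.0, 2.0], alpha=1.0))
    # Illustrative sizes: a 768 x 768 projection with rank 4.
    print(768 * 768, lora_param_count(768, 768, 4))
```

With B initialized to zeros, the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pre-trained model's behavior.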

Training Data:

Prompts from MS-COCO (104 prompts)
Prompts from DrawBench (183 prompts)
Simple custom prompts for analysis (e.g., 'A green colored rabbit')

Key Hyperparameters:

learning_rate: 1e-5 (SFT) / 1e-5 to 3e-5 (RL)
batch_size: 4 to 16
gradient_accumulation: 1 to 4
kl_coefficient_beta: 0.01 to 0.1
training_steps: Varied (e.g., 2000 for single prompt, 6000 for multiple)
LoRA_rank: not explicitly detailed in the main text (4, 8, or 16 would be typical of standard LoRA usage); see the linked code
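
The hyperparameters above could be collected in a small configuration object. This is a hypothetical sketch: the class name `DPOKConfig` and the default values are illustrative picks from within the listed ranges, not settings reported for any specific experiment, and the LoRA rank in particular is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DPOKConfig:
    """Illustrative defaults chosen from the ranges listed above (assumptions)."""
    learning_rate: float = 1e-5
    batch_size: int = 4
    gradient_accumulation: int = 4
    kl_coefficient_beta: float = 0.01
    training_steps: int = 2000      # the single-prompt example value
    lora_rank: int = 4              # placeholder; the exact rank is not stated

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step."""
        return self.batch_size * self.gradient_accumulation

if __name__ == "__main__":
    cfg = DPOKConfig()
    print(cfg.effective_batch_size)
```

Batch size and gradient accumulation only matter through their product, so a run with batch size 4 and accumulation 4 sees the same number of samples per update as batch size 16 with no accumulation.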

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Reward-Weighted Regression: DPOK uses online sampling and updates, whereas Lee et al. use offline supervised datasets. DPOK integrates KL as an implicit reward term rather than a static loss component.
vs. DDPO (cited as concurrent work with a similar MDP formulation): DPOK explicitly analyzes and incorporates KL regularization with respect to the pre-trained model, framing it as essential for maintaining image quality.

Limitations

Computational cost of online sampling is higher than supervised fine-tuning
Fine-tuning on multiple prompts requires careful hyperparameter tuning and longer training
Performance depends on the quality and robustness of the reward model (ImageReward)
RL fine-tuning can be unstable without proper regularization (KL)

Reproducibility

Code: https://github.com/google-research/google-research/tree/master/dpok

Code is publicly available. Uses open-source models (Stable Diffusion v1.5, ImageReward). Hyperparameters provided for specific experiments in Appendix B.

📊 Experiments & Results

Evaluation Setup

Text-to-image generation on specific prompts testing alignment (color, count, etc.) and general prompts from benchmarks.

Benchmarks:

Custom Prompts (specific alignment tasks: color, count, composition, location) [New]
MS-COCO (General text-to-image generation)
DrawBench (Challenging text-to-image generation)

Metrics:

ImageReward score (alignment)
Aesthetic score (image quality)
Human preference (win rate)
Statistical methodology: Human evaluation used 8 independent raters per query. Mean and standard deviation reported for human eval.
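
One plausible way to turn per-query rater votes into a win rate with mean and standard deviation is sketched below. The aggregation scheme, the function names, and the vote data are all assumptions made for illustration; the summary does not specify how the reported statistics were computed.

```python
from statistics import mean, stdev

def win_rate(votes):
    """Fraction of votes for the RL model; each vote is 'rl' or 'sft'."""
    return sum(v == "rl" for v in votes) / len(votes)

def summarize(per_query_votes):
    """Mean and sample standard deviation of per-query win rates."""
    rates = [win_rate(votes) for votes in per_query_votes]
    return mean(rates), stdev(rates)

if __name__ == "__main__":
    # Made-up votes from 8 raters on two queries.
    votes = [
        ["rl"] * 6 + ["sft"] * 2,
        ["rl"] * 4 + ["sft"] * 4,
    ]
    print(summarize(votes))
```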

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MS-COCO (104 prompts)	ImageReward score	0.22	0.55	+0.33
MS-COCO (104 prompts)	Aesthetic score	5.39	5.43	+0.04
DrawBench (183 prompts)	ImageReward score	0.13	0.58	+0.45
Single Prompt ('A green colored rabbit')	ImageReward score	0.75	1.45	+0.70
Custom Prompts	Win Rate (Alignment)	15%	70%	+55 pp
Custom Prompts	Win Rate (Image Quality)	20%	60%	+40 pp

Experiment Figures

Quantitative comparison (ImageReward, Aesthetic Score) and Human Evaluation between Original, SFT, and RL models.

Ablation of KL regularization.

Main Takeaways

Online RL (DPOK) consistently achieves higher alignment rewards than Supervised Fine-Tuning (SFT) because it optimizes against the reward model on its own distribution.
SFT often leads to image quality degradation (e.g., oversaturation), whereas DPOK with KL regularization maintains photorealism.
KL regularization is critical: without it, RL models can produce unnatural images; SFT benefits less from KL but still sees some quality preservation.
RL fine-tuning can correct dataset biases (e.g., 'Four roses' whiskey -> flowers) by leveraging human-feedback reward models.

📚 Prerequisite Knowledge

Prerequisites

Denoising Diffusion Probabilistic Models (DDPM)
Reinforcement Learning (Policy Gradient/REINFORCE)
KL Divergence
LoRA (Low-Rank Adaptation)

Key Terms

DPOK: Diffusion Policy Optimization with KL regularization—the proposed online RL algorithm

SFT: Supervised Fine-Tuning—training on a fixed dataset of high-reward samples rather than exploring online

ImageReward: A reward model trained on human preference data to score text-image alignment

KL regularization: Penalizing the fine-tuned model when its output distribution diverges too far from that of the pre-trained model, used to maintain image quality

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model and trains small adapter matrices

DDPM: Denoising Diffusion Probabilistic Models—generative models that create data by iteratively removing noise

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

REINFORCE: A basic policy gradient algorithm in reinforcement learning that updates policies based on the return of sampled trajectories

aesthetic score: A metric predicting the visual appeal of an image, often used to filter low-quality generations