Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF) | Diffusion Model Fine-tuning | Preference Optimization

D3PO fine-tunes diffusion models directly from human preference data without a separate reward model by formulating the denoising process as a multi-step MDP, reducing memory usage and training costs.

Core Problem

Fine-tuning diffusion models with RLHF typically requires training an expensive 'reward model' first, and applying direct preference methods (like DPO) from LLMs to diffusion is memory-prohibitive due to the multi-step denoising process.

Why it matters:

Training robust reward models (like ImageReward) requires massive datasets and extensive human labor, making alignment expensive
Directly applying LLM-based DPO to diffusion would require storing gradients for entire image generation trajectories, causing unsustainable GPU memory consumption
Current methods struggle to efficiently fix specific image defects (e.g., deformed hands) without complex reward engineering

Concrete Example: To fix deformed hands in generated images, standard methods first need a large dataset of deformed vs. normal hands to train a reward model. D3PO skips this step and updates the generator directly from human choices (image A is better than image B).

Key Novelty

Direct Preference for Denoising Diffusion Policy Optimization (D3PO)

Conceptually treats the diffusion denoising steps as a multi-step Markov Decision Process (MDP) rather than a single-step generation
Demonstrates mathematically that updating the policy directly based on preferences in this MDP is equivalent to learning an optimal reward model and then using it
Bypasses the need for a separate reward network, saving memory and allowing direct optimization from human feedback data (A > B)
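
The MDP view can be written out explicitly. The mapping below is a sketch in the style of the DDPO formulation that D3PO builds on; the exact indexing is an assumption of this summary and may differ from the paper's notation. Here $c$ is the prompt, $x_T, \dots, x_0$ are the latents from pure noise to the final image, and $\delta$ is the Dirac delta distribution (deterministic transition).

```latex
\begin{aligned}
s_t &= (c,\; t,\; x_{T-t}), & a_t &= x_{T-t-1},\\
\pi_\theta(a_t \mid s_t) &= p_\theta(x_{T-t-1} \mid x_{T-t},\, c), &
P(s_{t+1} \mid s_t, a_t) &= \delta_{(c,\; t+1,\; x_{T-t-1})},\\
\rho_0(s_0) &= \big(p(c),\; \delta_{0},\; \mathcal{N}(0, I)\big).
\end{aligned}
```

Under this reading, each denoising step is one action, so a preference over final images can be turned into an update on individual steps rather than on the whole generation at once.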

Evaluation Highlights

Reduced rate of images with abnormal hands from 28.4% (Stable Diffusion v1.4) to 13.9% using human feedback
Reduced safety violations (NSFW content) from 25.0% to just 3.0% in safety-alignment experiments
Achieved a 70.3% win rate against the base Stable Diffusion v1.4 model in human preference evaluations

Breakthrough Assessment

8/10

Significant efficiency breakthrough. Eliminating the reward model makes aligning diffusion models much cheaper and faster, and it adapts the successful DPO paradigm from LLMs to the specific mathematical constraints of diffusion.

⚙️ Technical Details

Problem Definition

Setting: Fine-tuning a conditional generative diffusion model p(x|c) using a dataset of human preferences D = {(c, x_w, x_l)}, where x_w is the preferred and x_l the dispreferred image for prompt c

Inputs: Text prompt c (conditioning information)

Outputs: Generated image x0 aligned with human preference (e.g., aesthetic, safe, non-deformed)

Pipeline Flow

Input Prompt -> [Stable Diffusion UNet (Policy)] -> Denoising Steps (T to 0) -> Final Image
Feedback Loop: Image Pair -> Human Preference -> D3PO Loss Update -> UNet Weights
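
The data side of this feedback loop can be sketched as follows. The sketch assumes, as a simplification, that a human choice between two final images is propagated to every denoising step of the two trajectories, so each step of the winning trajectory is paired with the same-index step of the losing one. `Step`, `PreferencePair`, and `expand_to_step_pairs` are hypothetical names for illustration and are not taken from the D3PO codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    t: int           # denoising timestep index (assumed layout)
    log_prob: float  # log-probability of the sampled latent at this step


@dataclass
class PreferencePair:
    prompt: str
    winner: List[Step]  # trajectory whose final image the human preferred
    loser: List[Step]   # trajectory whose final image the human rejected


def expand_to_step_pairs(pair: PreferencePair) -> List[Tuple[Step, Step]]:
    """Pair up winner/loser steps that share the same position in the trajectory."""
    if len(pair.winner) != len(pair.loser):
        raise ValueError("trajectories must have the same number of steps")
    return list(zip(pair.winner, pair.loser))


# Usage: two 2-step trajectories for one prompt yield two per-step training pairs.
winner = [Step(t=2, log_prob=-1.0), Step(t=1, log_prob=-0.5)]
loser = [Step(t=2, log_prob=-1.2), Step(t=1, log_prob=-0.9)]
step_pairs = expand_to_step_pairs(PreferencePair("a hand", winner, loser))
```

Each element of `step_pairs` is then a candidate input to a per-step preference loss, which is what keeps memory bounded by a single step instead of a full trajectory.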

System Modules

Stable Diffusion UNet

Predicts the noise or mean of the previous latent state (Action Value Function in MDP terms)

Model or implementation: Stable Diffusion v1.4 or v1.5

Novel Architectural Elements

MDP Formulation of Denoising: The denoising process is structurally mapped to an MDP where the UNet acts as the policy/action-value function, enabling step-wise RL updates
Reward-Free Optimization Loop: The pipeline removes the 'Reward Model' component found in standard RLHF, connecting human preference data directly to policy updates via the D3PO loss
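
To make the policy view concrete, the sketch below scores one denoising step as an action under an isotropic Gaussian, which is the usual form of a DDPM-style reverse step. It is a toy example in plain Python on flat lists; `denoise_step_log_prob` is a hypothetical helper, and in a real pipeline the mean would be derived from the UNet's prediction rather than passed in by hand.

```python
import math
from typing import Sequence


def denoise_step_log_prob(x_prev: Sequence[float],
                          mean: Sequence[float],
                          sigma: float) -> float:
    """Log-density of x_prev under N(mean, sigma^2 I), summed over dimensions."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if len(x_prev) != len(mean):
        raise ValueError("x_prev and mean must have the same length")
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(log_norm - 0.5 * ((x - m) / sigma) ** 2
               for x, m in zip(x_prev, mean))


# Usage: an action close to the predicted mean is more likely than a distant one.
near = denoise_step_log_prob([0.1, -0.1], [0.0, 0.0], sigma=1.0)
far = denoise_step_log_prob([2.0, -2.0], [0.0, 0.0], sigma=1.0)
```

These per-step log-probabilities are the quantities that a step-wise RL or preference update compares between the fine-tuned and reference models.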

Modeling

Base Model: Stable Diffusion v1.4 / v1.5

Training Method: Direct Preference for Denoising Diffusion Policy Optimization (D3PO)

Objective Functions:

Purpose: Optimize policy to favor preferred trajectories while staying close to reference model.

Formally: The loss is the negative log-sigmoid of the implicit reward difference, i.e., the scaled difference between the policy/reference log-probability ratios of the preferred and dispreferred samples (Eq. 12 in paper)
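
A minimal sketch of such a per-step loss on scalar log-probabilities is given below, assuming the DPO-style form described above. The function names and the default `beta=0.1` are illustrative assumptions, not values confirmed by the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def d3po_step_loss(logp_w: float, logp_ref_w: float,
                   logp_l: float, logp_ref_l: float,
                   beta: float = 0.1) -> float:
    """DPO-style loss for one denoising step.

    logp_w / logp_l: log-prob of the winner / loser step under the trained policy.
    logp_ref_w / logp_ref_l: the same steps under the frozen reference model.
    beta: strength of the pull toward the reference (hypothetical default).
    """
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    return -log_sigmoid(margin)


# Usage: when the policy equals the reference, the margin is 0 and the loss is log(2).
baseline_loss = d3po_step_loss(-1.0, -1.0, -2.0, -2.0)
# Raising the winner's log-prob relative to the reference lowers the loss.
improved_loss = d3po_step_loss(-0.5, -1.0, -2.0, -2.0)
```

Because the loss depends only on log-ratios against the reference, the policy is rewarded for shifting probability toward preferred steps while large deviations from the reference are tempered by `beta`.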

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA weights added to the UNet attention layers
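
To illustrate why LoRA keeps the trainable footprint small, the sketch below compares the parameter count of a full weight matrix with that of a rank-r update W + BA, where B is d_out x r and A is r x d_in. The layer size and rank in the usage line are illustrative assumptions, not settings reported for D3PO.

```python
from typing import Tuple


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (full_params, lora_params) for one linear layer.

    full_params: entries of the frozen weight matrix W (d_out x d_in).
    lora_params: entries of the trainable factors B (d_out x rank) and A (rank x d_in).
    """
    if min(d_in, d_out, rank) <= 0:
        raise ValueError("dimensions and rank must be positive")
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)
    return full_params, lora_params


# Usage with a hypothetical 320x320 attention projection and rank 4.
full, lora = lora_param_counts(d_in=320, d_out=320, rank=4)
fraction = lora / full  # 2560 / 102400 = 0.025
```

Only the low-rank factors receive gradients, so optimizer state and gradient memory scale with `lora_params` rather than `full_params`.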

Training Data:

Human preference pairs (winner/loser images)
Pick-a-Pic dataset
Custom datasets for deformity and safety

Compute: 1 NVIDIA A800 GPU; Training time 20-40 minutes (extremely efficient)

Comparison to Prior Work

vs. DDPO: D3PO eliminates the need for the reward model entirely, reducing cost and complexity
vs. DPO (for LLMs): D3PO adapts DPO to multi-step MDPs, whereas standard DPO treats generation as a single step, which is memory-prohibitive for diffusion because it would require gradients over the entire denoising trajectory
vs. ReFL: D3PO is a direct optimization method, whereas ReFL requires a pre-trained ImageReward network
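
The contrast with standard DPO can be made explicit. The equations below are a sketch in this summary's own notation (prompt $c$, preferred and dispreferred samples marked $w$ and $l$) and are not a transcription of the paper's equations. The first line is the standard DPO objective applied to final images, the second shows why the image likelihood involves every denoising step, and the third is a per-step objective of the kind D3PO optimizes.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}\!\left[\log \sigma\!\left(
  \beta \log \frac{\pi_\theta(x_0^{w} \mid c)}{\pi_{\mathrm{ref}}(x_0^{w} \mid c)}
  - \beta \log \frac{\pi_\theta(x_0^{l} \mid c)}{\pi_{\mathrm{ref}}(x_0^{l} \mid c)}
\right)\right],\\[4pt]
\log p_\theta(x_{0:T} \mid c) &= \log p(x_T) + \sum_{t=1}^{T} \log p_\theta(x_{t-1} \mid x_t,\, c),\\[4pt]
\mathcal{L}_{t}(\theta) &= -\,\mathbb{E}\!\left[\log \sigma\!\left(
  \beta \log \frac{p_\theta(x_{t-1}^{w} \mid x_t^{w},\, c)}{p_{\mathrm{ref}}(x_{t-1}^{w} \mid x_t^{w},\, c)}
  - \beta \log \frac{p_\theta(x_{t-1}^{l} \mid x_t^{l},\, c)}{p_{\mathrm{ref}}(x_{t-1}^{l} \mid x_t^{l},\, c)}
\right)\right].
\end{aligned}
```

Evaluating the first objective through the trajectory likelihood requires backpropagating through all $T$ steps at once, whereas the per-step objective needs gradients for only one step at a time.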

Limitations

Relies on the assumption that cumulative rewards follow a normal distribution, which may not always hold perfectly
Requires human preference data (or a proxy) which can still be expensive to collect if not using existing datasets
Performance depends on the quality and size of the preference dataset

Reproducibility

Code: https://github.com/yk7333/D3PO

The paper uses standard pre-trained models (Stable Diffusion) together with public datasets (Pick-a-Pic) or custom-collected prompts described in the text.

📊 Experiments & Results

Evaluation Setup

Fine-tuning Stable Diffusion on specific tasks (Hand/Body generation, Safety) and measuring improvement via human evaluation and proxy metrics.

Benchmarks:

Deformity Correction (Image Quality / Anatomy) [New]
Safety Alignment (Harmlessness / NSFW filtering) [New]
Prompt-Image Alignment (General text-to-image alignment)

Metrics:

Human Preference Win Rate
Abnormal Rate (for hands/bodies)
Unsafe Rate (for NSFW)
Statistical methodology: Not explicitly reported in the paper

Key Results

D3PO significantly reduces image defects compared to the base Stable Diffusion model.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Hand Generation	Abnormal Rate	28.4	13.9	-14.5
Body Generation	Abnormal Rate	34.4	17.4	-17.0
NSFW Prompts	Unsafe Rate	25.0	3.0	-22.0

D3PO achieves high win rates against the base model in human evaluations.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
General Preference	Win Rate vs SD v1.4	50.0	70.3	+20.3
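
The Δ column can be re-derived from the reported values. The sketch below does so and also computes the relative change with respect to the baseline, which is not reported in this summary and is included only as a derived quantity; `delta_and_relative` is a hypothetical helper.

```python
from typing import List, Tuple

# (benchmark, baseline %, this paper %) as listed in the Key Results table.
results: List[Tuple[str, float, float]] = [
    ("Hand Generation", 28.4, 13.9),
    ("Body Generation", 34.4, 17.4),
    ("NSFW Prompts", 25.0, 3.0),
    ("General Preference", 50.0, 70.3),
]


def delta_and_relative(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    delta = round(ours - baseline, 1)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return delta, relative


for name, baseline, ours in results:
    delta, relative = delta_and_relative(baseline, ours)
    print(f"{name}: {delta:+.1f} pp ({relative:+.1f}% relative)")
```

Reporting both views helps when baselines differ: a change in percentage points and a relative change can rank the same set of results differently.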

Experiment Figures

Qualitative comparison of generated images (hands and full-body shots) before and after fine-tuning.

Main Takeaways

D3PO effectively fine-tunes diffusion models to reduce specific defects (hands, bodies) and improve safety without needing a separate reward model.
The method is highly efficient, requiring only 20-40 minutes on a single A800 GPU to achieve these results.
Theoretical analysis shows that direct preference optimization in the MDP setting is equivalent to first learning an optimal reward model and then using it to guide the policy.
The approach overcomes the memory bottleneck that prevents applying standard LLM-DPO to diffusion models.

📚 Prerequisite Knowledge

Prerequisites

Denoising Diffusion Probabilistic Models (DDPM)
Reinforcement Learning from Human Feedback (RLHF)
Markov Decision Processes (MDP)
Direct Preference Optimization (DPO)

Key Terms

D3PO: Direct Preference for Denoising Diffusion Policy Optimization—the proposed method to fine-tune diffusion models directly from preferences without a reward model

RLHF: Reinforcement Learning from Human Feedback—a technique to align AI models with human intent using rewards derived from human ratings

MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker

DPO: Direct Preference Optimization—an algorithm originally for LLMs that optimizes policies directly from preference pairs (winner/loser) without an explicit reward model

DDPO: Denoising Diffusion Policy Optimization—a prior method that treats denoising as an MDP but typically requires a separate reward model

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the original weights and trains small low-rank matrices added to selected layers, which saves memory

Stable Diffusion: A popular open-source text-to-image diffusion model used as the base model in this paper

UNet: The neural network architecture used within Stable Diffusion to predict noise at each step

Dirac delta distribution: A distribution representing a point mass, used here to describe deterministic state transitions in the simplified MDP formulation