
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Qimai Li, Weihan Shen, Xiaolong Zhu, Xiu Li
Tsinghua Shenzhen International Graduate School, Tsinghua University, Department of Automation, Tsinghua University, Parametrix Technology Company Ltd.
Computer Vision and Pattern Recognition (2023)
RL MM

📝 Paper Summary

Reinforcement Learning from Human Feedback (RLHF), Diffusion Model Fine-tuning, Preference Optimization
D3PO fine-tunes diffusion models directly from human preference data without a separate reward model by formulating the denoising process as a multi-step MDP, reducing memory usage and training costs.
Core Problem
Fine-tuning diffusion models with RLHF typically requires training an expensive 'reward model' first, and applying direct preference methods (like DPO) from LLMs to diffusion is memory-prohibitive due to the multi-step denoising process.
Why it matters:
  • Training robust reward models (like ImageReward) requires massive datasets and extensive human labor, making alignment expensive
  • Directly applying LLM-based DPO to diffusion would require storing gradients for entire image generation trajectories, causing unsustainable GPU memory consumption
  • Current methods struggle to efficiently fix specific image defects (e.g., deformed hands) without complex reward engineering
Concrete Example: To fix deformed hands in generated images, standard methods first need a large dataset of deformed vs. normal hands to train a reward model. D3PO skips this step and updates the generator directly from pairwise human choices (image A is preferred over image B).
Key Novelty
Direct Preference for Denoising Diffusion Policy Optimization (D3PO)
  • Conceptually treats the diffusion denoising steps as a multi-step Markov Decision Process (MDP) rather than a single-step generation
  • Demonstrates mathematically that updating the policy directly based on preferences in this MDP is equivalent to learning an optimal reward model and then using it
  • Bypasses the need for a separate reward network, saving memory and allowing direct optimization from human feedback data (A > B)
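The per-step view above can be illustrated with a minimal sketch. It assumes a DPO-style objective applied independently at each denoising step, where the fine-tuned policy and a frozen reference model each assign a log-probability to the denoising action taken at that step. The function names, the `beta` value, and the averaging over steps are illustrative assumptions and are not taken from the paper's code; the exact loss used by D3PO may differ in detail.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_preference_loss(logp_win: float, logp_ref_win: float,
                         logp_lose: float, logp_ref_lose: float,
                         beta: float = 0.1) -> float:
    """DPO-style loss for one denoising step (hypothetical helper).

    logp_*     : log-probability of the step's action under the policy
    logp_ref_* : the same quantity under the frozen reference model
    'win' / 'lose' refer to the preferred and rejected trajectories.
    """
    margin = beta * ((logp_win - logp_ref_win) - (logp_lose - logp_ref_lose))
    return -log_sigmoid(margin)


def trajectory_loss(win_steps, lose_steps, beta: float = 0.1) -> float:
    """Average the per-step loss over paired denoising steps.

    Each argument is a list of (logp_policy, logp_reference) tuples,
    one tuple per denoising step. Averaging is an assumption here.
    """
    losses = [
        step_preference_loss(pw, rw, pl, rl, beta)
        for (pw, rw), (pl, rl) in zip(win_steps, lose_steps)
    ]
    return sum(losses) / len(losses)


# Example: a two-step trajectory pair with made-up log-probabilities.
win = [(-1.0, -1.2), (-0.8, -0.9)]
lose = [(-1.5, -1.3), (-1.1, -1.0)]
loss = trajectory_loss(win, lose)
```

Because each term depends on a single denoising step, a loss of this shape can be evaluated and backpropagated one step at a time, which is consistent with the memory argument above. No reward network appears anywhere in the computation; only the policy and the reference model are queried.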
Evaluation Highlights
  • Reduced the rate of images with abnormal hands from 28.4% (Stable Diffusion v1.4) to 13.9% using human feedback
  • Reduced safety violations (NSFW content) from 25.0% to just 3.0% in safety-alignment experiments
  • Achieved a 70.3% win rate against the base Stable Diffusion v1.4 model in human preference evaluations
Breakthrough Assessment
8/10
Significant efficiency breakthrough. Eliminating the reward model makes aligning diffusion models much cheaper and faster, adapting the successful DPO paradigm from LLMs to the specific mathematical constraints of diffusion.