D3PO: Direct Preference for Denoising Diffusion Policy Optimization—the proposed method to fine-tune diffusion models directly from preferences without a reward model
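RLHF: Reinforcement Learning from Human Feedback—a technique to align AI models with human intent, typically by training a reward model on human ratings or preferences and then optimizing the model against that reward with reinforcement learning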
MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker
DPO: Direct Preference Optimization—an algorithm originally for LLMs that optimizes policies directly from preference pairs (winner/loser) without an explicit reward model
DDPO: Denoising Diffusion Policy Optimization—a prior method that treats the denoising process as a multi-step MDP and optimizes it with policy gradients, but requires a reward signal, typically supplied by a separate reward model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers, reducing the number of trainable parameters and the memory needed
Stable Diffusion: A popular open-source text-to-image diffusion model used as the base model in this paper
UNet: The neural network architecture used within Stable Diffusion to predict noise at each step
Dirac delta distribution: A distribution representing a point mass, used here to describe deterministic state transitions in the simplified MDP formulation
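
To make the DPO entry above concrete, here is a minimal sketch of the generic DPO loss for a single preference pair. It is an illustration under stated assumptions, not the paper's implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are hypothetical, and D3PO's own objective applies this idea to the steps of the denoising process rather than to whole-sample log-probabilities as shown here.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (winner, loser) preference pair.

    logp_w, logp_l: log-probabilities of the winner and loser under the
        policy being fine-tuned.
    ref_logp_w, ref_logp_l: the same quantities under the frozen
        reference model.
    beta: strength of the implicit KL regularization (assumed value).
    """
    # Implicit reward margin: how much more the policy favors the winner
    # over the loser, relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero.
loss_at_init = dpo_loss(-1.5, -1.5, -1.5, -1.5)

# When the policy has shifted probability toward the winner, the loss drops.
loss_after_shift = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

No reward model appears anywhere in the computation: the preference pair and the two sets of log-probabilities are the only inputs, which is the property the DPO and D3PO entries refer to.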