DPOK: Diffusion Policy Optimization with KL regularization—the proposed online RL algorithm
SFT: Supervised Fine-Tuning—training on a fixed dataset of high-reward samples rather than exploring online
ImageReward: A reward model trained on human preference data to score text-image alignment
KL regularization: Penalizing the fine-tuned model for diverging too far from the pre-trained model, measured by the KL divergence between their output distributions; used to maintain image quality
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model and trains small adapter matrices
DDPM: Denoising Diffusion Probabilistic Models—generative models that create data by iteratively removing noise
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
REINFORCE: A basic policy gradient algorithm in reinforcement learning that updates policies based on the return of sampled trajectories
Aesthetic score: A metric predicting the visual appeal of an image, often used to filter low-quality generations
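
To make the REINFORCE and KL regularization entries concrete, here is a minimal sketch on a toy problem. It is not the DPOK algorithm itself: the "policy" is a three-way categorical distribution rather than a diffusion model, the "pre-trained" reference is simply the uniform distribution, and the reward pays 1 for action 2 and 0 otherwise. All names (`reinforce_kl_step`, `train`, `beta`, `lr`) are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def toy_reward(action):
    """Assumed toy reward: only action 2 is rewarded."""
    return 1.0 if action == 2 else 0.0


def reinforce_kl_step(theta, ref_probs, beta, lr, batch_size, rng):
    """One gradient-ascent step on E[reward] - beta * KL(policy || reference)."""
    probs = softmax(theta)
    k = len(theta)
    grad = [0.0] * k

    # REINFORCE estimate of the reward gradient:
    # reward * d/dtheta log pi(a) = reward * (onehot(a) - probs)
    for _ in range(batch_size):
        a = rng.choices(range(k), weights=probs)[0]
        r = toy_reward(a)
        for j in range(k):
            indicator = 1.0 if j == a else 0.0
            grad[j] += r * (indicator - probs[j]) / batch_size

    # Exact gradient of the KL penalty for a softmax policy:
    # dKL/dtheta_j = probs_j * (log probs_j - log ref_j - KL)
    kl = kl_divergence(probs, ref_probs)
    for j in range(k):
        log_ratio = math.log(probs[j]) - math.log(ref_probs[j])
        grad[j] -= beta * probs[j] * (log_ratio - kl)

    return [t + lr * g for t, g in zip(theta, grad)]


def train(beta, steps=300, seed=0):
    """Train from the reference policy; return final probabilities and KL."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    ref_probs = softmax(theta)  # frozen stand-in for the pre-trained model
    for _ in range(steps):
        theta = reinforce_kl_step(theta, ref_probs, beta, 0.5, 32, rng)
    probs = softmax(theta)
    return probs, kl_divergence(probs, ref_probs)


if __name__ == "__main__":
    for beta in (0.0, 1.0):
        probs, kl = train(beta)
        print(f"beta={beta}: P(rewarded action)={probs[2]:.3f}, KL to reference={kl:.3f}")
```

With `beta=0.0` the policy is free to collapse onto the rewarded action. A positive `beta` keeps it closer to the reference distribution, which is the trade-off the KL regularization entry describes.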