GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, removing the need for a separate value network (critic)
SDE: Stochastic Differential Equation—a differential equation where one or more terms are stochastic processes, used here to inject noise into the sampling process for exploration
DDPO: Denoising Diffusion Policy Optimization—a prior RL method for fine-tuning diffusion models using policy gradients
DPOK: Diffusion Policy Optimization with KL regularization—another prior RL method for diffusion models
Rectified Flow: A generative model framework that learns a transport map between noise and data distributions via Ordinary Differential Equations (ODEs)
CFG: Classifier-Free Guidance—a technique to improve sample quality by mixing conditional and unconditional score estimates
Best-of-N: An inference strategy where N samples are generated and the best one is selected based on a reward model; used here as a scaling strategy for training data
HPS-v2.1: Human Preference Score—a reward model trained to predict human aesthetic preferences for images
VideoAlign: A reward model for video generation assessing aesthetics, motion quality, and text alignment
ReFL: Reward Feedback Learning—a method that fine-tunes diffusion models by backpropagating the gradient of a differentiable reward model through the generated sample
MDP: Markov Decision Process—a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker
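
Three of the definitions above (GRPO, CFG, Best-of-N) reduce to short arithmetic, sketched below in plain Python. This is an illustrative sketch rather than any paper's implementation: the function names, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of samples drawn from the same input.

    Each reward is centered on the group mean and scaled by the group standard
    deviation, so no learned value network is needed as a baseline.
    Assumption: population std plus a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cfg_mix(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise.

    scale = 0 returns the unconditional estimate, scale = 1 the conditional
    one, and scale > 1 extrapolates past the conditional estimate.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def best_of_n(samples, rewards):
    """Return the sample with the highest reward among N candidates."""
    best_index = max(range(len(rewards)), key=rewards.__getitem__)
    return samples[best_index]


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0, 4.0]  # hypothetical reward-model scores
    print(group_relative_advantages(rewards))
    print(cfg_mix([1.0, 2.0], [0.0, 1.0], scale=2.0))
    print(best_of_n(["a", "b", "c", "d"], rewards))
```

Within a group, samples scoring above the group mean receive positive advantages and those below receive negative ones; if every sample in the group gets the same reward, all advantages are zero and the group contributes no learning signal.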