DI*: Diff-Instruct*, the proposed post-training method that aligns one-step generators while regularizing training with a score-based divergence
One-step generator: A generative model that maps noise to a final image in a single forward pass, unlike multi-step diffusion models
Score function: The gradient of the log-probability density with respect to the data; diffusion models learn to approximate this
Reward hacking: A failure mode in reinforcement learning (RL) in which the model exploits flaws in the reward function to score highly without producing the intended high-quality output (e.g., images with unnatural artifacts that still receive high reward)
Pseudo-Huber distance: A robust loss function used here as a distance between score functions to regularize training; it grows quadratically for small differences (like a squared L2 loss) and linearly for large ones (like L1)
CFG: Classifier-Free Guidance—a technique in diffusion models that improves sample quality by mixing conditional and unconditional score estimates
Implicit Reward: A reward signal derived mathematically from the Classifier-Free Guidance formulation, used to align the model without an external reward model
Reference diffusion: A pre-trained, frozen diffusion model used as an anchor to prevent the one-step model from drifting away from realistic image statistics
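
Two of the terms above, the Pseudo-Huber distance and CFG, have compact closed forms, so a minimal NumPy sketch may help make them concrete. It is an illustration under stated assumptions, not the method's actual implementation: the form sqrt(||y||^2 + c^2) - c is one common way to write the Pseudo-Huber distance, the CFG rule below uses the "unconditional plus scaled difference" convention (other sources parameterize the guidance weight differently), and the names `pseudo_huber`, `cfg_score`, `c`, and `w` are hypothetical.

```python
import numpy as np


def pseudo_huber(y, c=1.0):
    """Pseudo-Huber distance sqrt(||y||^2 + c^2) - c over the last axis.

    Assumed form for illustration; `c` sets where the distance switches
    from quadratic to linear growth.
    """
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum(y * y, axis=-1) + c * c) - c


def cfg_score(score_cond, score_uncond, w):
    """Classifier-free guidance under the convention
    s_uncond + w * (s_cond - s_uncond).

    With this (assumed) convention, w = 0 gives the unconditional estimate,
    w = 1 the conditional one, and w > 1 extrapolates past it.
    """
    score_cond = np.asarray(score_cond, dtype=float)
    score_uncond = np.asarray(score_uncond, dtype=float)
    return score_uncond + w * (score_cond - score_uncond)


if __name__ == "__main__":
    # Toy "score" vectors standing in for network outputs.
    s_cond = np.array([1.0, 2.0])
    s_uncond = np.array([0.0, 1.0])

    guided = cfg_score(s_cond, s_uncond, w=2.0)
    print("guided score:", guided)

    # Distance between the guided score and the conditional score.
    print("pseudo-Huber distance:", pseudo_huber(guided - s_cond, c=1.0))
```

For a difference vector whose norm is small relative to `c`, the distance is approximately ||y||^2 / (2c); for a large norm it approaches ||y|| - c. This is the quadratic-to-linear transition that makes the loss less sensitive to outliers than a squared error.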