VLO: Variational Lower Bound—the standard training objective for diffusion models, which maximizes a lower bound on the log-likelihood of the data.
Q-weighted VLO: The proposed objective function where the VLO is weighted by the Q-value (expected return) to align diffusion training with reward maximization.
DDPM: Denoising Diffusion Probabilistic Models—generative models that learn to reverse a gradual noise-adding process to generate data.
ELBO: Evidence Lower Bound—often synonymous with VLO in variational inference contexts.
DIPO: Diffusion Policy Optimization—a prior method using Q-gradients to update actions in the replay buffer before diffusion training.
QSM: Q-Score Matching—a prior method aligning the diffusion score function with Q-function gradients.
SAC: Soft Actor-Critic—a standard maximum entropy RL algorithm using Gaussian policies.
MuJoCo: Multi-Joint dynamics with Contact—a physics engine used as a standard benchmark for continuous control RL.
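
To make the Q-weighted VLO entry above concrete, here is a minimal sketch of how a per-sample diffusion loss might be weighted by Q-values. It rests on several assumptions not stated in this glossary: the per-sample VLO is approximated by the simplified DDPM noise-prediction error (mean squared error between true and predicted noise), negative Q-values are clipped to zero so the weights stay non-negative, and the function name `q_weighted_loss` and its arguments are hypothetical. It illustrates the idea only and is not the proposed method's exact weighting scheme.

```python
def q_weighted_loss(eps_true, eps_pred, q_values):
    """Average of per-sample noise-prediction errors, each weighted by its Q-value.

    eps_true, eps_pred: lists of equal-length float lists (one per sample).
    q_values: one float per sample (hypothetical critic outputs).
    """
    if not (len(eps_true) == len(eps_pred) == len(q_values)):
        raise ValueError("all inputs must contain the same number of samples")
    if not q_values:
        raise ValueError("at least one sample is required")

    # Assumption: clip negative Q-values to zero so every weight is non-negative.
    weights = [max(q, 0.0) for q in q_values]

    total = 0.0
    for e_true, e_pred, w in zip(eps_true, eps_pred, weights):
        # Simplified DDPM surrogate: mean squared error over action dimensions.
        per_sample = sum((a - b) ** 2 for a, b in zip(e_true, e_pred)) / len(e_true)
        total += w * per_sample

    # Average over the batch, including samples whose weight was clipped to zero.
    return total / len(q_values)


if __name__ == "__main__":
    eps_true = [[0.0, 0.0], [1.0, 1.0]]
    eps_pred = [[1.0, 1.0], [1.0, 0.0]]
    print(q_weighted_loss(eps_true, eps_pred, [2.0, -1.0]))
```

In this sketch, samples with higher Q-values contribute more to the loss, so the denoiser is pushed harder toward reproducing high-return actions, while samples with non-positive Q-values are ignored.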