MaxEnt-RL: Maximum Entropy Reinforcement Learning—an RL framework that maximizes both the expected reward and the entropy of the policy to encourage exploration.
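For concreteness, the per-state MaxEnt objective for a discrete policy can be sketched as below; the function name and signature are illustrative, not from the source:

```python
import numpy as np

def maxent_objective(probs, rewards, alpha):
    """Per-state MaxEnt objective: E_pi[r] + alpha * H(pi).

    probs   : action probabilities of the policy at this state
    rewards : per-action rewards
    alpha   : entropy temperature trading off reward vs. exploration
    """
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    entropy = -(probs * np.log(probs)).sum()  # Shannon entropy H(pi)
    return (probs * rewards).sum() + alpha * entropy
```

With alpha = 0 this reduces to the ordinary expected reward; larger alpha rewards more stochastic (exploratory) policies.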
Ornstein-Uhlenbeck (OU) process: A mean-reverting stochastic process used here as the forward noising process for the diffusion model.
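A minimal Euler-Maruyama simulation of the OU SDE dx = θ(μ − x) dt + σ dW illustrates the mean-reverting behavior; parameter names are illustrative:

```python
import numpy as np

def simulate_ou(x0, theta, mu, sigma, dt, n_steps, rng):
    """Euler-Maruyama discretization of dx = theta*(mu - x)*dt + sigma*dW."""
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        noise = rng.normal(0.0, np.sqrt(dt))  # dW ~ N(0, dt)
        x[t + 1] = x[t] + theta * (mu - x[t]) * dt + sigma * noise
    return x
```

The drift term θ(μ − x) pulls the state back toward the mean μ, while σ dW injects Gaussian noise; as a forward noising process it gradually forgets the initial state x0.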
Bellman backup operator: A recursive operator used to update the Q-function (the expected cumulative discounted return of taking an action in a state) based on the immediate reward and the value of the next state.
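A single application of the Bellman optimality backup to a tabular Q-table can be sketched as follows (names and signature are illustrative):

```python
import numpy as np

def bellman_backup(q, s, a, r, s_next, gamma, done):
    """One Bellman optimality backup: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    target = r if done else r + gamma * q[s_next].max()
    q = q.copy()
    q[s, a] = target
    return q
```

In deep RL the same target is used as a regression label for the Q-network rather than written into a table.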
ELBO: Evidence Lower Bound—a variational lower bound on the log-likelihood (or entropy in this context) used to make optimization tractable.
CrossQ: An off-policy RL algorithm that removes the target network by using batch renormalization to stabilize Q-learning.
Distributional RL: An RL approach that learns the full distribution of returns rather than just the expected value.
Score matching: A technique to learn the gradient of the log-probability density (the score function) of a data distribution.
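As a small worked example (assuming the common denoising variant of score matching, with illustrative function names): the score of a Gaussian N(μ, σ²) has the closed form −(x − μ)/σ², and the denoising-score-matching regression target for a sample perturbed with noise of scale σ is −(x_noisy − x_clean)/σ²:

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Analytic score (gradient of log-density) of N(mu, sigma^2)."""
    return -(x - mu) / sigma**2

def dsm_target(x_clean, x_noisy, sigma):
    """Denoising score matching target: the score of the Gaussian
    perturbation kernel N(x_noisy; x_clean, sigma^2), evaluated at x_noisy."""
    return -(x_noisy - x_clean) / sigma**2
```

A score model trained by regressing onto `dsm_target` over many perturbed samples learns the score of the noised data distribution without ever computing a normalizing constant.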
IQM: Interquartile Mean—a robust statistical aggregate metric that ignores the lowest and highest 25% of results to reduce the impact of outliers.
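A minimal sketch of the computation (function name illustrative; libraries such as `rliable` provide a canonical implementation):

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: the average of the middle 50% of sorted scores."""
    s = np.sort(np.asarray(scores, dtype=float))
    n = len(s)
    lo, hi = n // 4, n - n // 4  # drop the lowest and highest 25%
    return s[lo:hi].mean()
```

Unlike the mean, the IQM is insensitive to a few catastrophic or lucky runs; unlike the median, it still uses half the data.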
Batch Renormalization: A technique to make batch normalization effective for small or non-i.i.d. minibatches, used here to stabilize Q-learning without target networks.