CoT: Chain-of-Thought—a prompting or training strategy where the model generates intermediate reasoning steps before the final answer
UniGRPO: Unified Group Relative Policy Optimization—a proposed RL algorithm for diffusion models that estimates policy gradients by sampling diverse mask ratios
Masked Token Predictor: A model trained to predict the original identity of tokens that have been replaced with a special [MASK] token
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same input, removing the need for a separate value function critic
MAGVIT-v2: A specific image tokenizer that compresses images into discrete codes (tokens), enabling transformer-based modeling
KL divergence: Kullback-Leibler divergence, a statistical measure of how one probability distribution differs from a second, reference probability distribution
Diffusion Model: A generative model that learns to reverse a process of gradually adding noise (or masks) to data
SDXL: Stable Diffusion XL—a popular text-to-image generation model
Non-autoregressive: Generating all tokens (or groups of tokens) in parallel rather than one by one from left to right
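
To make the GRPO entry above concrete, the sketch below normalizes rewards within one group of outputs sampled for the same input, which is the step that stands in for a learned value-function critic. It is only an illustration under stated assumptions: the function name `group_relative_advantages`, the `eps` stabilizer, the use of the population standard deviation, and the example reward values are all hypothetical choices, not details taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per output: its reward standardized within the group.

    Assumptions (not from the source): population standard deviation,
    and a small eps so a group with identical rewards does not divide by zero.
    """
    mu = mean(rewards)        # group mean acts as the baseline
    sigma = pstdev(rewards)   # group spread sets the scale
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four outputs sampled for the same input.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Outputs that score above the group mean receive a positive advantage and those below receive a negative one, so the baseline comes from the group itself rather than from a separate critic network.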