UMM: Unified Multimodal Model—a single model handling both understanding (image-to-text) and generation (text-to-image) within shared parameters
RLVR: Reinforcement Learning with Verifiable Rewards, a training method where the reward comes from an automatic, checkable correctness test (often binary) rather than a learned scalar score
RLMT: Reinforcement Learning with Model-rewarded Thinking, a training method where the model generates a reasoning trace before its final answer and a learned reward model supplies the reward
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training without a separate value network
Endogenous Reprompting: A mechanism where the model uses its own internal understanding to rewrite prompts for itself, ensuring the new prompts match its own generative capabilities
Cognitive Gap: The gap between a model's strong performance on understanding tasks (e.g., VQA) and its weaker performance on generation tasks that draw on the same knowledge
Bradley-Terry model: A statistical model for estimating the probability that one item is preferred over another based on their scores
KL divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from a reference probability distribution
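
The three quantitative entries above (GRPO, Bradley-Terry model, KL divergence) can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this example. The sketch assumes the common formulations: rewards standardized by the group mean and population standard deviation for GRPO, log-strength scores passed through a logistic function for Bradley-Terry, and discrete distributions for KL divergence. It is not taken from any particular implementation.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization (assumed form): standardize each reward
    against the mean and standard deviation of its own sampled group,
    so no separate value network is needed as a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some variants may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def bradley_terry_prob(score_i, score_j):
    """Bradley-Terry probability that item i is preferred over item j,
    assuming the scores are log-strengths: sigmoid(score_i - score_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length
    lists of probabilities; terms with p_k == 0 contribute nothing.
    Assumes q_k > 0 wherever p_k > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


if __name__ == "__main__":
    # Four sampled outputs for one prompt, scored by a binary verifier.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Equal scores give no preference either way.
    print(bradley_terry_prob(0.0, 0.0))
    # KL is asymmetric: swapping the arguments changes the value.
    print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
    print(kl_divergence([0.9, 0.1], [0.5, 0.5]))
```

In the GRPO case, outputs that score above their group's average get positive advantages and those below get negative ones, which is what lets the group itself stand in for a learned baseline.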