Cascade RL: A two-stage reinforcement learning framework using offline RL for initial alignment and online RL for refinement
MPO: Mixed Preference Optimization—an offline RL algorithm combining preference, quality, and generation losses
GSPO: Group Sequence Policy Optimization—an online RL algorithm that refines the policy using self-generated rollouts
ViR: Visual Resolution Router—a module that dynamically selects the compression rate (resolution) for image patches
ViCO: Visual Consistency Learning—a training stage that integrates ViR by minimizing the divergence between high- and low-resolution outputs
DvD: Decoupled Vision-Language Deployment—an inference strategy placing the vision encoder and LLM on separate GPUs to maximize parallelism
MoE: Mixture-of-Experts—a model architecture where only a subset of the parameters (experts) is active for each token
NTP: Next Token Prediction—the standard autoregressive loss used in language model pre-training
SFT: Supervised Fine-Tuning—training on high-quality labeled data to align model behavior
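
Two of the entries above (NTP and MoE) describe mechanisms that are easy to show in a few lines. The following is a minimal illustrative sketch in plain Python, not code from the original work: the function names `ntp_loss` and `top_k_route` are hypothetical, and renormalizing the gate weights over the selected experts is an assumption (one common choice), not something the glossary specifies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ntp_loss(logits_per_step, target_ids):
    """NTP: mean negative log-likelihood of each target token.

    logits_per_step[t] holds the vocabulary logits predicted at step t,
    and target_ids[t] is the index of the token that actually follows.
    """
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


def top_k_route(router_logits, k):
    """MoE: choose the k highest-scoring experts for a single token.

    Returns {expert_index: gate_weight}. Weights are renormalized over
    the chosen experts (an assumption made for this sketch).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}
```

With four experts and `k = 2`, `top_k_route([2.0, 0.0, 1.0, -1.0], 2)` selects experts 0 and 2 and leaves the other two inactive for that token, which is the sense in which only a subset of parameters is used per token.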