RLVR: Reinforcement Learning with Verifiable Rewards—a training method where the model is rewarded based on objectively verifiable outcomes (like a correct answer) rather than human preference scores
GRPO: Group Relative Policy Optimization—an RL algorithm that eliminates the critic model by normalizing rewards within a group of outputs generated from the same input to estimate advantages
Omni-multimodal: Models capable of processing and integrating multiple modalities (text, audio, video, image) simultaneously
Cold Start: An initial phase of Supervised Fine-Tuning (SFT) on a small, high-quality dataset to give the model basic capabilities before RL training begins
UAR: Unweighted Average Recall—a metric that calculates the average recall across all classes, treating each class equally regardless of sample size
WAR: Weighted Average Recall—a metric that averages per-class recall weighted by the number of samples in each class; for single-label classification this equals overall accuracy
OOD: Out-of-Distribution—data that comes from a different distribution (e.g., different actors, setting, recording style) than the training data
KL-divergence: Kullback-Leibler divergence—a measure of how much one probability distribution differs from another (asymmetric, so not a true distance), used here to penalize the RL policy if it drifts too far from the reference model's policy
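
The UAR, WAR, and GRPO entries above each describe a small computation, which the following sketch makes concrete. It is illustrative only: the function names, the toy labels, and the toy rewards are hypothetical. Using the population standard deviation and a small `eps` in the advantage normalization is also an assumption, since implementations differ on both.

```python
from collections import Counter
from statistics import mean, pstdev


def recall_per_class(y_true, y_pred):
    """Return (recall per class, sample count per class) for classes in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / n for c, n in support.items()}
    return recalls, support


def uar(y_true, y_pred):
    """Unweighted Average Recall: plain mean of per-class recalls."""
    recalls, _ = recall_per_class(y_true, y_pred)
    return mean(recalls.values())


def war(y_true, y_pred):
    """Weighted Average Recall: per-class recalls weighted by class size."""
    recalls, support = recall_per_class(y_true, y_pred)
    total = sum(support.values())
    return sum(recalls[c] * support[c] for c in recalls) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one group of outputs
    sampled from the same input (assumes population std; eps avoids 0-division)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Imbalanced toy data: a model that always predicts the majority class.
y_true = ["happy"] * 8 + ["sad"] * 2
y_pred = ["happy"] * 10
print(uar(y_true, y_pred), war(y_true, y_pred))

# Verifiable 0/1 rewards for four outputs sampled from one prompt.
print(group_relative_advantages([1, 0, 0, 1]))
```

On the imbalanced toy data, WAR stays high because the majority class dominates the weighting, while UAR drops because the minority class has zero recall. The advantage function needs no critic model: each output is scored only against the other outputs in its group.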