GRPO: Group Relative Policy Optimization—an RL algorithm that scores each sampled output relative to the other outputs sampled for the same input and uses that relative score as the advantage, avoiding the need for a separate critic model
MLLM: Multimodal Large Language Model—an AI model capable of processing and generating information across multiple modalities like text, image, and audio
CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer
PPO: Proximal Policy Optimization—a standard RL algorithm that updates policies using a clipped objective function to ensure stability
RLHF: Reinforcement Learning from Human Feedback—aligning models using rewards derived from human preferences
aha moments: Instances where a model self-corrects its reasoning chain, revisiting initial assumptions to reach a correct conclusion
Qwen2.5-Omni: The base multimodal foundation model used in this paper, capable of end-to-end speech and text interaction
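
To make the GRPO and PPO entries above concrete, here is a minimal sketch of the two ideas they describe: a group-relative advantage and a clipped surrogate objective. It is an illustration under stated assumptions, not the paper's implementation. The function names, the use of population standard deviation, the `eps` stabilizer, and the clip range of 0.2 are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds the scalar rewards of several outputs sampled for the
    same input. No critic model is involved: the group mean acts as the
    baseline. Population std and `eps` are assumptions for this sketch.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped objective for a single sample.

    `ratio` is the new-policy probability divided by the old-policy
    probability. Taking the minimum of the clipped and unclipped terms
    removes any gain from moving the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four outputs sampled for one input: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A sample whose probability rose by 50% under the new policy.
objective = clipped_surrogate(ratio=1.5, advantage=advantages[0])
```

In this toy group the rewarded outputs get positive advantages and the unrewarded ones get negative advantages of equal size, and the clipping caps how much credit the first output can earn from a large probability increase.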