GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that normalizes rewards within a group of outputs sampled for the same prompt to estimate advantages without a learned value function
Cold Start: The initial phase of training, in which a model undergoes supervised fine-tuning (SFT) on high-quality data to establish a baseline capability before reinforcement learning
SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples to teach it specific behaviors or formats
Chain-of-Thought: A reasoning technique where the model generates intermediate steps before producing the final answer
Distilled-CoT: Chain-of-thought training data generated by a larger, more capable 'teacher' model (e.g., 32B) and used to teach a smaller 'student' model (e.g., 3B)
Aha Moment: A reflective pattern where a model seemingly pauses to re-evaluate its reasoning (e.g., 'Wait, let me check'), often associated with self-correction
Rejection Sampling: A data filtering method in which multiple responses are generated and only those whose final answer matches the ground truth are kept for training
KL divergence: A measure of how much a probability distribution differs from a reference distribution, used here as a penalty that keeps the RL-trained model from drifting too far from the original (reference) model
Effective Rank: A metric measuring the effective dimensionality of the matrix formed by the hidden states of the model, often correlated with the amount of knowledge encoded
Qwen2.5-VL: The base Vision-Language Model family used in this paper, capable of processing both text and images
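
Three of the terms above (GRPO, Rejection Sampling, KL divergence) describe small computations, which the following minimal Python sketch illustrates. It is not the paper's implementation: the function names, the use of the population standard deviation, the `eps` stabilizer, and the exact-string answer check are all assumptions made for illustration. The KL function also computes the textbook discrete definition, whereas RL training typically estimates the penalty per token from sampled outputs.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its own group.

    `rewards` holds one scalar reward per output sampled for the same prompt.
    Using the population std and adding `eps` are assumptions of this sketch.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rejection_sample(responses, ground_truth):
    """Keep only (reasoning, answer) pairs whose answer matches the ground truth.

    Exact string comparison after stripping whitespace is a simplification;
    real pipelines usually normalize or parse answers first.
    """
    target = ground_truth.strip()
    return [(cot, ans) for cot, ans in responses if ans.strip() == target]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct outputs receive positive advantages and the two incorrect ones receive negative advantages of equal magnitude. If every output in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.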