Aha moment: The point during RL training where a model spontaneously develops complex behaviors like self-reflection, error correction, and longer reasoning chains without explicit supervision
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates each output's advantage by comparing its reward against those of a group of outputs sampled for the same input, removing the need for a separate value-function critic
SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs; this paper finds it detrimental to emergent reasoning in this context
Reward hacking: When a model optimizes for the reward metric (e.g., length) in a way that violates the intent (e.g., generating gibberish to increase length)
CVBench: A vision-centric benchmark for evaluating 2D and 3D spatial reasoning capabilities
SAT: Spatial Aptitude Training, a VQA dataset with 218k examples, used here for training
VSR: Visual Spatial Reasoning benchmark
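
To make the GRPO entry above concrete, here is a minimal sketch of a group-relative advantage computation. It assumes the commonly described form in which each reward is normalized by the mean and standard deviation of its group. The function name, the `eps` stabilizer, and the choice of population standard deviation are illustrative assumptions, not details taken from this paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds the scalar rewards of several outputs sampled for the
    SAME input. No learned value function (critic) is involved: the group
    mean acts as the baseline.
    """
    mu = mean(rewards)
    # Assumption: population std; some implementations use the sample std.
    sigma = pstdev(rewards)
    # eps avoids division by zero when every output gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

In this sketch, outputs that score above the group mean get positive advantages and those below get negative ones. If every output in a group receives the same reward, all advantages are zero and that group contributes no learning signal.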