GSPO: Group Sequence Policy Optimization, an RL algorithm that computes importance ratios and applies clipping at the sequence level rather than the token level, used to stabilize chain-of-thought training
SFT: Supervised Fine-Tuning—training a model on labeled examples before applying reinforcement learning
mAM: mean Arithmetic Mean, a V-STAR summary metric that takes the arithmetic mean of a model's scores on the what (answer), when (temporal), and where (spatial) dimensions, averaged across reasoning chains
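mLGM: mean modified Logarithmic Geometric Mean, a V-STAR summary metric that combines the same what/when/where scores with a log-based geometric mean, so a weak score in any one dimension (such as poor spatio-temporal localization) lowers the result more sharply than in mAM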
IoU: Intersection over Union, the area of overlap between a predicted bounding box and the ground-truth box divided by the area of their union; applied to time intervals, the same ratio gives temporal IoU
CoT: Chain of Thought—intermediate reasoning steps generated by the model before the final answer
spatial collapse: A training failure mode in which the model fails to learn spatial localization because the spatial reward depends on temporal accuracy, which is low early in training, so spatial predictions receive little reward signal
V-STAR: A video reasoning benchmark designed to evaluate spatio-temporal grounding capabilities
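The GSPO entry can be made concrete with a minimal sketch, assuming the formulation from the original GSPO paper: group-normalized advantages and a length-normalized sequence likelihood ratio that is clipped once per response. The function names, the per-token log-probability inputs, the population standard deviation, and the `clip_eps` value are illustrative assumptions, not the training code used here.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, std_eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the estimator choice is an assumption
    return [(r - mu) / (sigma + std_eps) for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Length-normalized sequence-level importance ratio.

    Equals (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|), computed in log space
    as exp(mean over tokens of (log pi_new - log pi_old)).
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(new_logps_group, old_logps_group, rewards, clip_eps):
    """Clipped sequence-level surrogate objective (maximized; loss = -objective)."""
    advantages = group_advantages(rewards)
    terms = []
    for new, old, adv in zip(new_logps_group, old_logps_group, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        # One ratio per response: every token shares the same clip decision.
        terms.append(min(s * adv, s_clipped * adv))
    return sum(terms) / len(terms)


# Usage: two responses; the first gained +1.0 log-prob on every token.
old = [[-1.0, -2.0, -0.5], [-0.7, -1.3]]
new = [[0.0, -1.0, 0.5], [-0.7, -1.3]]
obj = gspo_objective(new, old, rewards=[1.0, 0.0], clip_eps=0.2)
print(obj)  # first term is clipped at 1.2 * advantage, so obj is close to 0.1
```

Because the ratio is averaged in log space over the whole response, one token with an extreme probability change moves the ratio only by a factor of exp(delta / |y|), which damps the per-token variance that token-level ratios are exposed to.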
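The IoU entry covers both boxes and time spans. A minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates and segments as `(start, end)` times; both conventions and the function names are assumptions for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a, b):
    """IoU of two time segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))      # overlap 1, union 7 -> 1/7
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))   # overlap 5, union 15 -> 1/3
```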
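The spatial-collapse entry can be illustrated numerically. The glossary does not give the reward formula, so the multiplicative coupling below (spatial IoU counted only in proportion to temporal IoU) is a hypothetical stand-in chosen to show the mechanism, not the reward actually used; the additive form is included only for contrast.

```python
def coupled_reward(t_iou, s_iou):
    # Hypothetical: spatial credit is scaled by temporal accuracy.
    return t_iou * s_iou


def decoupled_reward(t_iou, s_iou):
    # Hypothetical contrast: temporal and spatial terms are scored independently.
    return t_iou + s_iou


t_iou = 0.05  # low temporal IoU, as early in training
gain_coupled = coupled_reward(t_iou, 0.8) - coupled_reward(t_iou, 0.2)
gain_decoupled = decoupled_reward(t_iou, 0.8) - decoupled_reward(t_iou, 0.2)
print(gain_coupled, gain_decoupled)  # roughly 0.03 vs 0.6
```

Improving spatial IoU from 0.2 to 0.8 is worth about 0.03 reward under the coupled form but 0.6 under the additive one, so while temporal accuracy stays low the spatial predictions receive almost no learning signal.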