GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs generated from the same input to reduce variance
DGRPO: Difficulty-aware Group Relative Policy Optimization—the authors' proposed variant that scales rewards based on task-specific and sample-specific difficulty weights
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
SFT: Supervised Fine-Tuning—training the model on labeled data before applying reinforcement learning
IoU: Intersection over Union—a metric measuring the overlap between the predicted time range and the ground truth time range in temporal grounding
Temporal Grounding: The task of identifying the start and end timestamps, within a video, of an event described in text
Hallucination: The generation of factually incorrect information, or of details not present in the source content (here, the video)
Visual Toolbox: A set of external functions (e.g., video clipping) the model can invoke to process visual data during generation
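
Two of the entries above, GRPO and IoU, describe small computations, so a minimal sketch may help make them concrete. It is illustrative only and is not the authors' implementation. The function names, the use of the population standard deviation, and the small eps stabilizer are assumptions. The DGRPO difficulty weights are not specified in this glossary, so they are left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs for the same input.

    Each reward becomes (r - group mean) / (group std + eps).
    Assumption: population std and a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def temporal_iou(pred, gt):
    """Intersection over Union of two (start, end) time ranges in seconds."""
    # Overlap is zero when the two ranges do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Four sampled outputs for one input, two rewarded and two not.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Predicted range 2s to 6s against a ground truth of 4s to 8s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

In the usage example, the two ranges overlap for 2 seconds out of a combined span of 6 seconds, so the IoU is one third. Outputs rewarded above the group mean receive positive advantages and the rest receive negative ones.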