SFT: Supervised Fine-Tuning—training a model on labeled instruction-response pairs to establish basic behavior and formatting
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies using group-relative advantages from multiple sampled outputs, removing the need for a separate critic model
CoT: Chain-of-Thought—a reasoning technique where models generate intermediate steps before the final answer to improve accuracy
Test-Time Scaling (TTS): Techniques applied during inference (such as generating multiple answers and voting) to improve performance without changing model weights
Verifiable Rewards: Objective outcome measures (e.g., correct answer format, temporal intersection-over-union) used in RL instead of subjective human preferences
Temporal IoU (tIoU): Temporal Intersection over Union—a metric measuring the overlap between a predicted time segment and the ground truth segment in a video
Cold-start: An initial SFT phase using high-quality data to stabilize a model before applying Reinforcement Learning
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a clipped objective to keep each policy update small and training stable
DPO: Direct Preference Optimization—an alignment method that optimizes policies directly on preference pairs without an explicit reward model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to them
Hallucination: When a model generates plausible but incorrect information not supported by the video content
MCTS: Monte Carlo Tree Search—a decision-making algorithm that explores possible future steps to find optimal reasoning paths
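
As a rough illustration of how two of the terms above fit together, the sketch below computes tIoU as a verifiable reward for a group of sampled time segments and then turns those rewards into group-relative advantages in the spirit of GRPO. The function names, the sample segments, and the choice of population standard deviation for normalization are assumptions made for this example; actual implementations may differ in these details.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds.

    Assumes each segment is well-formed (start <= end).
    """
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each output is scored relative to the group mean, so no separate
    critic model is needed. Using the population standard deviation is
    an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: four sampled predictions for one video query.
ground_truth = (10.0, 20.0)
sampled_segments = [(10.0, 20.0), (15.0, 25.0), (30.0, 40.0), (12.0, 18.0)]

rewards = [temporal_iou(seg, ground_truth) for seg in sampled_segments]
advantages = group_relative_advantages(rewards)
```

Here the exact match receives the highest reward and the largest advantage, while the non-overlapping segment receives a reward of zero and a negative advantage.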