DPO: Direct Preference Optimization, an alignment method that optimizes a model to prefer 'chosen' responses over 'rejected' ones without a separate reward model
SFT: Supervised Fine-Tuning, i.e., training a model on high-quality instruction-response pairs so that it learns to follow user commands
Pre-SFT Alignment: The authors' proposed strategy of running DPO *before* SFT, contrary to the standard practice of running DPO after SFT
Temporal Perturbation: Deliberately corrupting video inputs (e.g., reversing clips, shuffling order) to create 'rejected' examples where the model's output is likely incorrect or confused
Curriculum Learning: A training strategy where the difficulty of tasks (in this case, the subtlety of perturbations) increases over time
SigLIP: A vision-language model (an image-text encoder trained with a sigmoid loss) used here to compute similarity between video frames so that redundant content can be filtered out
TransNetV2: A deep learning model specifically designed for detecting shot boundaries and scene transitions in videos
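
To make the Temporal Perturbation and Curriculum Learning entries concrete, here is a minimal sketch in Python. It is illustrative only and is not the authors' implementation: the function names, the window-swap perturbation, and the linear schedule that shrinks the swapped window as training progresses are all assumptions introduced for this example. Frames are represented as plain list items; in practice they would be decoded images or tensors.

```python
import random


def reverse_clip(frames):
    """Coarse perturbation: play the whole clip backwards."""
    return frames[::-1]


def swap_adjacent_windows(frames, window, start):
    """Swap frames[start:start+window] with the following `window` frames."""
    a, b, c = start, start + window, start + 2 * window
    if window < 1 or a < 0 or c > len(frames):
        raise ValueError("both windows must fit inside the clip")
    return frames[:a] + frames[b:c] + frames[a:b] + frames[c:]


def curriculum_window(num_frames, progress):
    """Hypothetical schedule: the swapped window shrinks from half the clip
    (easy to notice) at progress=0 to a single frame (subtle) at progress=1."""
    progress = min(max(progress, 0.0), 1.0)
    largest = num_frames // 2
    return max(1, round(largest * (1.0 - progress)))


def make_perturbed_clip(frames, progress, rng):
    """Return a temporally corrupted copy of `frames` for the current stage.

    Per the glossary, a 'rejected' response would then be obtained by running
    the model on this corrupted clip; that step is not shown here.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to perturb temporal order")
    window = curriculum_window(len(frames), progress)
    start = rng.randrange(0, len(frames) - 2 * window + 1)
    return swap_adjacent_windows(frames, window, start)


if __name__ == "__main__":
    clip = list(range(8))  # stand-in for 8 decoded frames
    rng = random.Random(0)
    print(reverse_clip(clip))
    print(make_perturbed_clip(clip, progress=0.0, rng=rng))  # coarse swap
    print(make_perturbed_clip(clip, progress=1.0, rng=rng))  # subtle swap
```

Under this assumed schedule, early training sees clips whose two halves are exchanged, while late training sees only a pair of neighbouring frames exchanged, which is one way to read "the subtlety of perturbations increases over time".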