VMLLM: Video Multimodal Large Language Model, an AI system that processes both video and text inputs to generate text
SFT: Supervised Fine-Tuning, training a model on labeled examples to adapt it to specific tasks
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that updates a policy by scoring each generated output relative to the other outputs sampled for the same prompt
InfoNCE: A contrastive loss function used to maximize agreement between positive pairs (e.g., related text and video) while minimizing agreement with negative pairs
transient event: A very short event (e.g., 1-2 seconds long) embedded within a much longer video, often carrying critical semantic meaning
Q-Former: A module that acts as a bridge between visual encoders and language models, compressing visual features into a fixed number of tokens
BLIP-2: A vision-language model architecture used here to compute semantic similarity between frames and text keywords for keyframe selection
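
The InfoNCE and GRPO entries above each rest on a small computation, sketched below in plain Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names, the temperature default of 0.07, and the mean/standard-deviation normalization of group rewards are all illustrative choices.

```python
import math


def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one anchor (e.g., a text query).

    `similarities` holds the anchor's similarity to every candidate
    (e.g., video clips); `positive_index` marks the matching one.
    The temperature default is an assumption for illustration.
    """
    logits = [s / temperature for s in similarities]
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of the positive under a softmax over candidates.
    return log_denominator - logits[positive_index]


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled output relative to its group.

    Assumes the common formulation: subtract the group mean reward and
    divide by the group standard deviation.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One text anchor compared against three clips; clip 0 is the match.
    print(info_nce([0.9, 0.1, 0.2], positive_index=0))
    # Four outputs sampled for one prompt, with binary rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

The loss shrinks as the positive pair's similarity rises above the negatives, and outputs that beat their group's average reward receive positive advantages while the rest receive negative ones.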