VideoLM: Video Language Model—an AI system that extends a Large Language Model to perceive and reason about video content
TTFT: Time-to-first-token—the latency between sending a request and receiving the first generated token; for VideoLMs it is dominated by processing the video input
GOP: Group of Pictures—a specific arrangement of frames in video compression (e.g., one I-frame followed by many P-frames)
I-frame: Intra-coded frame—a fully specified image in a video stream, serving as a reference point (like a JPEG)
P-frame: Predictive frame—a video frame encoded only as changes (motion/residuals) relative to a previous frame
Motion Vectors: Data in compressed video describing how blocks of pixels move from one frame to the next (a coarse, block-level approximation of optical flow)
Residuals: The difference between the predicted frame (the reference frame shifted by its motion vectors) and the actual target frame
Delta-tokens: The novel lightweight tokens proposed by this paper, representing the information in P-frames (motion + residuals)
SigLIP: Sigmoid Loss for Language Image Pre-training—a vision encoder used to extract features from images
Qwen2: A specific family of Large Language Models used here as the reasoning backbone
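
To make the relationship between I-frames, P-frames, motion vectors, and residuals concrete, the following is a minimal NumPy sketch of block-based motion compensation. It is an illustration under simplifying assumptions, not the codec's or the paper's implementation: frames are single-channel arrays whose sides are multiples of the block size, there is one integer (dy, dx) vector per block, and the function names (motion_compensate, encode_p_frame, decode_p_frame) are hypothetical.

```python
import numpy as np

def motion_compensate(reference, motion_vectors, block=4):
    """Predict a frame by copying each block from the reference frame,
    displaced by that block's (dy, dx) motion vector."""
    h, w = reference.shape
    predicted = np.zeros_like(reference)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = motion_vectors[by // block, bx // block]
            # Clamp the source block so it stays inside the reference frame.
            sy = int(np.clip(by + dy, 0, h - block))
            sx = int(np.clip(bx + dx, 0, w - block))
            predicted[by:by + block, bx:bx + block] = \
                reference[sy:sy + block, sx:sx + block]
    return predicted

def encode_p_frame(reference, target, motion_vectors, block=4):
    """Residual = actual target frame minus the motion-compensated prediction."""
    return target - motion_compensate(reference, motion_vectors, block)

def decode_p_frame(reference, motion_vectors, residual, block=4):
    """Reconstruct the target frame from the reference, vectors, and residual."""
    return motion_compensate(reference, motion_vectors, block) + residual

# Toy example: an 8x8 "I-frame" and a target that is the same image
# shifted one pixel to the right. int16 avoids uint8 wrap-around when
# subtracting frames.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(8, 8)).astype(np.int16)
target = np.roll(reference, 1, axis=1)

# One vector per 4x4 block, each pointing one pixel to the left in the reference.
mv = np.zeros((2, 2, 2), dtype=int)
mv[..., 1] = -1

residual = encode_p_frame(reference, target, mv)
decoded = decode_p_frame(reference, mv, residual)
```

Because the residual stores exactly what the motion-compensated prediction misses, decoding recovers the target frame whatever vectors are chosen; better vectors only make the residual smaller and cheaper to store.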