VLM: Vision-Language Model—an AI model that processes both images/video and text to generate text outputs
Grounding: The ability of a model to link textual concepts to specific pixels or timeframes in the visual input (e.g., bounding boxes or points)
SFT: Supervised Fine-Tuning—training a model on labeled input-output pairs to teach it to follow instructions
ViT: Vision Transformer, a neural network architecture that splits an image into fixed-size patches and processes them as a sequence of tokens
Message-tree: A data structure used during training where a single visual input is the root and multiple distinct QA pairs are branches; the tree is packed into one sequence, with masking that keeps each branch from attending to the others (preventing cross-contamination between QA pairs)
Packing: Combining multiple short training examples into a single long sequence to maximize GPU efficiency
Token-weighting: Assigning different loss weights to tokens from different tasks (e.g., lower weights for long captions) to balance learning
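J&F: The mean of region similarity J (the Jaccard index, i.e., intersection-over-union of predicted and ground-truth masks) and boundary accuracy F (an F-measure on mask contours); a standard metric for evaluating video object segmentation accuracy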
Elo Score: A comparative ranking system used here to measure human preference between model outputs
Distillation: Training a smaller student model using outputs from a larger, often proprietary, teacher model
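
To make the Message-tree entry concrete, the following is a minimal sketch of how such a mask could be built. It is an illustration under stated assumptions, not the method of any particular model: the function name message_tree_mask is hypothetical, the visual tokens are assumed to come first in the packed sequence, and attention is assumed to be causal. Each QA branch can see the shared root and its own earlier tokens, but never another branch.

```python
import numpy as np

def message_tree_mask(root_len, branch_lens):
    """Return a boolean mask where [i, j] is True if token i may attend to token j.

    Hypothetical helper: assumes the packed sequence is laid out as
    [root tokens][branch 1 tokens][branch 2 tokens]... with causal attention.
    """
    # Segment id per token: 0 for the shared visual root, 1..k for the QA branches.
    segments = [np.zeros(root_len, dtype=int)]
    segments += [np.full(n, i + 1) for i, n in enumerate(branch_lens)]
    seg = np.concatenate(segments)
    total = seg.shape[0]

    causal = np.tril(np.ones((total, total), dtype=bool))  # no looking ahead
    same_branch = seg[:, None] == seg[None, :]              # stay inside own branch
    to_root = seg[None, :] == 0                              # every token may see the root
    return causal & (same_branch | to_root)

# Usage: a 2-token root with two QA branches of 2 and 3 tokens.
mask = message_tree_mask(2, [2, 3])
```

In this layout, the rows for the second branch have True entries only in the root columns and in their own branch's columns, which is what keeps the QA pairs from contaminating one another while still sharing one copy of the visual input.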
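
The Token-weighting entry can be illustrated with a small sketch. The names weighted_token_loss, task_ids, and task_weights are hypothetical, the weight values are made up for the example, and normalizing by the sum of weights is an assumption rather than a documented detail of any specific training recipe.

```python
import numpy as np

def weighted_token_loss(token_nll, task_ids, task_weights):
    """Average per-token losses after scaling each token by its task's weight.

    Hypothetical helper: token_nll holds one negative log-likelihood per token,
    task_ids names the task each token came from, and task_weights maps a task
    name to its loss weight. Normalizing by the weight sum is an assumption.
    """
    nll = np.asarray(token_nll, dtype=float)
    w = np.array([task_weights[t] for t in task_ids], dtype=float)
    return float((w * nll).sum() / w.sum())

# Usage: caption tokens are down-weighted relative to QA tokens (illustrative values).
loss = weighted_token_loss(
    token_nll=[2.0, 2.0, 1.0, 1.0],
    task_ids=["caption", "caption", "qa", "qa"],
    task_weights={"caption": 0.5, "qa": 1.0},
)
```

With equal weights this reduces to the ordinary mean token loss; lowering the caption weight shrinks the contribution of long captions so that they do not dominate shorter tasks in the same batch.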