VTG: Video Temporal Grounding—locating precise video segments (start/end times) relevant to a text query
MoE: Mixture-of-Experts—a neural network architecture where different sub-networks (experts) are activated for different inputs
saliency score: A scalar value indicating how relevant or important a specific video segment is to the query
routing: The process of deciding which expert network processes a given input token
gating network: A small neural network that calculates probabilities to select which experts to use
auxiliary loss: An additional training objective used to guide the model towards desired behaviors (like load balancing) without being the primary goal
IoU: Intersection over Union—a metric measuring the overlap between the predicted time segment and the ground truth segment
mAP: mean Average Precision—a metric summarizing precision-recall curves, commonly used in detection tasks
CIDEr: Consensus-based Image Description Evaluation, a metric for evaluating image/video captioning quality based on n-gram consensus with human references
SODA_c: A metric for dense video captioning (the captioning score of the SODA framework) that evaluates how well the generated sequence of captions covers the story of the whole video
Video-LLM: A Large Language Model adapted to process video inputs, typically by projecting visual features into the LLM's token space
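
As an illustration of the IoU entry above, the following is a minimal sketch of temporal IoU between a predicted and a ground-truth segment. The function name `temporal_iou` and the (start, end) tuple convention in seconds are assumptions made for this example, not taken from any specific VTG codebase.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt

    # Overlap is the length of the shared interval, clamped at zero
    # when the two segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))

    # Union is the total covered length, counting the overlap only once.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Guard against zero-length segments.
    return intersection / union if union > 0 else 0.0


# Hypothetical example: the prediction overlaps the last 5 s of a 10 s
# ground-truth segment, giving 5 s of overlap over a 15 s union.
print(temporal_iou((10.0, 20.0), (15.0, 25.0)))
```

VTG benchmarks commonly report the fraction of queries whose prediction exceeds an IoU threshold such as 0.5 or 0.7; thresholding this value is enough to compute that kind of recall.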