MLLM: Multimodal Large Language Model, an AI system capable of processing and generating both text and visual data (images/videos).
MCQ: Multiple-Choice Question—a format where the model selects the correct answer from a list of options.
tIoU: Temporal Intersection over Union—a metric measuring the overlap between a predicted time interval and the ground-truth time interval.
Clue-Grounded: An evaluation approach where the model must identify the specific video segment (clue) that contains the information needed to answer the question.
White-box evaluation: An evaluation setting where the model is explicitly asked to output the timestamps of the relevant clue along with the answer.
Black-box evaluation: An evaluation setting that infers model reliability by comparing its performance on the full video versus its performance when given only the short clue clip.
CRR: Clue Recovery Rate—a metric measuring how well a model maintains its accuracy when processing the full long video compared to when it sees only the relevant clue clip.
Context Dilution: The phenomenon where a model's ability to retrieve relevant information degrades as the amount of irrelevant input (context length) increases.
Hallucination: In AI, the generation of plausible-sounding but incorrect or factually baseless information by a model.
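
As a rough illustration of the two metrics defined above, the sketch below computes tIoU for a pair of time intervals and a CRR-style ratio. The function names are hypothetical, intervals are assumed to be (start, end) pairs in seconds, and the CRR formula (full-video accuracy divided by clue-clip accuracy) is an assumption consistent with the definition above, not a formula quoted from the source.

```python
def tiou(pred, gt):
    """Temporal IoU of two (start, end) intervals, in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def clue_recovery_rate(acc_full_video, acc_clue_clip):
    """Assumed CRR: accuracy on the full video relative to accuracy on the clue clip."""
    if acc_clue_clip <= 0:
        raise ValueError("clue-clip accuracy must be positive")
    return acc_full_video / acc_clue_clip


if __name__ == "__main__":
    # Predicted clue at 10-20 s, ground truth at 15-25 s: 5 s overlap, 15 s union.
    print(tiou((10.0, 20.0), (15.0, 25.0)))
    # Illustrative accuracies only, not results from any benchmark.
    print(clue_recovery_rate(0.4, 0.8))
```

Under this reading, a CRR near 1 would mean the model loses little accuracy when the clue is buried in the full video, while a low CRR would point to context dilution.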