Faithfulness: Consistency between the model's output and the visual evidence present in the video
Factuality: Consistency between the model's output and verifiable world knowledge (e.g., history, physics, procedures)
RR: Resist Rate, the percentage of correct base predictions that remain correct after label-preserving perturbations (e.g., blur, noise)
TSS: Temporal Sensitivity Score, the percentage of correct base predictions that change when the video's temporal order is destroyed (shuffled or reversed)
Video-LLM: A large language model adapted for video input, typically combining a visual encoder with an LLM backbone
MI-FGSM: Momentum Iterative Fast Gradient Sign Method, an adversarial attack algorithm used here to generate visual noise
hallucination: Generated content that contradicts the provided evidence (a faithfulness error) or world knowledge (a factuality error)
temporal inertia: The tendency of a model to retain its original prediction even when the temporal evidence required for that prediction is destroyed
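
The RR and TSS definitions above both condition on the base prediction being correct, which is easy to get wrong in practice. The following is a minimal sketch of how the two scores could be computed from per-sample predictions. The function names, the argument layout, and the choice to return NaN when no base prediction is correct are assumptions made for illustration and are not taken from the source.

```python
import math
from typing import Sequence


def resist_rate(labels: Sequence[str],
                base_preds: Sequence[str],
                perturbed_preds: Sequence[str]) -> float:
    """Percentage of correct base predictions that stay correct after a
    label-preserving perturbation (hypothetical helper, not from the source)."""
    # Keep only samples the model answered correctly on the clean video.
    kept = [p == y
            for y, b, p in zip(labels, base_preds, perturbed_preds)
            if b == y]
    if not kept:
        return math.nan  # assumption: undefined when nothing was correct
    return 100.0 * sum(kept) / len(kept)


def temporal_sensitivity_score(labels: Sequence[str],
                               base_preds: Sequence[str],
                               shuffled_preds: Sequence[str]) -> float:
    """Percentage of correct base predictions that change once the temporal
    order is destroyed (hypothetical helper, not from the source)."""
    # Same conditioning as above: only correct base predictions count.
    changed = [s != b
               for y, b, s in zip(labels, base_preds, shuffled_preds)
               if b == y]
    if not changed:
        return math.nan  # assumption: undefined when nothing was correct
    return 100.0 * sum(changed) / len(changed)


# Toy data: five multiple-choice samples, four answered correctly at baseline.
labels = ["A", "B", "C", "D", "A"]
base = ["A", "B", "C", "D", "B"]
perturbed = ["A", "B", "C", "A", "A"]   # one correct answer flips under noise
shuffled = ["A", "A", "C", "D", "B"]    # one correct answer changes when shuffled

rr = resist_rate(labels, base, perturbed)
tss = temporal_sensitivity_score(labels, base, shuffled)
print(f"RR = {rr:.1f}%, TSS = {tss:.1f}%")
```

Under these definitions a high RR is desirable, while a low TSS on temporally dependent questions is what the glossary calls temporal inertia: the model keeps its answer even though the evidence for it has been destroyed.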