INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

📝 Paper Summary

Video-LLM Hallucination Benchmarking
Factuality and Faithfulness Evaluation

INFACT is a diagnostic benchmark that evaluates Video-LLM reliability by subjecting models to controlled visual degradations, evidence corruptions, and temporal interventions to measure stability and grounding beyond clean accuracy.

Core Problem

Existing Video-LLM benchmarks focus primarily on clean-setting accuracy and video-verifiable faithfulness, neglecting factuality errors (contradicting world knowledge) and shortcut learning where models ignore video evidence.

Why it matters:

High accuracy in clean settings often masks fragility; models may rely on language priors rather than actual video understanding
Factuality hallucinations (e.g., misstating historical events or contradicting physical laws) are under-explored compared to visual inconsistencies
Current benchmarks lack controlled perturbation protocols to distinguish true robust understanding from lucky guesses or static cue exploitation

Concrete Example: A Video-LLM might correctly answer a question about a procedure in a clean video but fail when subtitles are maliciously altered (e.g., subtitles say 'closing door' while the video shows 'opening'). It might also keep the same prediction after the video frames are shuffled, which suggests it was never tracking the temporal order in the first place.

Key Novelty

Diagnostic Benchmark with Induced Hallucination Modes

Introduces four evaluation modes: Base (clean), Visual Degradation (noise/blur), Evidence Corruption (misleading subtitles/adversarial noise), and Temporal Intervention (shuffling/reversal)
Distinguishes between Faithfulness (video-conflict) and Factuality (world-knowledge-conflict) with a fine-grained taxonomy covering static entities, dynamics, and external knowledge domains
Proposes specific reliability metrics: Resist Rate (RR) for stability under noise, and Temporal Sensitivity Score (TSS) to measure if models actually react to destroyed temporal logic

Architecture

The hierarchical taxonomy of the INFACT benchmark

Evaluation Highlights

Many open-source baselines exhibit near-zero Temporal Sensitivity Score (TSS) on factuality questions, indicating that their answers barely change when temporal order is destroyed
Evidence corruption (e.g., misleading subtitles) degrades model reliability significantly more than visual degradation (e.g., blur/noise)
Temporal intervention causes the largest performance degradation, revealing that high base accuracy often relies on static cues rather than temporal understanding

Breakthrough Assessment

8/10

Significant contribution by formalizing 'induced' hallucination modes and explicitly separating factuality from faithfulness. The finding that many open-source models show near-zero temporal sensitivity on factuality questions is a strong diagnostic insight.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering (VideoQA) under clean and perturbed conditions to detect hallucinations

Inputs: Video V and Question q

Outputs: Predicted answer (multiple choice or judgment)

Pipeline Flow

Data Curation (Mapping existing datasets to taxonomy + Synthetic generation)
Filtration (LLM-ensemble + Human-in-the-loop verification)
Induction (Applying 4 modes: Base, Visual Degradation, Evidence Corruption, Temporal Intervention)
Evaluation (Computing Accuracy, Resist Rate, and Temporal Sensitivity Score)

System Modules

Data Constructor

Curates 9,800 QA instances from existing benchmarks (MVBench, Video-MME, TOMATO, etc.) and synthesizes physical-anomaly videos with generative video models such as Sora and Wan2.5

Model or implementation: Sora, Wan2.5, Gemini Veo 3 (for synthetic generation)

Induction Engine (Evaluation)

Applies perturbations to videos based on the selected mode

Model or implementation: Various image processing tools + MI-FGSM for adversarial noise
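
As an illustration of what the Induction Engine does, the sketch below shows one visual degradation (additive Gaussian noise) and two temporal interventions (shuffling and reversal). It is a minimal toy, not the paper's implementation: the video representation (a list of frames, each a flat list of 8-bit pixel values) and the function names `add_gaussian_noise`, `shuffle_frames`, and `reverse_frames` are assumptions made for this example.

```python
import random
from typing import List

# Toy representation (an assumption for this sketch): a video is a list of
# frames, and each frame is a flat list of 8-bit pixel intensities.
Frame = List[int]
Video = List[Frame]


def add_gaussian_noise(video: Video, sigma: float = 10.0, seed: int = 0) -> Video:
    """Visual degradation: add zero-mean Gaussian noise, clamp to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in frame]
        for frame in video
    ]


def shuffle_frames(video: Video, seed: int = 0) -> Video:
    """Temporal intervention: return the frames in a random order."""
    rng = random.Random(seed)
    shuffled = list(video)  # copy so the input video is left untouched
    rng.shuffle(shuffled)
    return shuffled


def reverse_frames(video: Video) -> Video:
    """Temporal intervention: return the frames in reverse order."""
    return list(reversed(video))
```

Noise keeps the correct answer unchanged, so it pairs with a stability metric, whereas shuffling and reversal remove the temporal evidence the answer depends on. For very short clips a random shuffle can return the original order by chance, so a real pipeline would likely re-draw in that case.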

Evaluator (Evaluation)

Scores model outputs against ground truth and base predictions

Model or implementation: Deterministic scoring functions
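
The scoring step can be sketched as follows, based on the RR and TSS descriptions given in this summary: both are computed only over instances the model answers correctly in Base mode. The `Record` class and its field names are hypothetical, and the paper's exact scoring code may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One QA instance under one induced mode (field names are hypothetical)."""
    label: str         # ground-truth answer
    base_pred: str     # model prediction on the clean video
    induced_pred: str  # model prediction under the induced mode


def _base_correct(records: List[Record]) -> List[Record]:
    """Keep only instances the model answered correctly in Base mode."""
    return [r for r in records if r.base_pred == r.label]


def resist_rate(records: List[Record]) -> Optional[float]:
    """Fraction of base-correct instances still correct after a
    label-preserving perturbation. None if no instance is eligible."""
    eligible = _base_correct(records)
    if not eligible:
        return None
    return sum(r.induced_pred == r.label for r in eligible) / len(eligible)


def temporal_sensitivity(records: List[Record]) -> Optional[float]:
    """Fraction of base-correct instances whose prediction changes after
    the temporal order is destroyed. None if no instance is eligible."""
    eligible = _base_correct(records)
    if not eligible:
        return None
    return sum(r.induced_pred != r.base_pred for r in eligible) / len(eligible)
```

Returning `None` when no Base prediction is correct makes the dependence on Base accuracy explicit instead of silently reporting zero.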

Novel Architectural Elements

Four-mode evaluation protocol distinguishing between invariant-label perturbations (for stability) and label-destroying interventions (for sensitivity)
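
One way to picture the protocol is as a mapping from each mode to the metric that scores it. The sketch below is an assumption-laden illustration: the enum, the metric names, and in particular the assignment of Evidence Corruption to the label-invariant group are inferred from this summary rather than taken from the paper's code.

```python
from enum import Enum


class Mode(Enum):
    BASE = "base"
    VISUAL_DEGRADATION = "visual_degradation"
    EVIDENCE_CORRUPTION = "evidence_corruption"
    TEMPORAL_INTERVENTION = "temporal_intervention"


# Assumed pairing of modes and metrics (hypothetical metric names).
METRIC_FOR_MODE = {
    Mode.BASE: "accuracy",
    Mode.VISUAL_DEGRADATION: "resist_rate",
    Mode.EVIDENCE_CORRUPTION: "resist_rate",
    Mode.TEMPORAL_INTERVENTION: "temporal_sensitivity",
}


def expects_same_answer(mode: Mode) -> bool:
    """True for modes where the original label should still hold, so a
    reliable model keeps its answer; False when the intervention destroys
    the evidence and a grounded model is expected to react."""
    return mode is not Mode.TEMPORAL_INTERVENTION
```

The asymmetry is the point of the design: under label-invariant modes a changed answer is a failure, while under temporal intervention an unchanged answer is the warning sign.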

Comparison to Prior Work

vs. MVBench/Video-MME: INFACT evaluates under induced noise/corruption rather than just clean settings
vs. EventHallusion/MHBench: INFACT includes a comprehensive Factuality taxonomy (World Knowledge) alongside Faithfulness
vs. VidHalluc: INFACT uses explicit evidence corruption (subtitles, adversarial noise) and temporal intervention metrics (TSS)
vs. Simpson et al. (2024) [not cited in paper]: Simpson et al. focus on mis-localization in time; INFACT focuses on measuring resistance to corruption and sensitivity to order shuffling

Limitations

Reliability metrics (RR, TSS) are computed only over instances the model answers correctly in Base mode, so low Base accuracy shrinks the sample available for reliability analysis
Adversarial noise transferability might vary across target model architectures despite using an ensemble proxy
Factuality evaluation relies on the premise that the 'world knowledge' is static and verifiable, which can be challenging for rapidly evolving topics

Reproducibility

The paper does not explicitly provide a code URL or repository link in the text. It mentions using open-source models (InternVL3, Qwen3VL, Video-MAE) for proxy attacks. Data sources (MVBench, Video-MME, etc.) are public.

📊 Experiments & Results

Evaluation Setup

Evaluation of 14 representative Video-LLMs on 9,800 QA instances

Benchmarks:

INFACT (Video Question Answering; Faithfulness & Factuality) [New]

Metrics:

Base Accuracy
Resist Rate (RR) - for invariant perturbations
Temporal Sensitivity Score (TSS) - for temporal interventions

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
General findings on model reliability metrics across different perturbation modes.
INFACT (Factuality)	TSS (Temporal Sensitivity Score)	Not reported in the paper	near-zero	Not reported in the paper
Comparative analysis of perturbation types showing evidence corruption is more damaging than visual degradation.
INFACT	Impact Analysis	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

Higher accuracy in Base mode (clean) does not guarantee higher RR or TSS in the induced modes.
Evidence corruption (e.g., misleading subtitles) significantly reduces stability compared to simple visual noise.
Temporal intervention (shuffling) yields the largest degradation, exposing that models often ignore temporal structure even when they answer correctly in clean settings.
There is a pronounced 'temporal inertia' in factuality questions, where models stick to their priors regardless of video order.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Video-LLM architectures
Familiarity with hallucination types (faithfulness vs. factuality)
Basic knowledge of adversarial attacks (e.g., FGSM)

Key Terms

Faithfulness: Consistency between the model's output and the visual evidence present in the video

Factuality: Consistency between the model's output and verifiable world knowledge (e.g., history, physics, procedures)

RR: Resist Rate—measures the percentage of correct base predictions that remain correct after label-preserving perturbations (e.g., blur, noise)

TSS: Temporal Sensitivity Score—measures the percentage of correct base predictions that change when the video's temporal order is destroyed (shuffled/reversed)
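
The two definitions above can be written compactly as follows. The notation is introduced here for illustration and is not taken from the paper: $y_i$ is the ground-truth answer for instance $i$, $\hat{y}_i^{\text{base}}$ the Base prediction, $\hat{y}_i^{\text{pert}}$ the prediction under a label-preserving perturbation, and $\hat{y}_i^{\text{temp}}$ the prediction under a temporal intervention.

```latex
\mathcal{C} = \{\, i : \hat{y}_i^{\text{base}} = y_i \,\}

\mathrm{RR} = \frac{1}{|\mathcal{C}|} \sum_{i \in \mathcal{C}}
  \mathbb{1}\!\left[\hat{y}_i^{\text{pert}} = y_i\right]

\mathrm{TSS} = \frac{1}{|\mathcal{C}|} \sum_{i \in \mathcal{C}}
  \mathbb{1}\!\left[\hat{y}_i^{\text{temp}} \neq \hat{y}_i^{\text{base}}\right]
```

Because every instance in $\mathcal{C}$ satisfies $\hat{y}_i^{\text{base}} = y_i$, a changed prediction under temporal intervention is the same event as no longer matching the original label. Both scores are undefined when $\mathcal{C}$ is empty.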

Video-LLM: A large language model adapted to take video input, typically by pairing a visual encoder with an LLM backbone

MI-FGSM: Momentum Iterative Fast Gradient Sign Method, an iterative adversarial attack that accumulates gradient momentum across steps; used here to generate adversarial noise
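
To make the MI-FGSM update concrete, here is a minimal pure-Python sketch of the standard rule: accumulate the L1-normalized gradient into a momentum term, step along its sign, and project back into the epsilon ball around the clean input. The toy linear loss, the weights `w`, and the helper names are assumptions for illustration only; the paper applies the attack to video inputs through proxy models, which is not reproduced here.

```python
from typing import Callable, List

Vector = List[float]


def _sign(v: float) -> float:
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))


def mi_fgsm(x: Vector, grad_fn: Callable[[Vector], Vector],
            eps: float, steps: int = 10, mu: float = 1.0) -> Vector:
    """Maximize a loss within an L-infinity ball of radius eps around x.
    grad_fn returns the gradient of the loss with respect to the input.
    Clipping to a valid pixel range is omitted in this toy version."""
    alpha = eps / steps                  # per-step size
    momentum = [0.0] * len(x)
    adv = list(x)
    for _ in range(steps):
        grad = grad_fn(adv)
        l1 = sum(abs(g) for g in grad) or 1.0   # avoid division by zero
        momentum = [mu * m + g / l1 for m, g in zip(momentum, grad)]
        adv = [a + alpha * _sign(m) for a, m in zip(adv, momentum)]
        # Project back into the eps-ball around the clean input.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv


# Toy loss (an assumption): L(x) = -y * (w . x), so dL/dx = -y * w.
w = [1.0, -2.0, 0.5]
y = 1.0


def toy_grad(x: Vector) -> Vector:
    return [-y * wi for wi in w]


clean = [0.2, 0.4, 0.6]
adversarial = mi_fgsm(clean, toy_grad, eps=0.1)
```

In this toy case the gradient is constant, so every coordinate moves the full `eps` in the direction that lowers the classifier margin `w . x`. With a real model the gradient changes at each step, and the momentum term is what stabilizes the update direction.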

hallucination: Generated content that contradicts provided evidence (faithfulness) or world knowledge (factuality)

temporal inertia: The tendency of a model to retain its original prediction even when the temporal evidence required for that prediction is destroyed