Bridging Vision Language Models and Symbolic Grounding for Video Question Answering

📝 Paper Summary

Video Question Answering (VideoQA), Symbolic Grounding, Neuro-symbolic AI

SG-VLM enhances frozen Vision Language Models by generating and selecting symbolic scene graphs as intermediate reasoning steps to improve causal and temporal grounding in video QA.

Core Problem

Vision Language Models (VLMs) often rely on shallow correlations for VideoQA, leading to hallucinations and poor performance on tasks requiring multi-step temporal or causal reasoning.

Why it matters:

Current VLMs lack structural transparency and struggle to explicitly model object-centric interactions essential for complex questions.
Existing neuro-symbolic methods often require training separate heavy models or external tracking pipelines, making them computationally expensive and inflexible.
Long videos contain noise that distracts end-to-end models; intermediate grounding is needed to decompose reasoning.

Concrete Example: For a question 'Why does the brown cat watch the other cat?', a standard VLM might hallucinate based on background pixels. SG-VLM generates a graph containing (orange cat, watching, tabby cat) and (tabby cat, eating, food), enabling the answer 'waiting for its turn' by explicitly grounding the causal link.
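
The triples in this example can be held in a very small data structure. The following is a minimal sketch, assuming a (subject, relation, object) layout; the names Triple, serialize, and relations_of are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One scene-graph edge: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


# The two triples from the cat example above.
graph = [
    Triple("orange cat", "watching", "tabby cat"),
    Triple("tabby cat", "eating", "food"),
]


def serialize(triples: List[Triple]) -> str:
    """Render triples as text lines that could be placed in a VLM prompt."""
    return "\n".join(f"({t.subject}, {t.relation}, {t.obj})" for t in triples)


def relations_of(triples: List[Triple], entity: str) -> List[Triple]:
    """Return every triple in which the entity is the subject or the object."""
    return [t for t in triples if entity in (t.subject, t.obj)]


print(serialize(graph))
print(relations_of(graph, "tabby cat"))
```

Looking up "tabby cat" returns both triples, which is the chain a model would need to connect the watching to the eating.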

Key Novelty

Modular Symbolic Scene Graph Grounding via Prompting

Uses frozen VLMs to generate scene graphs (objects + relations) via prompting, rather than training specialized graph generation networks.
Introduces a query-aware selection mechanism that filters scene graphs to only those relevant to the question, reducing noise from irrelevant frames.
Systematically evaluates four integration strategies (Full, Selection, Temporal Extension, Summary) to determine how symbolic data best aids VLMs.

Architecture

The 3-stage pipeline: (1) Scene Graph Generation via prompting, (2) Query-Aware Selection, (3) Grounded Answer Generation.

Evaluation Highlights

Surpasses the ViperGPT baseline by 23.6 percentage points on NExT-QA (Temporal/Causal) with an InternVL-14B backbone.
Achieves 76.9% accuracy on iVQA with InternVL-14B, significantly outperforming InstructBLIP (53.8%).
Outperforms strong end-to-end baselines like SeViLA and Flamingo on causal reasoning benchmarks, though gains over very large VLMs (Qwen-32B) are sometimes limited.

Breakthrough Assessment

7/10

Solid systematic study of symbolic grounding for modern VLMs. Shows strong improvements on specific reasoning types (causal/temporal) but acknowledges limitations where end-to-end VLMs are already strong.

⚙️ Technical Details

Problem Definition

Setting: Video Question Answering where a model M takes video V and question Q to predict answer A.

Inputs: Video V = {v1, ..., vl} and natural language question Q.

Outputs: Answer A (either open-ended text or selection from candidates).
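
The setting can be written compactly. The following is a sketch in LaTeX; the sampled frame set F, the per-frame graphs G_i, the frame budget k, and the operator names Sample, SG, and Select are notation assumed here for illustration, not symbols taken from the paper.

```latex
\begin{align}
A &= M(V, Q) && \text{(VLM-only baseline)} \\
F &= \mathrm{Sample}(V, k), \qquad G_i = \mathrm{SG}(f_i) \ \text{ for each } f_i \in F \\
G_{\mathrm{sel}} &= \mathrm{Select}\big(\{G_i\}, Q\big) \\
A &= M(F, G_{\mathrm{sel}}, Q) && \text{(grounded answer)}
\end{align}
```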

Pipeline Flow

Frame Sampling (Difference-based)
Scene Graph Generation (Objects + Relations)
Scene Graph Selection (Query-aware)
Grounded Answer Generation
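
The four steps above can be sketched as one function that chains interchangeable stages. This is only an illustrative skeleton: run_pipeline and its four callable parameters are hypothetical names, and each stage is passed in as a plain function so that any sampler or frozen VLM wrapper could be plugged in.

```python
from typing import Any, Callable, List, Sequence, Tuple

Pair = Tuple[Any, Any]  # (frame, scene graph for that frame)


def run_pipeline(
    video_frames: Sequence[Any],
    question: str,
    sample_frames: Callable[[Sequence[Any]], Sequence[Any]],
    generate_graph: Callable[[Any], Any],
    select_graphs: Callable[[str, List[Pair]], List[Pair]],
    generate_answer: Callable[[str, List[Pair]], str],
) -> str:
    """Chain the four stages; every stage is a caller-supplied function."""
    frames = sample_frames(video_frames)             # 1. frame sampling
    graphs = [generate_graph(f) for f in frames]     # 2. one graph per frame
    pairs = list(zip(frames, graphs))
    selected = select_graphs(question, pairs)        # 3. query-aware selection
    return generate_answer(question, selected)       # 4. grounded answer


# Usage with trivial stand-in stages (no real model involved).
result = run_pipeline(
    ["f0", "f1", "f2", "f3"],
    "who eats?",
    sample_frames=lambda v: v[::2],
    generate_graph=lambda f: f"graph({f})",
    select_graphs=lambda q, pairs: pairs[:1],
    generate_answer=lambda q, pairs: f"{q} -> {pairs[0][1]}",
)
print(result)
```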

System Modules

Frame Sampler

Selects k representative frames based on visual differences to capture dynamics.

Model or implementation: Difference-based sampling algorithm
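
The summary does not spell out the sampling rule, so the following is a minimal sketch of one plausible difference-based scheme, not the authors' algorithm. It assumes each frame is a flat sequence of pixel intensities, always keeps the first frame, and then keeps the frames that changed most relative to their predecessor. The names frame_difference and sample_frames are hypothetical.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def sample_frames(frames: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of up to k frames, in temporal order."""
    if k <= 0 or not frames:
        return []
    if k >= len(frames):
        return list(range(len(frames)))
    # Score every frame after the first by its change from the previous frame.
    scores = [
        (frame_difference(frames[i - 1], frames[i]), i)
        for i in range(1, len(frames))
    ]
    # Largest change first; ties go to the earlier frame.
    scores.sort(key=lambda s: (-s[0], s[1]))
    chosen = [0] + [i for _, i in scores[: k - 1]]
    return sorted(chosen)


# Toy video: a large change at frame 3 and a smaller one at frame 5.
toy = [[0, 0], [0, 0], [0, 0], [10, 10], [10, 10], [12, 12]]
print(sample_frames(toy, 3))
```

On the toy input the function keeps frames 0, 3, and 5, skipping the static frames in between.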

Object Identifier (SG Generation)

Identifies main and contextual objects in frames.

Model or implementation: Qwen2.5-VL or InternVL (Frozen VLM)

Relation Extractor (SG Generation)

Determines spatial and action relationships between identified objects.

Model or implementation: GroundingDINO + SAM + Metric3Dv2 (Spatial); VLM Prompting (Action)
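
How the detector, segmenter, and depth outputs are turned into spatial relations is not described in this summary. The sketch below shows one simple possibility under stated assumptions: each object arrives with a pixel bounding box and a single representative depth value, and relations are derived by comparing box centers and depths. DetectedObject, spatial_relations, and the depth_margin threshold are all hypothetical, and no GroundingDINO, SAM, or Metric3Dv2 API is called.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    label: str
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    depth: float                            # representative distance from camera


def spatial_relations(
    a: DetectedObject, b: DetectedObject, depth_margin: float = 0.5
) -> List[Tuple[str, str, str]]:
    """Derive left/right and front/behind triples for the ordered pair (a, b)."""
    relations = []
    a_center = (a.box[0] + a.box[2]) / 2
    b_center = (b.box[0] + b.box[2]) / 2
    if a_center < b_center:
        relations.append((a.label, "left of", b.label))
    elif a_center > b_center:
        relations.append((a.label, "right of", b.label))
    # Only emit a depth relation when the gap exceeds the (assumed) margin.
    if a.depth + depth_margin < b.depth:
        relations.append((a.label, "in front of", b.label))
    elif a.depth - depth_margin > b.depth:
        relations.append((a.label, "behind", b.label))
    return relations


cat = DetectedObject("cat", (10, 40, 60, 90), depth=1.0)
bowl = DetectedObject("bowl", (100, 60, 140, 90), depth=2.0)
print(spatial_relations(cat, bowl))
```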

Graph Selector

Selects relevant frames/graphs based on the question to reduce noise.

Model or implementation: VLM Prompting
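
Prompt-based selection can be sketched as building one prompt that lists the per-frame graphs and then parsing frame numbers out of the reply. The prompt wording, the function names, and the vlm callable (any function that maps a prompt string to a reply string) are assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
import re
from typing import Callable, Dict


def build_selection_prompt(question: str, graphs: Dict[int, str]) -> str:
    """List each frame's serialized graph, then ask which frames are relevant."""
    lines = [f"Frame {i}: {g}" for i, g in sorted(graphs.items())]
    return (
        f"Question: {question}\n"
        + "\n".join(lines)
        + "\nList the frame numbers whose scene graphs help answer the question."
    )


def select_graphs(
    question: str, graphs: Dict[int, str], vlm: Callable[[str], str]
) -> Dict[int, str]:
    """Keep only the graphs whose frame number appears in the model reply."""
    reply = vlm(build_selection_prompt(question, graphs))
    picked = {int(n) for n in re.findall(r"\d+", reply)}
    kept = {i: g for i, g in graphs.items() if i in picked}
    # If nothing usable was parsed, fall back to all graphs rather than none.
    return kept or dict(graphs)


graphs = {
    0: "(dog, on, sofa)",
    1: "(orange cat, watching, tabby cat)",
    2: "(tabby cat, eating, food)",
}
# A stand-in for a real model call, used only to show the data flow.
kept = select_graphs("Why does the cat watch?", graphs, lambda prompt: "Frames 1, 2")
print(kept)
```

Falling back to the full set when parsing fails is a design choice of this sketch: it degrades to the unfiltered setting instead of answering with no symbolic context.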

Answer Generator

Generates final answer using frames and selected scene graphs.

Model or implementation: Qwen2.5-VL or InternVL (Frozen VLM)

Novel Architectural Elements

Prompt-based modular Scene Graph generation using the VLM itself (no external SG training)
Query-aware Scene Graph selection module to filter irrelevant symbolic noise before reasoning

Modeling

Base Model: Qwen2.5-VL (7B, 32B) and InternVL (8B, 14B)

📊 Experiments & Results

Evaluation Setup

Zero-shot / Modular evaluation on standard VideoQA benchmarks.

Benchmarks:

NExT-QA: Temporal and Causal Reasoning (Multiple Choice)
iVQA: Human-Object Interaction (Open-ended/MC)
ActivityNet-QA: Long-form Video QA (Open-ended)

Metrics:

Accuracy (%)
GPT-based Answer Similarity (for ActivityNet-QA)

Key Results

Main comparison against state-of-the-art baselines on NExT-QA showing significant improvements over modular and some end-to-end methods.
Benchmark	Metric	Baseline	This Paper	Δ
NExT-QA	Accuracy	60.0	83.6	+23.6
NExT-QA	Accuracy	63.6	83.6	+20.0
iVQA	Accuracy	53.8	76.9	+23.1
ActivityNet-QA	Accuracy	35.2	52.7	+17.5
Ablation study of Scene Graph integration strategies compared to VLM-only baseline.
Benchmark	Metric	Baseline	This Paper	Δ
NExT-QA	Accuracy	78.4	77.5	-0.9
iVQA	Accuracy	72.0	75.7	+3.7
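
The Δ column is the reported accuracy minus the baseline accuracy, in percentage points. A short check of that arithmetic on the rows above (the variable and function names are illustrative only):

```python
# (benchmark, baseline accuracy, reported accuracy, reported delta)
rows = [
    ("NExT-QA", 60.0, 83.6, 23.6),
    ("NExT-QA", 63.6, 83.6, 20.0),
    ("iVQA", 53.8, 76.9, 23.1),
    ("ActivityNet-QA", 35.2, 52.7, 17.5),
    ("NExT-QA", 78.4, 77.5, -0.9),  # ablation row
    ("iVQA", 72.0, 75.7, 3.7),      # ablation row
]


def delta(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


# Collect any row whose recomputed delta disagrees with the reported one.
mismatches = [r for r in rows if delta(r[1], r[2]) != r[3]]
print(mismatches)
```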

Main Takeaways

Question-aware selection (FrameSel-SG) consistently outperforms using all scene graphs (Full-SG), indicating that reducing symbolic noise is critical.
Symbolic grounding provides the largest gains in settings requiring fine-grained interaction tracking (iVQA) compared to broad long-horizon tasks.
Object-only summaries (Summary-SG) are often competitive with full relation graphs, suggesting that reliable object detection is currently more valuable than noisy relation extraction.
While SG-VLM outperforms prior baselines like SeViLA/ViperGPT, it sometimes underperforms strong end-to-end VLMs (like Qwen-32B) on NExT-QA, indicating a 'ceiling effect' where VLMs already capture sufficient context.

📚 Prerequisite Knowledge

Prerequisites

Vision Language Models (VLMs)
Scene Graphs
Zero-shot Prompting

Key Terms

Scene Graph (SG): A structured representation of an image or video frame consisting of nodes (objects) and edges (relationships like 'holding' or 'next to').

VLM: Vision Language Model, a multimodal model that processes both images/video and text.

GroundingDINO: An open-set object detection model that finds objects in images based on text descriptions.

Segment Anything (SAM): A model that can generate segmentation masks for any object in an image.

NExT-QA: A VideoQA benchmark focused on explaining temporal and causal events (e.g., 'why', 'how').

iVQA: Instructional Video Question Answering dataset, used here to evaluate questions about human-object interactions.

ActivityNet-QA: A dataset for long-form video understanding with open-ended questions.

Prompting: Providing specific text instructions to a frozen large language model to guide its output without updating its weights.