โ† Back to Paper List

STAR: A Benchmark for Situated Reasoning in Real-World Videos

Bo Wu, Shoubin Yu
MIT-IBM Watson AI Lab, Shanghai Jiao Tong University
NeurIPS Datasets and Benchmarks (2021)
MM Reasoning Benchmark QA

๐Ÿ“ Paper Summary

Visual Reasoning Video Question Answering
STAR is a diagnostic benchmark for situated reasoning in real-world videos that evaluates systems on interaction, sequence, prediction, and feasibility questions using structured situation hypergraphs and functional programs.
Core Problem
Existing video reasoning models struggle to capture dynamic knowledge from real-world situations and perform logical reasoning, often relying on shortcuts between visual content and answers rather than true understanding.
Why it matters:
  • Reasoning in the real world is not divorced from situations; intelligent systems must capture present knowledge from surroundings to act feasibly.
  • Formal logic frameworks (e.g., situation calculus) are impractical for real scenarios due to the impossibility of defining all rules manually.
  • Current synthetic video benchmarks may not represent the complexity and noise of real-world daily activities.
Concrete Example: In a video where a person is holding a towel, a model might correctly identify the action 'holding towel' yet fail to predict 'wipe hands' as the likely next action, or fail to determine whether 'opening the door' is feasible in the current state. Humans make both judgments subconsciously.
Key Novelty
Situated Reasoning Benchmark (STAR)
  • Constructs a dataset grounded in real-world videos (Charades) but annotated with structured 'situation hypergraphs' that abstract entities, relations, and actions.
  • Generates four distinct question types (interaction, sequence, prediction, feasibility) via functional programs that map logic to the hypergraph structure.
  • Proposes a diagnostic Neuro-Symbolic Situated Reasoning (NS-SR) model that explicitly separates visual perception, situation abstraction, and symbolic reasoning.
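As a rough illustration of how a functional program can be executed over a situation hypergraph, here is a minimal Python sketch. It is not the STAR annotation schema or the NS-SR implementation: the `Hyperedge` class, the toy annotations, and the operation names (`filter_entity`, `filter_after`, `query_first_action`) are all hypothetical and chosen only to show the idea of answering a sequence question by stepping through filter and query operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperedge:
    """One action linking several entity nodes over a frame interval (hypothetical schema)."""
    action: str
    entities: tuple
    start: int
    end: int


# Toy situation, invented for illustration; not taken from the STAR annotations.
SITUATION = [
    Hyperedge("hold", ("person", "towel"), 0, 40),
    Hyperedge("open", ("person", "door"), 41, 70),
    Hyperedge("put_down", ("person", "towel"), 71, 90),
]


def filter_entity(edges, name):
    """Keep hyperedges that involve the named entity."""
    return [e for e in edges if name in e.entities]


def filter_after(edges, action):
    """Keep hyperedges that start after the first occurrence of `action` ends."""
    anchors = [e.end for e in edges if e.action == action]
    if not anchors:
        return []
    cutoff = min(anchors)
    return [e for e in edges if e.start > cutoff]


# Hypothetical operation vocabulary; the real benchmark defines its own.
OPS = {"filter_entity": filter_entity, "filter_after": filter_after}


def run_program(program, edges):
    """Apply each (operation, argument) step in order and return the queried answer."""
    state = list(edges)
    for op, arg in program:
        if op == "query_first_action":
            if not state:
                return None
            return min(state, key=lambda e: e.start).action
        state = OPS[op](state, arg)
    return None


# "What did the person do to the towel after opening the door?"
SEQUENCE_PROGRAM = [
    ("filter_after", "open"),
    ("filter_entity", "towel"),
    ("query_first_action", None),
]

if __name__ == "__main__":
    print(run_program(SEQUENCE_PROGRAM, SITUATION))
```

Because every step operates on explicit hyperedges, a wrong answer can be traced to the step that produced it, which is the kind of diagnosis the separation of perception, abstraction, and reasoning is meant to allow.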
Architecture
Figure 1
An overview of the Situated Reasoning framework. It shows a real-world video of a person interacting with objects, the abstraction into a Situation Hypergraph (nodes for Person, Towel, Door; hyperedges for actions), and the question answering process using a functional program.
Evaluation Highlights
  • State-of-the-art video QA models (e.g., ClipBERT) achieve low accuracy on STAR, particularly on Feasibility (39.23%) and Prediction (42.06%) questions.
  • The proposed diagnostic model (NS-SR) outperforms pure neural baselines, with roughly 15-20% higher accuracy on interaction and sequence tasks than standard QA models.
  • Human performance on the benchmark is high (about 92% on average), which shows how far current systems remain from human situated reasoning.
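Results like these are reported per question type, so a small Python sketch of that bookkeeping may help. It assumes each evaluation record is a hypothetical `(question_type, predicted, gold)` tuple and that the overall score is an unweighted mean over the four types; the summary does not say how STAR aggregates its average, so treat that choice as an assumption.

```python
from collections import defaultdict

QUESTION_TYPES = ("interaction", "sequence", "prediction", "feasibility")


def per_type_accuracy(records):
    """Return {question_type: accuracy} for every type that appears in records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == gold)
    return {t: correct[t] / total[t] for t in QUESTION_TYPES if total[t]}


def mean_accuracy(accuracy_by_type):
    """Unweighted mean over question types (an assumed aggregation, not confirmed by the source)."""
    return sum(accuracy_by_type.values()) / len(accuracy_by_type)


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy_records = [
        ("interaction", "hold towel", "hold towel"),
        ("interaction", "open door", "hold towel"),
        ("feasibility", "open door", "open door"),
    ]
    scores = per_type_accuracy(toy_records)
    print(scores, mean_accuracy(scores))
```

Averaging over types rather than over questions keeps a large category from hiding weak performance on a smaller one such as feasibility.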
Breakthrough Assessment
8/10
Significant contribution in bridging the gap between synthetic reasoning benchmarks and real-world video understanding. The structured hypergraph approach provides a rigorous diagnostic tool for neuro-symbolic methods.