Can Vision-Language Models Solve the Shell Game?

📝 Paper Summary

Visual Entity Tracking, Video Understanding Benchmarks, Reasoning with Vision-Language Models

Current VLMs fail at visual entity tracking when appearance shortcuts are removed, but they can solve the task by generating explicit spatiotemporal trajectories as intermediate reasoning steps.

Core Problem

State-of-the-art Video VLMs perform near random chance on visual entity tracking tasks (like the shell game) when static appearance cues are removed, revealing a fundamental inability to maintain object permanence over time.

Why it matters:

Existing benchmarks (e.g., Perception Test) inflate performance via visual shortcuts (e.g., distinct cups), masking deep deficits in fine-grained temporal perception
Visual entity tracking is a prerequisite for embodied AI and game-playing agents, yet fixed-depth transformers are theoretically limited in solving it without intermediate computation
Current models hallucinate motion events or collapse complex shuffling into coarse descriptions, failing to track identical objects through occlusion and position swaps

Concrete Example: In a shell game video where a ball is hidden under one of three identical cups that swap positions, Gemini-3-Pro correctly identifies the start but hallucinates a sequence of swaps that never occurred, resulting in a random final guess.

Key Novelty

Spatiotemporal Grounded Chain-of-Thought (SGCoT) for Visual Tracking

Diagnose tracking failures using VET-Bench, a synthetic benchmark with visually identical objects that forces reliance on motion continuity rather than appearance re-identification
Prove theoretically that visual entity tracking is NC1-complete, meaning fixed-depth transformers cannot solve it in general without intermediate reasoning steps (under the standard assumption that TC0 ≠ NC1)
Transform perception into reasoning by fine-tuning Molmo2 to explicitly output object coordinates at fixed timestamps (SGCoT) before predicting the final answer

Architecture

Overview of VET-Bench and the proposed SGCoT method compared to standard VLM outputs

Evaluation Highlights

Molmo2-SGCoT achieves >90% accuracy on VET-Bench, surpassing state-of-the-art models such as Gemini-3-Pro (~37%), which perform near random chance (33%)
Frontier models (Gemini-3-Pro, Qwen3-VL) drop to ~30-36% accuracy (near random guess of 33%) on a filtered subset of the Perception Test when visual shortcuts are removed
Direct-answer training fails: Qwen2.5-VL remains at random chance even after 60 epochs of supervision on VET-Bench without CoT

Breakthrough Assessment

9/10

Exposes a critical, masked failure mode in SOTA VLMs with a rigorous diagnostic benchmark and theoretical proof, then provides a highly effective solution that jumps from random chance to >90% accuracy.

⚙️ Technical Details

Problem Definition

Setting: Visual Entity Tracking (TRACK k): Tracking k visually indistinguishable objects in a video V over T frames.

Inputs: Video sequence V = {F_0, ..., F_T} and a query specifying a target object at t=0.

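Outputs: The terminal index π(i) of the target object in the final frame, where i is the target's initial index and π is the permutation of positions induced by the shuffles.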
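The problem definition above can be read as tracking one index through a sequence of position swaps. The following is a minimal sketch of that symbolic view only, assuming the swaps are already known; the function name `track_target` and the `(a, b)` swap encoding are illustrative assumptions, not notation from the paper. The hard part of the actual task is perceiving the swaps from video, which this sketch does not attempt.

```python
from typing import List, Tuple


def track_target(k: int, start: int, swaps: List[Tuple[int, int]]) -> int:
    """Return the final slot of the object that starts in slot `start`.

    Each swap (a, b) exchanges the objects currently in slots a and b.
    Hypothetical helper: the swap list stands in for what a model would
    have to perceive from the video frames.
    """
    if not 0 <= start < k:
        raise ValueError("start must be a valid slot index")
    position = start
    for a, b in swaps:
        if not (0 <= a < k and 0 <= b < k):
            raise ValueError("swap indices must be valid slot indices")
        if position == a:
            position = b
        elif position == b:
            position = a
    return position


if __name__ == "__main__":
    # Ball starts under cup 0; slots (0,1) swap, then (1,2), then (0,2).
    print(track_target(3, 0, [(0, 1), (1, 2), (0, 2)]))  # prints 0
```

Given the true swap sequence, the answer is a simple fold over the swaps, which is why a hallucinated or coarsened swap sequence leads to an effectively random final guess.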

Pipeline Flow

Visual Input Processing (Frame Sampling)
Spatiotemporal Grounded CoT Generation (Trajectory Prediction)
Final Answer Generation
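To illustrate the last two pipeline stages, here is a rough sketch of how a final answer could be read off a generated trajectory. The text format inside `<tracks>` (`time: (x, y)` pairs separated by semicolons), the slot centers, and both function names are assumptions made for this example; the summary does not specify Molmo2's actual output format, and in the paper the model itself produces the answer rather than a post-processing script.

```python
import re
from typing import List, Tuple

# Hypothetical point format: "<time>: (<x>, <y>)"
TRACK_POINT = re.compile(
    r"(\d+(?:\.\d+)?)\s*:\s*\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
)


def parse_tracks(text: str) -> List[Tuple[float, float, float]]:
    """Extract (time, x, y) points from a <tracks>...</tracks> block."""
    match = re.search(r"<tracks>(.*?)</tracks>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <tracks> block found")
    points = [
        (float(t), float(x), float(y))
        for t, x, y in TRACK_POINT.findall(match.group(1))
    ]
    if not points:
        raise ValueError("<tracks> block contains no points")
    return sorted(points)  # order by timestamp


def answer_from_tracks(text: str, slot_centers: List[Tuple[float, float]]) -> int:
    """Return the index of the slot nearest to the last tracked point."""
    _, x, y = parse_tracks(text)[-1]
    distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in slot_centers]
    return distances.index(min(distances))


if __name__ == "__main__":
    output = "<tracks>0.0: (20, 50); 0.5: (50, 40); 1.0: (80, 50)</tracks>"
    centers = [(20.0, 50.0), (50.0, 50.0), (80.0, 50.0)]  # left, middle, right
    print(answer_from_tracks(output, centers))  # prints 2
```

The point of the sketch is that once the trajectory is explicit, the final answer depends only on the last tracked position, so errors can be localized to the trajectory itself.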

System Modules

Visual Encoder

Process video frames into visual embeddings

Model or implementation: Molmo2 (Vision Encoder)

SGCoT Generator

Generate explicit trajectory of the target object across timestamps

Model or implementation: Molmo2 (Fine-tuned)

Answer Generator

Predict final position based on the generated trajectory

Model or implementation: Molmo2 (Fine-tuned)

Novel Architectural Elements

SGCoT Alignment: A fine-tuning strategy that forces the VLM to output dense spatiotemporal coordinates (tracking data) as an intermediate CoT step before answering
Integration of low-level point tracking (usually a tool output) directly into the autoregressive generation stream as reasoning

Modeling

Base Model: Molmo2 (Open-weights VLM)

Training Method: Supervised Fine-Tuning (Alignment) on text-only synthetic data

Objective Functions:

Purpose: Maintain tracking grounding while learning to infer the answer from the trajectory.

Formally: Standard cross-entropy loss on the Final Answer tokens, while masking loss on the synthesized trajectory tokens.
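As a sketch of what this masked objective computes, the snippet below averages token-level cross-entropy only over positions flagged as answer tokens. It is written in plain Python for clarity rather than efficiency; the function name, the `loss_mask` convention (1 = answer token, 0 = trajectory token), and the toy logits are assumptions for illustration and are not taken from the paper's training code.

```python
import math
from typing import Sequence


def masked_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    loss_mask: Sequence[int],
) -> float:
    """Mean cross-entropy over positions where loss_mask is 1.

    Positions with loss_mask 0 (here: trajectory tokens) contribute nothing.
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
        count += 1
    if count == 0:
        raise ValueError("loss_mask selects no tokens")
    return total / count


if __name__ == "__main__":
    # Two trajectory tokens (masked out) followed by one answer token.
    logits = [[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
    targets = [0, 1, 0]
    mask = [0, 0, 1]
    print(masked_cross_entropy(logits, targets, mask))
```

Because masked positions are skipped entirely, changing the logits at trajectory positions leaves the loss unchanged, which is the sense in which the trajectory tokens carry no loss.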

Training Data:

300 synthetic text-only samples
Each sample contains a synthesized <tracks> trajectory (generated by script) and the corresponding ground truth answer
No actual video data used during this alignment phase (leveraging Molmo2's pre-trained grounding)
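The following sketch shows one way a script could synthesize such text-only samples: simulate random swaps, record where the target sits after each one, and emit a `<tracks>` string plus the ground-truth answer. Everything beyond the `<tracks>` tag itself is assumed for illustration, including the point format, the one-point-per-swap timing, the fixed y coordinate, and the `make_sample` name; the paper's actual generation script may differ.

```python
import random
from typing import Dict


def make_sample(num_slots: int = 3, num_swaps: int = 4, seed: int = 0) -> Dict[str, object]:
    """Build one synthetic (trajectory text, answer) pair without any video."""
    rng = random.Random(seed)
    # Evenly spaced slot x-coordinates on a 0-100 scale (assumed layout).
    slot_x = [round(100 * (i + 1) / (num_slots + 1), 1) for i in range(num_slots)]
    slot = rng.randrange(num_slots)  # initial slot of the target
    points = [(0.0, slot_x[slot])]
    for step in range(1, num_swaps + 1):
        a, b = rng.sample(range(num_slots), 2)  # two distinct slots swap
        if slot == a:
            slot = b
        elif slot == b:
            slot = a
        points.append((float(step), slot_x[slot]))
    body = "; ".join(f"{t:.1f}: ({x}, 50.0)" for t, x in points)
    return {"tracks": f"<tracks>{body}</tracks>", "answer": slot, "slot_x": slot_x}


if __name__ == "__main__":
    sample = make_sample(seed=1)
    print(sample["tracks"])
    print("answer slot:", sample["answer"])
```

Seeding the generator makes each sample reproducible, so a dataset of 300 samples could be regenerated from a list of seeds.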

Key Hyperparameters:

epochs: 1
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Fine-tuning takes 3 minutes on a single A100 GPU

Comparison to Prior Work

vs. Perception Test: VET-Bench uses identical objects to enforce temporal tracking, removing appearance shortcuts
vs. VideoReasonBench: Requires implicit motion perception rather than reasoning over explicit visual symbols/arrows
vs. GCoT: Extends grounding to the temporal domain (trajectories) rather than just static spatial boxes
vs. MET-Bench [not cited in paper]: VET-Bench focuses specifically on video inputs and motion continuity, whereas MET-Bench includes static image/text tracking tasks

Limitations

SGCoT currently relies on the object being the only moving entity or requires unambiguous referring expressions; complex multi-object interactions might confuse the tracker
Analysis assumes strict localization and continuity conditions (no motion blur or complex occlusions beyond the shell game mechanics)
Performance depends heavily on the base model's (Molmo2) pre-trained point-tracking capabilities

Reproducibility

Code: https://vetbench.github.io

Code and data are available at https://vetbench.github.io. The method uses synthetic text-only data for alignment, which should make it straightforward to reproduce. The base model, Molmo2, has open weights.

📊 Experiments & Results

Evaluation Setup

Multiple-choice QA on video sequences involving object shuffling (Shell Game / Cards Game)

Benchmarks:

VET-Bench [New]: synthetic visual entity tracking
Perception Test (Filtered): real-world object tracking, filtered subset

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper
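For reference, top-1 accuracy and the random-chance level quoted throughout (33% for three choices) can be computed as in this short sketch. The function names and the toy predictions are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Sequence


def top1_accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted choice equals the answer."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def chance_level(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing over num_choices options."""
    if num_choices < 1:
        raise ValueError("num_choices must be positive")
    return 1.0 / num_choices


if __name__ == "__main__":
    preds = [0, 1, 2, 2]
    gold = [0, 1, 1, 2]
    print(top1_accuracy(preds, gold))  # prints 0.75
    print(round(chance_level(3), 2))   # prints 0.33
```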

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on VET-Bench shows that all baseline models perform near random chance (0.33), while Molmo2-SGCoT reaches 0.91 accuracy.
VET-Bench	Accuracy	0.37	0.91	+0.54
VET-Bench	Accuracy	0.35	0.91	+0.56
VET-Bench	Accuracy	0.34	0.91	+0.57
Analysis on the Perception Test shows performance drops drastically when visual shortcuts are removed.
Perception Test (Full)	Accuracy	0.33	0.80	+0.47
Perception Test (Filtered Subset)	Accuracy	0.33	0.31	-0.02

Experiment Figures

Bar chart comparing accuracy of various VLMs on VET-Bench Cup Game and Card Game

Accuracy vs. Swap Count and Object Count

Main Takeaways

Current VLMs rely on static appearance cues; when these are removed (VET-Bench or Filtered Perception Test), performance collapses to random chance.
Symbolic reasoning (standard CoT) is insufficient if the underlying perceptual grounding is flawed; models hallucinate swaps or misidentify moving objects.
Visual entity tracking is computationally hard (NC1-complete) for transformers without intermediate steps; direct-answer training fails to learn the task.
Explicitly generating spatiotemporal trajectories (SGCoT) bridges the gap between perception and reasoning, solving the task effectively.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Vision-Language Models (VLMs) and video processing
Understanding of Chain-of-Thought (CoT) prompting
Basic computational complexity theory (NC1, TC0)

Key Terms

NC1-complete: NC1 is the class of problems solvable by logarithmic-depth, polynomial-size Boolean circuits with bounded fan-in; an NC1-complete problem is among the hardest problems in that class. This implies that fixed-depth transformers (which are in TC0) cannot solve such problems in general without intermediate steps, unless TC0 = NC1

TC0: A complexity class containing problems solvable by constant-depth, polynomial-size circuits with majority (threshold) gates; standard fixed-depth transformers are theoretically limited to this class

Shell Game: A gambling game where a ball is hidden under one of three cups which are then shuffled; used here as a proxy for object permanence and tracking tasks

SGCoT: Spatiotemporal Grounded Chain-of-Thought—a reasoning strategy where the model generates explicit object coordinates and timestamps before the final answer

VET-Bench: Visual Entity Tracking Benchmark—a synthetic dataset designed to test tracking of visually identical objects, removing appearance-based shortcuts

Nyquist criterion: A sampling principle; applied here to video to ensure the frame rate is high enough (2 frames per swap) to resolve object motion unambiguously

Three-Card Monte: A card game similar to the shell game requiring the tracking of a specific card (e.g., Queen of Hearts) after shuffling

S5: The symmetric group of degree 5; its word problem is NC1-complete, serving as the basis for the paper's hardness proof
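To make the S5 word problem concrete, the sketch below composes a sequence of permutations of five elements and checks whether the product is the identity. It illustrates the word problem itself, not the paper's hardness reduction; the tuple encoding (`perm[i]` is where element `i` is sent) and the function names are assumptions for this example.

```python
from typing import Sequence, Tuple

Perm = Tuple[int, ...]


def compose(first: Perm, second: Perm) -> Perm:
    """Apply `first`, then `second`: result[i] = second[first[i]]."""
    return tuple(second[first[i]] for i in range(len(first)))


def word_is_identity(word: Sequence[Perm], n: int = 5) -> bool:
    """Decide whether the product of the permutations in `word` is the identity."""
    identity = tuple(range(n))
    current = identity
    for perm in word:
        if sorted(perm) != list(range(n)):
            raise ValueError("each element must be a permutation of 0..n-1")
        current = compose(current, perm)
    return current == identity


if __name__ == "__main__":
    swap01 = (1, 0, 2, 3, 4)  # exchange elements 0 and 1
    cycle = (1, 2, 3, 4, 0)   # shift every element by one position
    print(word_is_identity([swap01, swap01]))  # prints True
    print(word_is_identity([cycle] * 5))       # prints True
    print(word_is_identity([swap01, cycle]))   # prints False
    # Order matters: S5 is non-commutative.
    print(compose(swap01, cycle) == compose(cycle, swap01))  # prints False
```

The loop carries a running state from one permutation to the next, which is the kind of sequential intermediate computation that explicit reasoning steps supply.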