SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

📝 Paper Summary

Spatial reasoning in Vision Language Models (VLMs); perspective taking and mental rotation

SpinBench is a diagnostic benchmark that evaluates VLMs' spatial reasoning by decomposing perspective taking into progressively harder tasks like rotation, identity matching, and relative pose estimation.

Core Problem

Current VLMs demonstrate apparent spatial skills in end-to-end tasks, but it is unclear if they possess genuine geometric understanding or rely on shallow pattern matching and dataset biases.

Why it matters:

Failures in basic spatial primitives (like rotation or viewpoint change) undermine reliability in embodied applications like robotics, navigation, and physical commonsense reasoning
Existing benchmarks often entangle spatial reasoning with high-level planning or language, masking specific deficits in mental simulation or frame-of-reference handling
Prior work lacks controlled variation to distinguish between visual perception errors and linguistic reasoning failures

Concrete Example: In a dynamic rotation task, when a person turns left (from their own perspective), models often incorrectly predict 'right' because they default to the viewer's perspective, failing to switch frames of reference even when explicitly prompted.
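The frame switch in this example can be sketched in a few lines. This is an illustration only: the function name `switch_frame` and the two-way "facing the viewer or not" model are assumptions made for the sketch, not something taken from the paper.

```python
# Illustrative sketch: hypothetical helper, not the benchmark's code.
OPPOSITE = {"left": "right", "right": "left"}


def switch_frame(direction: str, facing_viewer: bool) -> str:
    """Convert a left/right direction between the person's frame and the viewer's.

    If the person faces the viewer, left and right are mirrored between the
    two frames; if the person faces the same way as the viewer, they coincide.
    The mapping is its own inverse, so it works in both directions.
    """
    if direction not in OPPOSITE:
        raise ValueError(f"expected 'left' or 'right', got {direction!r}")
    return OPPOSITE[direction] if facing_viewer else direction


# A person facing the camera turns to their own left.
seen_by_viewer = switch_frame("left", facing_viewer=True)   # apparent motion on screen
correct_answer = switch_frame(seen_by_viewer, facing_viewer=True)  # back in the person's frame
```

A model that reports `seen_by_viewer` when asked about the person's own perspective makes exactly the error described above: it skips the second conversion.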

Key Novelty

Cognitively Grounded Diagnostic Benchmark for Spatial Reasoning

Decomposes complex perspective-taking into seven atomic diagnostic categories (e.g., identity matching, mental rotation, dynamic translation) to pinpoint specific failure modes
Systematically controls for frame-of-reference (viewer-centric vs. object-centric) and applies logical augmentations (symmetry, syntax) to test reasoning consistency
Designed with a progressive structure where success on simpler single-object tasks is a prerequisite for complex multi-object scene reasoning

Architecture

Overview of SpinBench task design, illustrating the 7 diagnostic categories and examples of visual inputs for each.

Evaluation Highlights

Proprietary models like GPT-5 achieve high consistency (97.1%) and accuracy, but most open-source models perform near chance on mental rotation and perspective taking
Strong egocentric bias detected: models excel at viewer-centric rotation (e.g., 0.94 kappa for Gemini 2.5 Pro) but fall below chance on allocentric variants (-0.66 kappa), indicating systematically wrong answers rather than guessing
Human response time correlates negatively with VLM accuracy (r = -0.54): items that take humans longer to answer also tend to be harder for models, suggesting the benchmark captures spatial difficulty shared by humans and models
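For readers who want to reproduce this kind of analysis, the following is a minimal sketch of the Pearson and Spearman correlations, written in plain Python. The function names are hypothetical, and the numbers in the usage lines are made up for illustration; they are not data from the paper.

```python
import math
from typing import List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Made-up per-task values, for illustration only.
human_seconds = [2.0, 3.5, 5.0, 8.0]
model_accuracy = [0.95, 0.80, 0.70, 0.40]
r = pearson_r(human_seconds, model_accuracy)  # negative: slower for humans, harder for models
```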

Breakthrough Assessment

9/10

A rigorously designed diagnostic tool that exposes fundamental gaps in spatial intelligence. By isolating specific cognitive primitives like rotation and perspective taking, it moves beyond aggregate metrics to explain *why* models fail.

⚙️ Technical Details

Problem Definition

Setting: Multiple-choice Visual Question Answering (VQA) focusing on spatial transformations

Inputs: One or more images (single view, sequential frames, or multi-view candidates) and a natural language question defining a spatial task

Outputs: Selection of the correct option (A, B, C, D) corresponding to a spatial relation, identity, or view
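One way to picture a single benchmark item is as a small record holding the images, the question, the options, and the index of the correct option. The class and field names below are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import List

LETTERS = "ABCD"


@dataclass(frozen=True)
class SpatialVQAItem:
    """Hypothetical container for one multiple-choice spatial question."""

    image_paths: List[str]   # one or more images (single view, frames, or candidates)
    question: str            # natural language question defining the spatial task
    options: List[str]       # between two and four answer options
    answer_index: int        # position of the correct option in `options`

    def __post_init__(self) -> None:
        if not self.image_paths:
            raise ValueError("an item needs at least one image")
        if not 2 <= len(self.options) <= len(LETTERS):
            raise ValueError("expected between 2 and 4 options")
        if not 0 <= self.answer_index < len(self.options):
            raise ValueError("answer_index must point at one of the options")

    @property
    def answer_letter(self) -> str:
        """Letter (A-D) of the correct option."""
        return LETTERS[self.answer_index]


item = SpatialVQAItem(
    image_paths=["frame_0.png", "frame_1.png"],  # placeholder file names
    question="From the person's own perspective, which way did they turn?",
    options=["left", "right"],
    answer_index=0,
)
```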

Pipeline Flow

Task Sampling (from 7 categories)
Visual Input Processing (Single/Multi-image)
VLM Inference (Zero-shot or CoT)
Evaluation (Accuracy, Kappa, Consistency)
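The four steps above can be sketched as a short evaluation loop. Everything here is an assumption made for illustration: the dictionary keys, the prompt wording, and the `model` callable (any function that takes images and a prompt and returns a letter) are placeholders rather than the authors' harness.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

LETTERS = "ABCD"


def build_prompt(question: str, options: Sequence[str], use_cot: bool = False) -> str:
    """Format a question as zero-shot or Chain-of-Thought (wording is hypothetical)."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    if use_cot:
        lines.append("Think step by step, then give the final answer as a single letter.")
    else:
        lines.append("Answer with a single letter.")
    return "\n".join(lines)


def run_benchmark(
    items: List[dict],
    model: Callable[[List[str], str], str],
    per_category: int,
    use_cot: bool = False,
    seed: int = 0,
) -> Dict[str, float]:
    """Sample items per category, query the model, and return accuracy per category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)

    accuracy = {}
    for category, pool in by_category.items():
        sample = rng.sample(pool, min(per_category, len(pool)))  # task sampling
        correct = 0
        for item in sample:
            prompt = build_prompt(item["question"], item["options"], use_cot)
            prediction = model(item["images"], prompt)  # VLM inference
            correct += prediction == item["answer"]
        accuracy[category] = correct / len(sample)      # evaluation
    return accuracy
```

Passing the model as a plain callable keeps the loop independent of any particular API, so a stub such as `lambda images, prompt: "A"` is enough to exercise it.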

System Modules

Task Generator

Generates questions across 7 categories with controlled variations (symmetry, syntax, premise presence)

Model or implementation: Procedural generation scripts

VLM

Performs spatial reasoning on provided images and text prompts

Model or implementation: Various (e.g., GPT-4o, InternVL3, Qwen2-VL)

Novel Architectural Elements

Diagnostic decomposition of spatial reasoning into 7 progressive categories (Identity → Grounding → Dynamic → Rotation → Perspective Taking)
Controlled augmentation pipeline testing symmetry (left/right flips) and syntactic variation (question rephrasing) to measure reasoning consistency
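A left/right symmetry augmentation can be sketched as follows: mirror the image horizontally and swap the expected answer when it is a left/right label. This is a minimal sketch under assumptions; the image is modeled as a nested list of pixel values to keep the example dependency-free, and `mirror_item` is a hypothetical name, not the paper's pipeline.

```python
from typing import List, Tuple

SWAP = {"left": "right", "right": "left"}

Image = List[List[int]]  # rows of pixel values; a stand-in for a real image type


def mirror_item(pixels: Image, answer: str) -> Tuple[Image, str]:
    """Return the horizontally flipped image and the correspondingly swapped answer.

    Answers other than 'left' or 'right' are returned unchanged, since a
    horizontal flip does not affect them in this simplified model.
    """
    flipped = [list(reversed(row)) for row in pixels]
    return flipped, SWAP.get(answer, answer)


original = [[1, 2, 3],
            [4, 5, 6]]
mirrored, mirrored_answer = mirror_item(original, "left")
```

A model that reasons about the scene rather than memorizing label frequencies should change its answer on the mirrored item in the same way the ground truth changes.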

Modeling

Base Model: Evaluated 43 models including GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, InternVL family, Qwen-VL family

Comparison to Prior Work

vs. CLEVR: Uses photorealistic and real-world data (Faces, Cars, ABO) in addition to synthetic scenes
vs. BLINK: Focuses specifically on high-level spatial cognition (mental rotation, perspective) rather than general perception
vs. MindCube: Includes diverse object categories and multi-object scenes, not just single puzzles
vs. SpaCE-10: Introduces fine-grained control over frames of reference (allocentric vs. egocentric) and consistency checks via symmetry/syntax augmentations
vs. 3DSR-Bench [not cited in paper]: 3DSR-Bench focuses on 3D spatial relationships in point clouds/RGB-D; SpinBench focuses on inferring 3D spatial properties (rotation, perspective) from 2D RGB images

Limitations

Restricted to horizontal 2D plane variations; vertical relations and complex 3D trajectories are excluded
Evaluation relies on multiple-choice format, which may allow for guessing strategies (mitigated by Kappa metric)
Focuses on static or discrete step-based reasoning rather than continuous video-based spatial tracking
Does not evaluate physical interaction or manipulation capabilities directly

Reproducibility

Code: https://spinbench25.github.io/

📊 Experiments & Results

Evaluation Setup

Zero-shot visual question answering on the SpinBench dataset

Benchmarks:

SpinBench (Spatial Reasoning VQA) [New]

Metrics:

Raw Accuracy
Cohen's Kappa (chance-corrected accuracy)
Pairwise Consistency (percentage of consistent answers across symmetric/syntactic variations)
Statistical methodology: Pearson and Spearman correlations reported for analysis of human-model agreement and accuracy-consistency relationships
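The two non-trivial metrics can be sketched as below. Both functions rest on simplifying assumptions that are not confirmed by the paper: the kappa here treats chance as uniform guessing over `num_options` choices (the general Cohen's kappa uses the observed answer distributions instead), and the consistency function assumes each pair of answers has already been mapped to a common label space, for example by undoing a left/right swap.

```python
from typing import Iterable, Tuple


def chance_corrected_kappa(accuracy: float, num_options: int) -> float:
    """Kappa under a uniform-guessing baseline: (acc - 1/k) / (1 - 1/k)."""
    if num_options < 2:
        raise ValueError("need at least two options")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    chance = 1.0 / num_options
    return (accuracy - chance) / (1.0 - chance)


def pairwise_consistency(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of pairs whose two (normalized) answers agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    agree = sum(1 for first, second in pairs if first == second)
    return 100.0 * agree / len(pairs)


# With two options, 50% accuracy is pure chance and maps to kappa = 0.
kappa_at_chance = chance_corrected_kappa(0.5, num_options=2)
```

Under the same uniform two-option assumption, a kappa of 0.94 would correspond to roughly 97% accuracy and a kappa of -0.66 to roughly 17% accuracy, which is why a negative kappa signals systematic error rather than random guessing. The actual number of options in the Face Rotation task is not stated in this summary.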

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall model rankings show proprietary models leading, with a strong correlation between accuracy and consistency.
SpinBench	Accuracy	91.2	78.5	-12.7
SpinBench	Pairwise Consistency	Not reported in the paper	97.1	Not reported in the paper
Frame of reference analysis reveals severe egocentric bias in most models.
SpinBench (Face Rotation)	Cohen's Kappa	0.94	-0.66	-1.60
SpinBench (Face Rotation)	Cohen's Kappa	0.63	0.55	-0.08
Chain-of-Thought (CoT) prompting improves performance on complex transformation tasks but not on basic perception.
SpinBench (Perspective Taking)	Cohen's Kappa	0.000	0.600	+0.600

Experiment Figures

Heatmap of Cohen's Kappa scores for 43 models across all task categories, revealing performance clusters and difficulty gradients.

Main Takeaways

Performance hierarchy: Models perform best on static grounding, struggle with rotation, and fail most significantly on perspective taking
Egocentric bias: Models default to viewer-centric interpretations and struggle to adopt object-centric frames of reference (e.g., a person's 'left')
Consistency indicates competence: High-accuracy models also show high consistency across symmetric/syntactic variations; low consistency implies guessing
Reasoning vs. Perception: Premise-based tasks show that failures are not just visual; models often fail to reason spatially even when given the correct symbolic premises

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Spatial reasoning concepts (mental rotation, perspective taking)
Frames of reference (egocentric vs. allocentric)

Key Terms

perspective taking: The cognitive ability to reason about how a scene or object arrangement appears from a different viewpoint

mental rotation: The ability to mentally simulate the rotation of an object to understand its orientation in a new state

allocentric: Object-centered frame of reference (e.g., 'left of the car'), independent of the viewer's position

egocentric: Viewer-centered frame of reference (e.g., 'left of me'), dependent on the observer's viewpoint

canonical view: Standard, typical viewpoints of an object (e.g., front, side, back) that represent its identity most clearly

CoT: Chain-of-Thought, a prompting strategy where the model generates intermediate reasoning steps before the final answer

Cohen's kappa: A statistical metric (κ) that measures inter-rater agreement or accuracy while correcting for chance agreement, useful for multiple-choice tasks with varying option counts