
Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics

Wenrui Xu, D. Lyu, Weihang Wang, J. Feng, Chen Gao, Yong Li
School of Architecture, Department of Electronic Engineering, BNRist, Tsinghua University
Annual Meeting of the Association for Computational Linguistics (2025)
MM Benchmark Reasoning

📝 Paper Summary

Visual Language Models (VLMs), Spatial Intelligence, Psychometrics
This paper establishes a psychometric framework to evaluate five basic spatial abilities in VLMs, revealing that models significantly lag behind humans and lack dynamic 3D mental simulation capabilities.
Core Problem
Current VLM evaluations lack theoretical grounding, often testing isolated tasks without a comprehensive framework, and fail to benchmark against human performance hierarchies.
Why it matters:
  • Embodied AI applications such as visual navigation and robotics require human-like spatial understanding
  • Existing benchmarks often conflate spatial reasoning with other capabilities (e.g., planning) or omit critical skills like mental rotation
  • The gap between AI and human spatial cognition remains unquantified due to the lack of standardized psychometric comparisons
Concrete Example: A VLM might describe a static indoor scene correctly (spatial perception) but fail to identify which of four rotated 3D block figures matches a target figure (mental rotation), a task humans solve by mentally simulating the rotation.
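To make the mental rotation task concrete, here is a minimal sketch of what "matching a rotated block figure" means computationally. It is an illustration only, not the paper's test material or scoring code: the voxel representation, the helper names (`normalize`, `all_orientations`, `same_figure`), and the example figure are all assumptions introduced here. A figure is a set of unit-cube coordinates, and a candidate matches the target if some sequence of 90-degree rotations, followed by a translation, maps one onto the other. A mirror image of a chiral figure is the classic distractor, because no rotation can produce it.

```python
def rotate_x(p):
    """Rotate a point 90 degrees about the x-axis."""
    x, y, z = p
    return (x, -z, y)


def rotate_y(p):
    """Rotate a point 90 degrees about the y-axis."""
    x, y, z = p
    return (z, y, -x)


def rotate_z(p):
    """Rotate a point 90 degrees about the z-axis."""
    x, y, z = p
    return (-y, x, z)


def normalize(cubes):
    """Translate a figure so its minimum corner sits at the origin."""
    cubes = list(cubes)
    mx = min(c[0] for c in cubes)
    my = min(c[1] for c in cubes)
    mz = min(c[2] for c in cubes)
    return frozenset((x - mx, y - my, z - mz) for x, y, z in cubes)


def all_orientations(cubes):
    """Collect every orientation reachable by repeated 90-degree rotations."""
    seen = set()
    frontier = [normalize(cubes)]
    while frontier:
        shape = frontier.pop()
        if shape in seen:
            continue
        seen.add(shape)
        for rot in (rotate_x, rotate_y, rotate_z):
            frontier.append(normalize(rot(c) for c in shape))
    return seen


def same_figure(target, candidate):
    """True if the candidate is a rotated (and shifted) copy of the target."""
    return normalize(candidate) in all_orientations(target)


# Hypothetical four-cube figure with three mutually perpendicular steps (chiral).
target = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]

# A rotated and shifted copy of the target: a correct answer option.
rotated = [(x + 5, y - 2, z + 3)
           for x, y, z in (rotate_y(rotate_x(c)) for c in target)]

# The mirror image: looks similar, but no rotation maps the target onto it.
mirrored = [(-x, y, z) for x, y, z in target]

print(same_figure(target, rotated))   # True
print(same_figure(target, mirrored))  # False
```

The exhaustive search over orientations is trivial for a program with explicit 3D coordinates. The test instead presents 2D drawings of the figures, so the solver has to reconstruct and rotate the 3D structure internally.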
Key Novelty
Psychometric Basic Spatial Abilities (BSA) Framework for VLMs
  • Adapts Gardner's Theory of Multiple Intelligences to decompose VLM spatial intelligence into five distinct, measurable sub-skills (Perception, Relation, Orientation, Rotation, Visualization)
  • Benchmarks VLMs using nine standardized human psychometric tests (e.g., Mental Rotation Test, Paper Folding), enabling direct human-AI performance comparison
Evaluation Highlights
  • VLMs average 24.95% accuracy across spatial tasks, significantly underperforming the human average of 68.38%
  • Small models like Qwen2-VL-7B (30.82%) outperform larger models (e.g., InternVL2 at 19.6%), defying typical scaling laws for spatial tasks
  • Five-shot prompting improves accuracy by 25.9 percentage points, but the gains then plateau, suggesting fundamental architectural limits in dynamic simulation
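The size of these gaps is easier to judge with the arithmetic spelled out. The short sketch below uses only the figures quoted in the highlights above; the variable names are illustrative, and the derived quantities (absolute gap, VLM-to-human ratio) are computed here rather than reported in the summary.

```python
# Averages quoted in the highlights, in percent.
human_avg = 68.38
vlm_avg = 24.95
qwen_score = 30.82       # Qwen2-VL-7B, as quoted
internvl2_score = 19.6   # InternVL2, as quoted

# Absolute gap between humans and VLMs, in percentage points.
human_vlm_gap = round(human_avg - vlm_avg, 2)

# Share of the human average that VLMs reach.
vlm_to_human_ratio = round(vlm_avg / human_avg, 3)

# Spread between the two quoted models, in percentage points.
model_spread = round(qwen_score - internvl2_score, 2)

print(f"Human minus VLM average: {human_vlm_gap} points")       # 43.43
print(f"VLM average as share of human: {vlm_to_human_ratio}")   # 0.365
print(f"Qwen2-VL-7B minus InternVL2: {model_spread} points")    # 11.22
```

On these numbers, the average VLM reaches a little over a third of the human average, and the spread between the two quoted models (11.22 points) is about a quarter of the human-VLM gap.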
Breakthrough Assessment
8/10
Provides a much-needed theoretical foundation for spatial AI evaluation. The finding that larger models do not reliably score higher on spatial tasks, contrary to typical scaling laws, is a significant insight.