
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas J. Guibas, Fei Xia
Google DeepMind, Google Research
Computer Vision and Pattern Recognition (2024)
MM Reasoning Benchmark Pretraining

📝 Paper Summary

Vision-Language Models (VLMs), 3D Spatial Reasoning, Synthetic Data Generation
SpatialVLM enhances vision-language models with quantitative spatial reasoning capabilities by training on a massive synthetic dataset generated from 2D internet images lifted into 3D metric space.
Core Problem
Current Vision-Language Models (VLMs) excel at semantic tasks but struggle with 3D spatial reasoning, such as estimating metric distances or comparing object sizes, because their training data (image-caption pairs) lacks explicit 3D spatial information.
Why it matters:
  • Robotics applications require precise quantitative spatial understanding (e.g., 'can a 1-meter robot fit through this gap?'), which standard VLMs cannot provide
  • Humans make spatial judgments directly, without long chains of explicit mental computation; foundation models currently lack this capability
  • Lack of large-scale, high-quality 3D spatial VQA data limits the ability to train these capabilities directly
Concrete Example: When asked 'Can a 1-meter wide robot go through the path between the sofa and table?', a standard VLM like GPT-4V might refuse to answer or give a vague guess, whereas SpatialVLM estimates the path width at 1.56 m and confirms that the robot can pass.
Key Novelty
Automatic 3D Spatial VQA Data Generation Pipeline
  • Uses off-the-shelf vision experts (depth estimation, open-vocab detection, segmentation) to 'lift' 2D internet images into 3D point clouds with metric scale
  • Synthesizes 2 billion VQA pairs from 10 million images using templates based on the extracted 3D geometry (e.g., measuring distance between object centroids)
  • Trains a VLM (based on PaLM-E) on this synthetic data to learn direct spatial reasoning without requiring explicit 3D inputs at inference time
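The lifting-and-templating idea in the bullets above can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a pinhole camera with known intrinsics, a metric depth value per pixel, and a per-object pixel mask already produced by some detector and segmenter. All function names, the toy pixel data, and the question template are hypothetical.

```python
import math

def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into a 3D camera-frame point (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def object_centroid(mask_pixels, depth, intrinsics):
    """Centroid of the 3D points of one segmented object.

    mask_pixels: list of (u, v) pixels inside the object's mask (assumed given).
    depth: dict mapping (u, v) -> depth in meters (assumed given).
    intrinsics: (fx, fy, cx, cy), assumed known.
    """
    fx, fy, cx, cy = intrinsics
    pts = [backproject(u, v, depth[(u, v)], fx, fy, cx, cy) for u, v in mask_pixels]
    n = len(pts)
    return tuple(sum(p[i] for p in pts) / n for i in range(3))

def distance_qa(name_a, name_b, centroid_a, centroid_b):
    """Fill a (hypothetical) question/answer template from two object centroids."""
    d = math.dist(centroid_a, centroid_b)
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"The {name_a} and the {name_b} are about {d:.2f} meters apart."
    return question, answer, d

# Toy scene: two objects, two pixels each, all at 2 m depth.
intrinsics = (100.0, 100.0, 50.0, 50.0)
sofa_px = [(40, 50), (60, 50)]
table_px = [(190, 50), (210, 50)]
depth = {p: 2.0 for p in sofa_px + table_px}

sofa_c = object_centroid(sofa_px, depth, intrinsics)
table_c = object_centroid(table_px, depth, intrinsics)
question, answer, d = distance_qa("sofa", "table", sofa_c, table_c)
print(question)
print(answer)
```

In this toy scene the two centroids sit 3 m apart along the camera's x-axis. A real pipeline would also need to handle noisy depth, ambiguous object references, and many more question templates.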
Evaluation Highlights
  • Outperforms GPT-4V on quantitative spatial questions: SpatialVLM outputs valid numbers 99.0% of the time vs 1.0% for GPT-4V
  • Achieves 75.2% accuracy on qualitative spatial binary predicates (e.g., 'is A left of B?'), surpassing GPT-4V (68.0%) and LLaVA-1.5 (71.3%)
  • Estimates distances more accurately than the baselines: 37.2% of answers fall within [50%, 200%] of ground truth, compared to 0.0% for GPT-4V and 13.0% for LLaVA-1.5
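The two quantitative metrics above (how often an answer contains a number, and how often that number lies within [50%, 200%] of ground truth) can be sketched as follows. This is an illustrative scoring routine, not the paper's evaluation code. It assumes answers are free-form strings already expressed in meters, takes the first number found by a simple regular expression, and computes both fractions over all questions. Each of these choices is an assumption.

```python
import re

# First signed integer or decimal in the string (hypothetical parsing rule).
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")

def parse_number(answer):
    """Return the first number in the answer as a float, or None if there is none."""
    m = _NUMBER.search(answer)
    return float(m.group()) if m else None

def score(answers, ground_truth_m):
    """Return (fraction with a number, fraction within [0.5x, 2x] of ground truth)."""
    n = len(answers)
    parsed = [parse_number(a) for a in answers]
    has_number = sum(p is not None for p in parsed)
    in_range = sum(
        p is not None and 0.5 * gt <= p <= 2.0 * gt
        for p, gt in zip(parsed, ground_truth_m)
    )
    return has_number / n, in_range / n

# Made-up answers and ground-truth distances, for illustration only.
answers = [
    "about 1.56 meters",
    "I cannot determine the distance.",
    "5 meters",
    "0.4 m",
]
ground_truth_m = [1.5, 2.0, 2.0, 1.0]
print(score(answers, ground_truth_m))
```

A model that refuses to give a number scores zero on both metrics for that question, which is why the two figures are best read together.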
Breakthrough Assessment
8/10
Significant advance in unlocking quantitative spatial reasoning for VLMs using purely synthetic data from 2D images. Addresses a major blind spot of current SOTA models like GPT-4V.