Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

📝 Paper Summary

Spatial Understanding, Multi-Frame Reasoning, Robotics Perception

Multi-SpatialMLLM equips multi-modal models with multi-frame spatial reasoning capabilities by fine-tuning on a new 27-million-sample dataset (MultiSPA) automatically generated from 3D/4D scene scans.

Core Problem

Current MLLMs are trained primarily on single images and lack the spatial understanding required for robotics, failing to reason about depth, camera movement, or visual correspondence across multiple frames.

Why it matters:

Robotics and autonomous vehicles require understanding 3D space and motion from 2D video frames, not just static semantic description
Existing datasets rely on expensive manual annotation or noisy monocular estimators, limiting scale and quality
SOTA models (like GPT-4o) often hallucinate movement or confuse camera motion with object motion

Concrete Example: In a robotics scene where a camera moves around a static blue cube, GPT-4o incorrectly claims the cube moved 0.4 meters. Multi-SpatialMLLM correctly identifies that the cube is static and only the camera moved.

Key Novelty

MultiSPA Data Engine & Dataset

Leverages existing 3D (ScanNet) and 4D (TAPVid3D) datasets to automatically generate high-quality QA pairs without human annotation
Projects 3D point clouds into 2D image pairs to create ground-truth labels for depth, correspondence, and displacement vectors
Defines a comprehensive set of 5 spatial tasks (e.g., depth perception, visual correspondence) with diverse output formats (coordinates, vectors, scalars)
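
The projection step above can be illustrated with a minimal sketch, assuming a standard pinhole camera model and 4x4 world-to-camera extrinsics. The intrinsics, poses, and scene point below are made-up values, and the helper names `project` and `camera_center` are hypothetical; this is not the paper's data engine.

```python
import numpy as np

def project(point_world, K, T_world_to_cam):
    """Project a 3D world point to pixel (u, v) and depth with a pinhole model."""
    p_cam = (T_world_to_cam @ np.append(point_world, 1.0))[:3]  # world -> camera frame
    uvw = K @ p_cam
    return uvw[:2] / uvw[2], p_cam[2]

def camera_center(T_world_to_cam):
    """Recover the camera position in world coordinates from world-to-camera extrinsics."""
    R, t = T_world_to_cam[:3, :3], T_world_to_cam[:3, 3]
    return -R.T @ t

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])

# Frame i: camera at the world origin. Frame j: camera moved 0.5 m along +x.
T_i = np.eye(4)
T_j = np.eye(4)
T_j[0, 3] = -0.5  # world-to-camera translation for a camera centered at x = +0.5

point = np.array([0.0, 0.0, 2.0])  # static scene point, 2 m in front of camera i

uv_i, depth_i = project(point, K, T_i)   # pixel and depth label in frame i
uv_j, depth_j = project(point, K, T_j)   # corresponding pixel in frame j
cam_displacement = camera_center(T_j) - camera_center(T_i)  # camera movement label
```

Here the pixel pair (`uv_i`, `uv_j`) serves as a correspondence label, `depth_i` as a depth label, and `cam_displacement` as a camera movement vector, all derived from the same 3D point and poses without manual annotation.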

Architecture

Overview of Multi-SpatialMLLM capabilities and input/output formats. Shows the model accepting multiple image frames and diverse referencing (coordinates, dots) to produce outputs like vectors and scalars.

Evaluation Highlights

Achieves 56.11% average accuracy on the MultiSPA benchmark, outperforming GPT-4o (28.87%) and Gemini-2.0 (30.31%)
Attains 18.00% accuracy on the challenging camera movement vector prediction task, where baselines (GPT-4o, InternVL) score near 0%
Demonstrates strong zero-shot generalization on the external BLINK benchmark, improving Visual Correspondence accuracy from 39.0% (base model) to 89.5%

Breakthrough Assessment

8/10

Significant contribution in data engineering (27M samples) that unlocks a new capability (multi-frame spatial reasoning) in MLLMs, addressing a major gap for embodied AI.

⚙️ Technical Details

Problem Definition

Setting: Multi-frame spatial reasoning where a model inputs multiple images and a text query to predict spatial properties

Inputs: Pair of images (I_i, I_j) and a text prompt/question

Outputs: Spatial answers in various formats: Qualitative (text), Scalar (distance), Coordinate (pixel [u,v]), or Vector (3D displacement [x,y,z])
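
Because the four output formats arrive as generated text, an evaluation script has to recover the typed value first. The sketch below shows one way to do that, assuming answers use bracketed lists for coordinates and vectors; the exact answer strings and the `parse_answer` helper are hypothetical and not taken from the paper.

```python
import re

_NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"

def parse_answer(text):
    """Classify a generated answer as coordinate, vector, scalar, or qualitative."""
    bracket = re.search(r"\[([^\]]*)\]", text)
    if bracket:
        values = [float(v) for v in re.findall(_NUMBER, bracket.group(1))]
        if len(values) == 2:
            return "coordinate", values   # pixel [u, v]
        if len(values) == 3:
            return "vector", values       # 3D displacement [x, y, z]
    numbers = re.findall(_NUMBER, text)
    if len(numbers) == 1:
        return "scalar", float(numbers[0])  # e.g. a distance
    return "qualitative", text.strip()      # free-text answer

examples = [
    "[412, 230]",
    "[0.1, -0.2, 0.35]",
    "1.8 meters",
    "The camera moved left.",
]
parsed = [parse_answer(e) for e in examples]
```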

Pipeline Flow

Input Processing (Images + Text)
Vision Encoding (InternViT)
Projection (MLP)
LLM Inference (InternLM2)
Output Generation (Text/Coordinates/Vectors)

System Modules

Vision Encoder

Extract visual features from input frames

Model or implementation: InternViT-6B (from InternVL2)

Large Language Model

Process visual tokens and text instructions to generate spatial answers

Model or implementation: InternLM2-20B (for InternVL2-26B variant) or smaller variants

Modeling

Base Model: InternVL2-8B (primary), InternVL2-13B, InternVL2-26B

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Objective Functions:

Purpose: Standard language modeling.

Formally: Next-token prediction loss maximizing likelihood of ground truth text tokens.
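
For reference, the next-token objective can be written out as follows. The notation is an assumption introduced here: y_1..y_T are the ground-truth answer tokens, q is the text question, (I_i, I_j) is the image pair from the problem definition, and θ denotes the trainable parameters.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid y_{<t},\, I_i,\, I_j,\, q\right)
```

Minimizing this loss is equivalent to maximizing the likelihood of the ground-truth tokens.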

Adaptation: LoRA (rank=16)

Trainable Parameters: LLM backbone adapters (Image encoder and projector frozen)
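
A minimal NumPy sketch of a LoRA-adapted linear layer, assuming the common formulation in which a frozen weight W is augmented by a scaled low-rank product B·A. Only rank = 16 comes from the summary above; the layer size, `alpha`, and the initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 1024, 1024, 16, 32  # only rank=16 is from the summary

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialized

def lora_forward(x):
    """Frozen path plus scaled low-rank update: W x + (alpha / rank) * B (A x)."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size            # parameters in the frozen layer
lora_params = A.size + B.size   # parameters actually trained
```

With B initialized to zero, the adapted layer starts out identical to the frozen one, and in this toy layer the adapters hold about 3% of the parameters of W.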

Training Data:

MultiSPA Dataset: 27M+ samples generated from ScanNet (static scenes) and TAPVid3D (dynamic scenes)
Data includes 5 tasks: Depth Perception, Visual Correspondence, Camera Movement, Object Movement, Object Size
A 3M-sample subset is used for training due to compute constraints, mixed with 60K general VQA samples

Key Hyperparameters:

learning_rate: 4e-5
batch_size: 192
epochs: 1
optimizer: AdamW
scheduler: Cosine
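
To make the schedule concrete, here is a small sketch of a cosine learning-rate decay using the peak rate and batch size listed above. The step count assumes exactly 3,000,000 + 60,000 training samples for one epoch; the summary does not state a warmup length or a minimum rate, so `warmup_steps` and `min_lr` are hypothetical parameters defaulting to zero.

```python
import math

def cosine_lr(step, total_steps, peak_lr=4e-5, min_lr=0.0, warmup_steps=0):
    """Linear warmup (optional) followed by cosine decay from peak_lr to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One epoch at batch size 192 over the assumed sample count.
total_steps = (3_000_000 + 60_000) // 192

lr_start = cosine_lr(0, total_steps)
lr_end = cosine_lr(total_steps, total_steps)
```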

Compute: 24 nodes of 8x32G V100 GPUs (192 GPUs total), ~50 hours training time

Comparison to Prior Work

vs. SpatialVLM: Multi-SpatialMLLM handles multi-frame reasoning (camera/object motion) vs. single-frame spatial relationships
vs. SpatialRGPT: Supports point/coordinate referencing without requiring object masks; handles dynamic scenes
vs. SAT [not cited in paper]: SAT relies on simulated data, while Multi-SpatialMLLM uses real-world scans (ScanNet/TAPVid3D) to reduce sim-to-real gap

Limitations

Performance on quantitative vector estimation remains low (18%) despite improvements over baseline (0%)
Most experiments focus on two-view scenarios, though the pipeline supports N-frames
Requires massive compute (192 GPUs) for training, limiting accessibility for reproduction
Emergent capabilities (learning from 'Hard' samples) observed only in larger (26B) models

Reproducibility

Code: https://runsenxu.com/projects/Multi-SpatialMLLM

Code and data are publicly available. Training used a 3M subset of the full 27M dataset. Source datasets (ScanNet, TAPVid3D) are public.

📊 Experiments & Results

Evaluation Setup

Evaluation on MultiSPA Benchmark (7,800 held-out samples) and BLINK Benchmark (zero-shot)

Benchmarks:

MultiSPA Benchmark [New]: multi-frame spatial reasoning (5 subtasks)
BLINK: general multi-modal perception (Visual Correspondence, Relative Depth)

Metrics:

Accuracy (Qualitative/MCQ)
Accuracy (Scalar/Vector: within 20% L2 norm error)
Accuracy (Coordinates: within 5% image width)
Statistical methodology: Not explicitly reported in the paper
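
The threshold-based accuracy criteria above can be sketched as follows. This is one reading of the metrics, assuming the 20% tolerance is relative to the ground-truth L2 norm and the 5% tolerance is a Euclidean pixel distance relative to image width; the function names are hypothetical and the paper's exact definitions may differ.

```python
import numpy as np

def vector_correct(pred, gt, rel_tol=0.20):
    """True if the L2 error is within rel_tol of the ground-truth norm (scalars or vectors)."""
    pred = np.atleast_1d(np.asarray(pred, dtype=float))
    gt = np.atleast_1d(np.asarray(gt, dtype=float))
    return bool(np.linalg.norm(pred - gt) <= rel_tol * np.linalg.norm(gt))

def coordinate_correct(pred_uv, gt_uv, image_width, rel_tol=0.05):
    """True if the predicted pixel lies within rel_tol * image_width of the ground truth."""
    pred_uv = np.asarray(pred_uv, dtype=float)
    gt_uv = np.asarray(gt_uv, dtype=float)
    return bool(np.linalg.norm(pred_uv - gt_uv) <= rel_tol * image_width)

def accuracy(flags):
    """Percentage of samples counted as correct."""
    return 100.0 * sum(flags) / len(flags)

flags = [
    vector_correct([0.5, 0.0, 0.0], [0.45, 0.0, 0.0]),  # error 0.05 vs tolerance 0.09
    vector_correct([0.5, 0.0, 0.0], [0.3, 0.0, 0.0]),   # error 0.20 vs tolerance 0.06
    coordinate_correct([100, 100], [110, 100], 640),    # 10 px vs tolerance 32 px
    coordinate_correct([100, 100], [200, 100], 640),    # 100 px vs tolerance 32 px
]
score = accuracy(flags)
```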

Key Results

Multi-SpatialMLLM consistently outperforms base InternVL models and proprietary SOTA models across diverse spatial tasks in the MultiSPA benchmark.
Benchmark	Metric	Baseline	This Paper	Δ
MultiSPA	Average Accuracy	28.87	56.11	+27.24
MultiSPA (Camera Translation)	Vector Accuracy	0.00	18.00	+18.00
MultiSPA (Visual Correspondence)	Coordinate Accuracy	1.67	49.00	+47.33
MultiSPA (Object Movement)	Vector Accuracy	5.25	12.92	+7.67
BLINK	Visual Correspondence Accuracy	39.0	89.5	+50.5
MultiSPA (Camera Vector)	Accuracy	9.30	18.00	+8.70

Experiment Figures

Scalability of Multi-SpatialMLLM on the Camera Movement Vector task as training data size increases.

Qualitative demonstration on real-world robotics data (out-of-distribution) showing the model acting as a multi-frame reward annotator.

Main Takeaways

Current MLLMs (including GPT-4o) have near-zero capability in quantitative spatial reasoning (vectors, coordinates) without specific fine-tuning
The proposed data engine enables effective learning of 3D concepts from 2D images, generalizing to unseen benchmarks like BLINK
Scaling data size and model capacity consistently improves performance, with 26B models showing emergent capabilities on 'Hard' correspondence tasks where smaller models fail
Multi-task training provides synergistic benefits, where learning depth and correspondence helps improve performance on camera movement estimation

📚 Prerequisite Knowledge

Prerequisites

Multi-modal Large Language Models (MLLMs)
Structure-from-Motion (SfM) concepts
3D Coordinate Systems (Camera Extrinsics/Intrinsics)

Key Terms

MLLM: Multi-modal Large Language Model, a model that processes both text and images

Structure-from-Motion: A technique to reconstruct 3D structure and camera poses from a sequence of 2D images

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter layers

IoU: Intersection over Union, a metric used here to measure the overlap of visible 3D points between two 2D image frames

TAPVid3D: A dataset providing temporally aligned 3D point tracking, used as a source for generating object movement data

ScanNet: A large-scale dataset of 3D indoor scenes with reconstructed point clouds and camera poses

Visual Correspondence: The task of identifying which pixel in a second image corresponds to the same physical 3D point as a pixel in the first image
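
The IoU term defined above can be illustrated with a short sketch, assuming each frame is represented by the set of 3D point indices visible in it. The function name and the example indices are hypothetical.

```python
def visible_point_iou(visible_i, visible_j):
    """IoU between the sets of 3D point indices visible in two frames."""
    a, b = set(visible_i), set(visible_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two frames sharing points 3 and 4 out of six distinct visible points.
overlap = visible_point_iou([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 1 means the two frames see almost the same part of the scene, while a value near 0 means they share few or no 3D points.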