MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

📝 Paper Summary

3D Vision-Language Datasets, 3D Visual Grounding, 3D Question Answering, 3D Large Language Models (3D-LLMs)

MMScan is a large-scale multi-modal 3D scene dataset constructed via a top-down, human-in-the-loop pipeline to provide hierarchical language annotations for training and benchmarking 3D perception models.

Core Problem

Existing 3D multi-modal datasets are limited to object-level understanding or specific tasks, lacking hierarchical scene structures (regions, inter-object relations) and scalable, high-quality annotations.

Why it matters:

Current 3D-LLMs are constrained to object-level tasks and struggle with holistic scene understanding due to data limitations.
Rule-based or purely manual annotations in prior works are either limited in scope (spatial relations only) or biased/unscalable.
Purely VLM-generated annotations often lack correctness and fine-grained grounding, leading to suboptimal training for embodied agents.

Concrete Example: A 3D-LLM trained on existing datasets might identify a 'chair' but fail to answer 'Which chair in the living room is nearest to the dining table?' because it lacks hierarchical region-level context and inter-target relationship data.

Key Novelty

Top-Down Hierarchical 3D-Language Annotation Pipeline

Decomposes 3D scenes from global regions down to individual objects, capturing spatial and attribute information at multiple granularities.
Uses a hybrid annotation workflow where VLMs (like GPT-4 and CogVLM) initialize captions based on optimal 2D views, which are then rigorously corrected by humans.
Retains explicit correspondence between text phrases and 3D entities (objects/regions), enabling the generation of diverse benchmark samples (QA, Grounding) from a single set of meta-annotations.

Architecture

The human-in-the-loop annotation pipeline. It shows the process from data selection to VLM initialization (using specific prompts and views) and finally human correction using a custom UI.

Evaluation Highlights

Fine-tuning LLaVA-Next-Vicuna-7B on MMScan captions achieves state-of-the-art 64.6% Accuracy on ScanQA (val), surpassing the previous best of 57.6%.
Training on MMScan improves 3D visual grounding performance on the ScanRefer benchmark by +7.17 AP points over baselines (52.88 → 60.05).
Instruction tuning with MMScan data yields up to a +25.6-point accuracy improvement on the proposed MMScan-QA benchmark over base 3D-LLMs (19.3 → 44.9).

Breakthrough Assessment

9/10

Significantly scales up 3D-language data with a novel hierarchical approach. The large gains on downstream tasks (up to +25.6 accuracy points on MMScan-QA) and the release of a comprehensive benchmark make it a foundational resource.

⚙️ Technical Details

Problem Definition

Setting: 3D scene understanding involving Visual Grounding (VG) and Question Answering (QA)

Inputs: 3D scene point clouds/features and natural language queries

Outputs: Bounding boxes for specific targets (VG) or text responses to questions (QA)

Pipeline Flow

Data Preparation (EmbodiedScan)
Object-Level Annotation (VLM init + Human correction)
Region-Level Annotation (Segmentation + VLM init + Human correction)
Post-Processing (Generation of VG and QA samples)

System Modules

View Selector

Select optimal 2D image views of 3D objects/regions based on clarity (Laplacian variance) and visibility of surface points

Model or implementation: Rule-based selection
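
The clarity criterion above can be illustrated with a minimal sketch. It assumes grayscale views stored as NumPy arrays and a precomputed fraction of visible surface points per view. The function names `laplacian_variance` and `select_best_view`, and the choice to multiply the two scores, are hypothetical illustrations rather than the paper's exact rule.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour discrete Laplacian; higher means sharper."""
    g = gray.astype(np.float64)
    # Laplacian on interior pixels: sum of the 4 neighbours minus 4x the centre.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def select_best_view(views) -> int:
    """views: list of (gray_image, visible_fraction) pairs.

    Returns the index of the highest-scoring view. Scoring by
    clarity * visibility is an assumption made for this sketch.
    """
    scores = [laplacian_variance(img) * vis for img, vis in views]
    return int(np.argmax(scores))
```

A blurred or textureless view yields a Laplacian variance near zero, so it loses to a sharper view even when more of the object's surface points are visible in it.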

Object Captioner (Annotation)

Generate initial descriptions of object properties (shape, material, state)

Model or implementation: CogVLM / InternVL-Chat-V2 (ensemble)

Region Captioner (Annotation)

Generate descriptions for scene regions, anchoring objects with unique IDs

Model or implementation: GPT-4

Sample Generator

Convert meta-annotations into specific QA and VG benchmark samples

Model or implementation: ChatGPT
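
To make the conversion step concrete, here is a rough sketch of how one meta-annotation that retains object IDs could yield both a grounding sample and QA samples. The summary states that ChatGPT performs this conversion; the sketch swaps in fixed string templates purely for illustration, and `ObjectMeta`, its fields, and the sample dictionaries are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    # Hypothetical meta-annotation record for a single object.
    object_id: int
    category: str
    attributes: dict  # e.g. {"material": "wood"}

def make_vg_sample(obj: ObjectMeta) -> dict:
    """Build one visual grounding sample: text query -> target object IDs."""
    desc = ", ".join(f"{k} {v}" for k, v in sorted(obj.attributes.items()))
    return {
        "task": "VG",
        "text": f"Find the {obj.category} with {desc}.",
        "target_ids": [obj.object_id],
    }

def make_qa_samples(obj: ObjectMeta) -> list:
    """Build one QA sample per attribute, keeping the grounded object ID."""
    return [
        {
            "task": "QA",
            "question": f"What is the {k} of the {obj.category}?",
            "answer": str(v),
            "grounded_ids": [obj.object_id],
        }
        for k, v in sorted(obj.attributes.items())
    ]
```

Because the object ID travels with every generated sample, the same annotation can back both task types without re-annotating the scene.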

Novel Architectural Elements

Hierarchical top-down annotation logic: Region → Object → Attributes, unlike previous object-centric datasets
ID-anchored multi-view prompting: Overlaying 3D box IDs on 2D images to force VLMs (GPT-4) to ground descriptions to specific instances
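
The two ideas above can be sketched together: a Region → Object → Attributes structure, and a text prompt that lists object IDs so a caption can be mapped back to instances. This is an illustrative assumption only. The actual pipeline overlays box IDs on images, which is not reproduced here, and the prompt wording, the `<id>` tag syntax, and all class and function names are hypothetical.

```python
import re
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    object_id: int
    category: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Region:
    name: str
    objects: list = field(default_factory=list)

def build_region_prompt(region: Region) -> str:
    """List every object with its ID so the captioner can cite instances."""
    lines = [f"Describe the {region.name}. Refer to objects only by the IDs below."]
    for obj in sorted(region.objects, key=lambda o: o.object_id):
        lines.append(f"<{obj.object_id}> {obj.category}")
    return "\n".join(lines)

def referenced_ids(caption: str, region: Region) -> list:
    """Return IDs cited in the caption that actually exist in the region."""
    valid = {o.object_id for o in region.objects}
    found = {int(m) for m in re.findall(r"<(\d+)>", caption)}
    return sorted(found & valid)
```

Filtering cited IDs against the region's own objects is one simple way a correction step could flag captions that mention instances outside the region.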

Modeling

Base Model: LEO (for baselines) / LLaVA-Next-Vicuna-7B (for SOTA results)

Training Method: Instruction Tuning

Adaptation: Fine-tuning on MMScan captions and tasks

Trainable Parameters: Full model fine-tuning (implied for LLaVA-Next results)

Training Data:

1.4M meta-annotated captions
1.28M Visual Grounding samples
1.76M QA samples
97k grounded scene captions

Compute: Not reported in the paper

Comparison to Prior Work

vs. ScanRefer/ScanQA: MMScan is 10x larger and covers hierarchical scene context (regions, inter-object relations) rather than just object-level descriptions
vs. 3D-VisTA: MMScan uses a top-down human-in-the-loop pipeline for new annotations rather than aggregating/rewriting existing datasets
vs. SceneVerse: MMScan includes verified hierarchical language grounding (regions/objects) and human correction, ensuring higher quality than purely auto-generated datasets
vs. EmbodiedScan: MMScan adds the language modality to the geometry-focused EmbodiedScan

Limitations

Visual grounding performance on MMScan is lower than on existing benchmarks, indicating the high difficulty of the new dataset.
Region-region relationship pairs are excluded from the current dataset due to complexity.
Reliance on GPT-4/VLMs for initial annotation might still introduce subtle biases despite human correction.

Reproducibility

Code: https://github.com/OpenRobotLab/EmbodiedScan

Datasets, benchmarks, and code are publicly available at https://github.com/OpenRobotLab/EmbodiedScan. The paper details the prompts and VLM choices (CogVLM, InternVL, GPT-4) used for annotation generation. Evaluation baselines use existing models (LEO, 3D-LLM).

📊 Experiments & Results

Evaluation Setup

3D Visual Grounding and 3D Question Answering across multiple splits (Single-Target, Inter-Target)

Benchmarks:

MMScan-VG (3D Visual Grounding) [New]
MMScan-QA (3D Question Answering) [New]
ScanRefer (3D Visual Grounding)
ScanQA (3D Question Answering)

Metrics:

Accuracy@0.25 (for VG)
Accuracy@0.5 (for VG)
Exact Match (EM)
BLEU-4
ROUGE-L
METEOR
CIDEr

Statistical methodology: Not explicitly reported in the paper
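
The IoU-thresholded grounding metrics listed above (Accuracy@0.25 and Accuracy@0.5) can be sketched as follows. This is a simplified illustration that assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The benchmark's own evaluation may use oriented boxes, which this sketch does not handle, and the function names are hypothetical.

```python
import numpy as np

def iou_3d_axis_aligned(a, b) -> float:
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Overlap extent per axis; negative values mean no overlap on that axis.
    inter_dims = np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3])
    inter = float(np.prod(np.clip(inter_dims, 0.0, None)))
    vol_a = float(np.prod(a[3:] - a[:3]))
    vol_b = float(np.prod(b[3:] - b[:3]))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold=0.25) -> float:
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    hits = [iou_3d_axis_aligned(p, g) > threshold for p, g in zip(preds, gts)]
    return sum(hits) / len(hits) if hits else 0.0
```

For example, a unit cube shifted by half its width along one axis overlaps the original with IoU 1/3, so it counts as a hit at threshold 0.25 but a miss at 0.5.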

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ScanQA (val)	Accuracy (EM@1)	57.6	64.6	+7.0
ScanRefer	Acc@0.25	52.88	60.05	+7.17
MMScan-VG	Acc@0.25	0.0	30.07	+30.07
MMScan-QA	Accuracy	19.3	44.9	+25.6

Main Takeaways

Training with MMScan significantly boosts performance on existing benchmarks (ScanQA, ScanRefer), demonstrating the high quality and diversity of the annotations.
The new MMScan benchmarks are considerably more difficult than existing ones (e.g., lower absolute performance for current SOTA models), providing a challenging testbed for future research.
Hierarchical grounding (regions + objects) and instruction tuning are critical for enabling 3D-LLMs to handle complex spatial and attribute queries.

📚 Prerequisite Knowledge

Prerequisites

3D point cloud processing
Vision-Language Models (VLMs)
Visual Grounding concepts
Instruction tuning of Large Language Models (LLMs)

Key Terms

Visual Grounding (VG): The task of locating a specific object or region in a scene based on a natural language description

Meta-annotations: Comprehensive, hierarchical descriptions of scene elements (objects/regions) used as a source to generate diverse task-specific samples

VLM: Vision-Language Model—AI models capable of understanding and generating text based on visual inputs (images)

BEV: Bird's Eye View—a top-down perspective of a 3D scene

ScanRefer: A standard benchmark dataset for 3D object localization using natural language

ScanQA: A benchmark dataset for question answering in 3D scenes

LEO: An embodied generalist agent capable of 3D vision-language tasks

AP: Average Precision—a metric for object detection/grounding accuracy

Acc@0.25: Accuracy metric where a prediction is correct if the Intersection over Union (IoU) with ground truth is > 0.25

Instruction Tuning: Fine-tuning LLMs on datasets formatted as instructions and responses to improve their ability to follow tasks