3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

📝 Paper Summary

3D Vision-Language Modeling, Embodied AI, Instruction Tuning

3D-GRAND introduces a dataset of 6.2 million densely-grounded 3D-text pairs and a new hallucination benchmark, demonstrating that scaling up synthetic training data significantly improves 3D-LLM grounding and reduces hallucination.

Core Problem

Existing 3D-LLM datasets are small-scale, lack dense grounding (associating every noun phrase with 3D objects), and current benchmarks fail to systematically evaluate object hallucination in 3D models.

Why it matters:

Without dense grounding, robots and agents cannot reliably connect abstract language instructions to physical objects, leading to navigation and manipulation failures.
Scarcity of 3D-text pairs limits 3D-LLMs compared to their 2D counterparts, which benefit from billion-scale datasets.
Hallucination in 3D-LLMs is largely unexplored and unmeasured, undermining trust in embodied agents.

Concrete Example: In SceneVerse (prior work), a sentence like 'This is a big cotton sofa between the window and the table' is grounded only to the sofa. 3D-GRAND grounds 'sofa', 'window', and 'table' to specific 3D objects, preventing ambiguity.
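
The difference between sparse and dense grounding can be sketched as a small data structure. This is an illustration only: the `GroundedPhrase` class, the `ground` helper, and the object IDs (3, 7, 12) are hypothetical and do not reflect the dataset's actual annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedPhrase:
    text: str       # noun phrase as it appears in the sentence
    start: int      # character offset where the phrase begins
    end: int        # character offset one past the end of the phrase
    object_id: int  # hypothetical ID of the 3D object in the scene


def ground(sentence, phrase_to_id):
    """Attach an object ID to the first occurrence of each phrase."""
    spans = []
    for phrase, object_id in phrase_to_id.items():
        start = sentence.find(phrase)
        if start == -1:
            raise ValueError(f"phrase not in sentence: {phrase!r}")
        spans.append(GroundedPhrase(phrase, start, start + len(phrase), object_id))
    return sorted(spans, key=lambda span: span.start)


sentence = "This is a big cotton sofa between the window and the table"

# Sparse grounding: only the main subject is linked to an object.
sparse = ground(sentence, {"sofa": 3})

# Dense grounding: every noun phrase is linked to its own object.
dense = ground(sentence, {"sofa": 3, "window": 7, "table": 12})
```

With dense grounding, the anchors 'window' and 'table' are resolved to specific objects as well, so the spatial relation 'between' is no longer ambiguous when a scene contains several windows or tables.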

Key Novelty

Scaling Densely-Grounded Synthetic 3D Data

Leverage synthetic 3D scene generation pipelines (3D-FRONT, Structured3D) to create large-scale environments without expensive real-world scanning.
Use a pipeline involving 2D-LLMs (GPT-4V) and scene graphs to automatically generate 6.2 million instruction-following pairs where every noun phrase is linked to a specific 3D object ID.
Introduce a polling-based evaluation protocol (3D-POPE) specifically designed to probe whether 3D-LLMs hallucinate non-existent objects.
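
A polling-based protocol of this kind asks yes/no questions about object existence and scores the answers as binary classification. The sketch below shows how the reported metrics (Accuracy, Precision, Recall, F1 Score, Yes %) could be computed; the function name and input format are assumptions, not the benchmark's official evaluation code.

```python
def pope_metrics(answers, labels):
    """Score yes/no existence answers against ground truth.

    answers, labels: equal-length lists of bools, where True means
    "yes, the object exists in the scene".
    """
    if not labels or len(answers) != len(labels):
        raise ValueError("answers and labels must be non-empty and equal length")

    pairs = list(zip(answers, labels))
    tp = sum(1 for a, l in pairs if a and l)          # correct "yes"
    fp = sum(1 for a, l in pairs if a and not l)      # hallucinated object
    fn = sum(1 for a, l in pairs if not a and l)      # missed object
    tn = sum(1 for a, l in pairs if not a and not l)  # correct "no"

    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_rate": (tp + fp) / n,  # share of questions answered "yes"
    }
```

A model that answers "yes" to nearly every question will show a yes rate close to 100% and precision near the share of positive questions, which is why precision and yes rate are the most direct indicators of hallucination here.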

Architecture

The data generation pipeline for 3D-GRAND.

Evaluation Highlights

Outperforms the previous state-of-the-art 3D-LLM by 7.7 percentage points on Accuracy@0.25IoU on the ScanRefer benchmark (39.1 to 46.8), despite training only on synthetic data (zero-shot transfer to real ScanNet scenes).
Achieves 93.34% Precision on the 3D-POPE hallucination benchmark (Random sampling), significantly reducing object hallucination compared to baselines.
Data scaling analysis shows consistent performance improvements when increasing training data from 10% to 100% of the dataset.

Breakthrough Assessment

9/10

The dataset scale (6.2M densely grounded pairs) is a massive leap over prior work. The demonstration of effective sim-to-real transfer for 3D grounding is highly significant for the field.

⚙️ Technical Details

Problem Definition

Setting: 3D Vision-Language Understanding and Grounding

Inputs: 3D scene (point cloud/features), user text instruction/query

Outputs: Text response and/or 3D object identifiers (bounding boxes) corresponding to mentioned objects

Pipeline Flow

Data Generation Pipeline: Synthetic Scene → 2D Projection → GPT-4V Attribute Extraction → Scene Graph → GPT-4 Instruction Generation → Filtering
Inference Pipeline: 3D Scene + Text → Feature Extraction → LLM Processing → Text/Box Output

System Modules

Synthetic Scene Generator (Data Creation)

Source 40k+ 3D indoor scenes from 3D-FRONT and Structured3D

Model or implementation: 3D-FRONT / Structured3D assets

Annotation Generator (Data Creation)

Generate densely grounded text descriptions from scene graphs

Model or implementation: GPT-4 / GPT-4V

3D-LLM (Model)

Process 3D features and text to generate grounded responses

Model or implementation: Llama-2-7B (fine-tuned)

Novel Architectural Elements

Integration of a specialized 'hallucination filter' in the data generation pipeline to verify object existence before training
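
The summary does not describe how this filter is implemented. One plausible minimal form, shown purely as an assumption, drops any generated annotation that references an object ID absent from the scene; the `groundings` field and the function name are hypothetical.

```python
def filter_annotations(annotations, scene_object_ids):
    """Keep annotations whose grounded phrases all point to existing objects.

    annotations: list of dicts with a hypothetical "groundings" field,
        a list of (phrase, object_id) pairs.
    scene_object_ids: iterable of object IDs present in the scene.
    """
    valid_ids = set(scene_object_ids)
    kept = []
    for annotation in annotations:
        ids = [object_id for _, object_id in annotation["groundings"]]
        # Reject annotations with no groundings or with any unknown object.
        if ids and all(object_id in valid_ids for object_id in ids):
            kept.append(annotation)
    return kept


annotations = [
    {"text": "the sofa near the window", "groundings": [("sofa", 3), ("window", 7)]},
    {"text": "the piano by the door", "groundings": [("piano", 99), ("door", 5)]},
]
clean = filter_annotations(annotations, scene_object_ids=[3, 5, 7, 12])
```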

Modeling

Base Model: Llama-2

Training Method: Supervised Instruction Tuning with LoRA

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters (exact count not reported)
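
Although the exact count is not reported, the size of a LoRA adapter follows directly from its rank: each adapted weight matrix of shape d_out × d_in gains two low-rank factors with rank × (d_in + d_out) parameters in total. The sketch below applies that formula to assumed settings. The rank (16) and the choice of adapting two 4096 × 4096 projections in each of Llama-2-7B's 32 layers are illustrative assumptions, not values from the paper.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Assumed, illustrative configuration (not reported in the source).
HIDDEN_SIZE = 4096          # Llama-2-7B hidden dimension
NUM_LAYERS = 32             # Llama-2-7B transformer layers
ADAPTED_PER_LAYER = 2       # assumption: two square projections per layer
RANK = 16                   # assumption: LoRA rank

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
```

Under these assumptions the adapter adds roughly 8.4 million trainable parameters, a small fraction of the 7B base model.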

Training Data:

40,087 synthetic scenes
6.2 million densely-grounded instructions

Key Hyperparameters:

learning_rate: 2e-4
batch_size: 96
weight_decay: 0.01
epochs: Not reported (trained for 10k steps)

Compute: 12 NVIDIA A40 GPUs, approx. 48 hours for 10k steps
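
The reported figures allow a rough back-of-the-envelope check of the training budget. The sketch below assumes that the batch size of 96 is the global batch size and that one sample corresponds to one instruction pair; neither is stated in the summary, so the derived numbers are estimates only.

```python
NUM_GPUS = 12
WALL_CLOCK_HOURS = 48
TRAIN_STEPS = 10_000
BATCH_SIZE = 96            # assumption: global batch size, not per-GPU
DATASET_SIZE = 6_200_000   # densely-grounded instructions

gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS
seconds_per_step = WALL_CLOCK_HOURS * 3600 / TRAIN_STEPS
samples_seen = TRAIN_STEPS * BATCH_SIZE
epoch_fraction = samples_seen / DATASET_SIZE

print(f"GPU-hours: {gpu_hours}")
print(f"Seconds per step: {seconds_per_step:.2f}")
print(f"Samples seen: {samples_seen:,}")
print(f"Fraction of one epoch: {epoch_fraction:.1%}")
```

If the assumption holds, 10k steps cover well under one pass over the full dataset, which is consistent with the epoch count being reported in steps rather than epochs.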

Comparison to Prior Work

vs. SceneVerse: 3D-GRAND uses dense phrase-to-object grounding vs. sparse grounding.
vs. ScanRefer: 3D-GRAND is 100x larger and synthetic, enabling massive scaling.
vs. 3D-LLM: 3D-GRAND trains on synthetic data and transfers zero-shot to real scenes (ScanNet), whereas 3D-LLM trains on features extracted from real data.

Limitations

Relies on synthetic data (sim-to-real gap may still exist for complex textures/physics).
Annotation quality depends on GPT-4/GPT-4V performance (though human verification showed low error rates).
Inference depends on bounding box proposals from Mask3D, so objects the detector misses cannot be grounded.

Reproducibility

Code: https://github.com/3D-GRAND/3D-GRAND

Dataset (3D-GRAND) and benchmark (3D-POPE) are publicly available. The model builds on Llama-2 and Mask3D, both of which are publicly available.

📊 Experiments & Results

Evaluation Setup

Zero-shot transfer to real-world 3D scenes (ScanNet/ScanRefer) and hallucination probing.

Benchmarks:

ScanRefer (3D Visual Grounding)
3D-POPE (Object Hallucination Probing) [New]

Metrics:

Accuracy@0.25IoU
Accuracy@0.5IoU
Precision
Recall
F1 Score
Yes (%)

Statistical methodology: Not explicitly reported in the paper
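
The IoU-thresholded accuracy metrics listed above count a prediction as correct when its box overlaps the ground-truth box by at least the threshold. The sketch below assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax); the function names are illustrative and this is not the benchmark's official scoring code.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        intersection *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(a) + volume(b) - intersection
    return intersection / union


def accuracy_at_iou(predictions, ground_truths, threshold):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    if not ground_truths or len(predictions) != len(ground_truths):
        raise ValueError("inputs must be non-empty and equal length")
    hits = sum(
        1 for p, g in zip(predictions, ground_truths)
        if box_iou_3d(p, g) >= threshold
    )
    return hits / len(ground_truths)


gt = (0, 0, 0, 2, 2, 2)
shifted = (1, 0, 0, 3, 2, 2)   # overlaps half of gt, IoU = 4 / 12
exact = (0, 0, 0, 2, 2, 2)     # identical to gt, IoU = 1

acc_025 = accuracy_at_iou([shifted, exact], [gt, gt], threshold=0.25)
acc_050 = accuracy_at_iou([shifted, exact], [gt, gt], threshold=0.5)
```

The shifted box passes the 0.25 threshold but fails at 0.5, which illustrates why Accuracy@0.5IoU is the stricter of the two metrics.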

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ScanRefer	Acc@0.25	39.1	46.8	+7.7
3D-POPE (Random Sampling)	Precision	50.77	93.34	+42.57
3D-POPE (Random Sampling)	Accuracy	50.41	89.12	+38.71
3D-POPE	Precision	78.02	93.34	+15.32

Experiment Figures

Examples of the three task categories in 3D-GRAND: 3D-Grounded Object Reference, Scene Description, and QA.

Main Takeaways

Sim-to-Real Transfer: Models trained purely on large-scale synthetic data (3D-GRAND) can outperform models trained on real data (ScanNet) for grounding tasks.
Grounding Reduces Hallucination: Densely grounding text to objects significantly reduces the model's tendency to hallucinate non-existent objects compared to ungrounded or sparsely grounded baselines.
Scaling Laws: Performance on grounding tasks improves consistently with the size of the densely-grounded dataset, suggesting value in further scaling synthetic data generation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Vision-Language Models (VLMs)
Basics of 3D point cloud processing
Concept of 'grounding' in AI (linking text to physical entities)

Key Terms

Grounding: Connecting linguistic terms (e.g., 'red chair') to specific physical objects or coordinates in a 3D scene.

Hallucination: When a model generates text describing objects or attributes that do not actually exist in the input scene.

3D-LLM: A Large Language Model adapted to take 3D spatial data (like point clouds) as input alongside text.

Sim-to-Real Transfer: Training a model on simulated/synthetic data and successfully applying it to real-world data without retraining.

ScanNet: A popular real-world dataset of 3D indoor scenes used for benchmarking.

IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box.

Dense Grounding: Associating every relevant noun phrase in a sentence with a specific object in the scene, rather than just the main subject.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique for LLMs.

ZeRO-2: A memory optimization technique for distributed training of large models.

FlashAttention: An algorithm that speeds up attention computation in Transformers while reducing memory usage.

Visual Grounding: The task of locating an object in an image or 3D scene based on a natural language description.