ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning

📝 Paper Summary

3D Semantic Mapping · Robot Perception and Planning · Open-Vocabulary Scene Understanding

ConceptGraphs constructs 3D scene graphs by fusing 2D segmentation masks into 3D objects, captioning them with vision-language models, and inferring relationships via LLMs to enable open-vocabulary planning.

Core Problem

Existing 3D representations using foundation models produce dense, unstructured per-point features that are memory-inefficient and lack the object-level semantic relationships required for complex planning tasks.

Why it matters:

Robots need to understand abstract queries (e.g., 'find something to sit on') which requires object-level reasoning rather than just geometric reconstruction
Dense feature maps scale poorly to large environments and are difficult to update dynamically
Current 3D scene graphs are typically closed-vocabulary, limiting robots to detecting only a predefined set of object categories trained offline

Concrete Example: When asked 'My wrist hurts... Anything to help?', a standard object detector might fail if 'wrist brace' isn't in its training set. ConceptGraphs identifies a 'power drill' (via reasoning about the wrist pain) or a 'medical kit' by leveraging LLM knowledge grounded in the map.

Key Novelty

Object-Centric 3D Mapping with LLM-Inferred Edges

Replaces dense feature clouds with a graph of discrete object nodes, created by fusing class-agnostic 2D segmentation masks into 3D instances
Uses Large Vision-Language Models (LVLMs) to generate descriptive captions for each 3D object node instead of simple class labels
Leverages Large Language Models (LLMs) to reason about spatial and semantic relationships between objects (edges) and to parse abstract user queries into actionable plans

Architecture

The complete pipeline for building the ConceptGraphs representation from an RGB-D sequence

Evaluation Highlights

+16.47 mAcc (mean Accuracy) improvement over ConceptFusion on open-vocabulary 3D semantic segmentation on the Replica dataset
Achieves 0.80 Recall@1 on complex negation queries (e.g., 'something to drink other than soda'), compared to 0.26 for CLIP-based retrieval
Demonstrates real-world utility on Jackal (wheeled) and Spot (legged) robots for abstract queries like 'Find something this guy would play with' (locating a basketball)

Breakthrough Assessment

8/10

Significantly advances open-vocabulary 3D mapping by moving from dense fields to structured graphs, enabling complex semantic reasoning (affordances/negation) previously difficult for robots.

⚙️ Technical Details

Problem Definition

Setting: Incremental construction of a 3D scene graph M_t = (O_t, E_t) from a sequence of posed RGB-D frames

Inputs: Sequence of RGB-D images I = {I_1, ..., I_t} with camera poses

Outputs: Open-vocabulary 3D scene graph with captioned object nodes and semantic relationship edges
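
To make the setting concrete, the graph M_t = (O_t, E_t) can be pictured as a small container of object nodes and relationship edges. The sketch below is illustrative only: the class names, field names, and the `add_node`/`add_edge` helpers are assumptions made for this summary, not the authors' data structures.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One mapped 3D object (an element of O_t). Field names are hypothetical."""
    node_id: int
    points: list            # fused 3D points as (x, y, z) tuples
    feature: list           # fused semantic feature vector
    caption: str = ""       # free-text description from the captioner

    def centroid(self):
        n = len(self.points)
        return tuple(sum(p[i] for p in self.points) / n for i in range(3))


@dataclass
class Edge:
    """One relationship (an element of E_t) between two object nodes."""
    source: int
    target: int
    relation: str


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, points, feature, caption=""):
        node_id = len(self.nodes)
        self.nodes[node_id] = ObjectNode(node_id, list(points), list(feature), caption)
        return node_id

    def add_edge(self, source, target, relation):
        # Edges may only connect objects that are already in the map.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges.append(Edge(source, target, relation))


graph = SceneGraph()
chair = graph.add_node([(0, 0, 0), (2, 0, 0)], [1.0, 0.0], "a wooden chair")
table = graph.add_node([(1, 1, 1)], [0.0, 1.0], "a dining table")
graph.add_edge(chair, table, "next to")
```

Keeping captions as free text on each node, rather than a class index, is what lets a language model reason over the map later.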

Pipeline Flow

2D Segmentation (SAM) & Feature Extraction (CLIP)
3D Fusion (Point cloud generation & Association)
Node Captioning (LLaVA -> GPT-4)
Edge Generation (GPT-4)
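
The point cloud generation step in the flow above amounts to back-projecting the depth pixels inside each 2D mask into 3D and moving them into the map frame with the camera pose. A minimal sketch using the standard pinhole camera model follows; the function name, the list-of-lists image layout, and the 4x4 camera-to-world pose convention are assumptions for illustration.

```python
def backproject_mask(depth, mask, fx, fy, cx, cy, pose=None):
    """Lift masked depth pixels to 3D points.

    depth, mask: 2D lists indexed as [row][col]; depth in metres.
    fx, fy, cx, cy: pinhole intrinsics.
    pose: optional 4x4 camera-to-world matrix as nested lists (assumed convention).
    """
    points = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or z <= 0:
                continue  # skip pixels outside the mask or with invalid depth
            p = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            if pose is not None:
                # Apply rotation (upper-left 3x3) and translation (last column).
                p = tuple(
                    sum(pose[r][c] * p[c] for c in range(3)) + pose[r][3]
                    for r in range(3)
                )
            points.append(p)
    return points


depth = [[1.0, 2.0], [0.0, 1.0]]
mask = [[True, True], [True, False]]
camera_points = backproject_mask(depth, mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```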

System Modules

Class-agnostic Segmentation (Input Processing)

Generate candidate object masks from RGB images without assigning labels

Model or implementation: Segment Anything Model (SAM)

Feature Extractor (Input Processing)

Compute visual descriptors for each masked region

Model or implementation: CLIP (ViT-based image encoder)

Object Fusion & Association

Match new 2D detections to existing 3D objects using geometric and semantic similarity, or initialize new objects

Model or implementation: Custom greedy assignment logic using DBSCAN for denoising
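
A rough sketch of greedy association is given below. The summary states only that geometric and semantic similarity are combined, so the specific choices here are assumptions: geometric similarity is taken as the fraction of detection points that lie near the existing object, semantic similarity as cosine similarity rescaled to [0, 1], and the `radius` and `min_score` thresholds are arbitrary illustrative values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def overlap_ratio(new_points, obj_points, radius):
    """Fraction of new points with at least one object point within `radius`."""
    if not new_points:
        return 0.0
    hits = sum(
        1 for p in new_points
        if any(math.dist(p, q) <= radius for q in obj_points)
    )
    return hits / len(new_points)


def associate(objects, det_points, det_feature, radius=0.05, min_score=1.1):
    """Greedily merge a detection into its best-matching object, or start a new one.

    `objects` is a list of dicts with keys "points", "feature", "count".
    Returns the index of the object the detection ended up in.
    """
    best_idx, best_score = None, min_score
    for idx, obj in enumerate(objects):
        geometric = overlap_ratio(det_points, obj["points"], radius)
        semantic = 0.5 * cosine(det_feature, obj["feature"]) + 0.5
        score = geometric + semantic
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx is None:
        objects.append(
            {"points": list(det_points), "feature": list(det_feature), "count": 1}
        )
        return len(objects) - 1
    obj = objects[best_idx]
    n = obj["count"]
    # Running average of the feature over all detections merged so far (assumed rule).
    obj["feature"] = [(n * f + g) / (n + 1) for f, g in zip(obj["feature"], det_feature)]
    obj["points"].extend(det_points)
    obj["count"] = n + 1
    return best_idx


objects = []
first = associate(objects, [(0, 0, 0), (0.01, 0, 0)], [1.0, 0.0])
second = associate(objects, [(0.02, 0, 0)], [1.0, 0.0])   # overlaps the first object
third = associate(objects, [(5, 5, 5)], [0.0, 1.0])       # far away, different feature
```

The brute-force neighbour search is quadratic and only suitable for toy inputs; a real system would use a spatial index.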

Node Captioner (Graph Generation)

Generate and refine text descriptions for each mapped 3D object

Model or implementation: LLaVA (LVLM) + GPT-4 (LLM)

Edge Generator (Graph Generation)

Infer semantic and spatial relationships between objects

Model or implementation: GPT-4 (LLM)
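
One way to picture LLM-driven edge inference is as a loop that builds a text prompt from the captions and positions of nearby node pairs and parses a structured reply. The sketch below is hypothetical throughout: the prompt wording, the JSON reply schema, the distance-based pair filter, and the `llm` callable (any function mapping a prompt string to a reply string) are assumptions, and no real model API is called.

```python
import json
import math
from itertools import combinations


def candidate_pairs(nodes, max_distance):
    """Keep only node pairs whose centres are close enough to plausibly be related."""
    return [
        (a["id"], b["id"])
        for a, b in combinations(nodes, 2)
        if math.dist(a["center"], b["center"]) <= max_distance
    ]


def build_edge_prompt(a, b):
    payload = {
        "object_a": {"caption": a["caption"], "center": a["center"]},
        "object_b": {"caption": b["caption"], "center": b["center"]},
    }
    return (
        "Given two objects from a 3D scene, reply with JSON of the form "
        '{"relation": "a on b" | "b on a" | "none"}.\n' + json.dumps(payload)
    )


def infer_edges(nodes, llm, max_distance=1.0):
    by_id = {n["id"]: n for n in nodes}
    edges = []
    for i, j in candidate_pairs(nodes, max_distance):
        reply = llm(build_edge_prompt(by_id[i], by_id[j]))
        try:
            relation = json.loads(reply).get("relation", "none")
        except (json.JSONDecodeError, AttributeError, TypeError):
            relation = "none"  # treat unparseable replies as "no edge"
        if relation != "none":
            edges.append((i, j, relation))
    return edges


nodes = [
    {"id": 0, "caption": "a ceramic mug", "center": [0.0, 0.0, 1.0]},
    {"id": 1, "caption": "a wooden table", "center": [0.0, 0.0, 0.5]},
    {"id": 2, "caption": "a floor lamp", "center": [10.0, 0.0, 0.0]},
]
stub_llm = lambda prompt: '{"relation": "a on b"}'  # stand-in for a real model
edges = infer_edges(nodes, stub_llm)
```

Filtering pairs before prompting keeps the number of model calls well below the quadratic number of node pairs.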

Novel Architectural Elements

Decoupled mapping and semantic reasoning: Geometry is handled by traditional fusion, while semantics are handled by LVLM/LLM captioning and edge inference
LLM-driven edge construction: Using an LLM to infer edges based on node captions and positions, rather than training a specific relationship predictor

Modeling

Base Model: Pipeline of pretrained models (SAM, CLIP, LLaVA, and GPT-4)

Compute: Multiple LVLM (LLaVA) and proprietary LLM (GPT-4) inferences required per scene; specific GPU hours not reported.

Comparison to Prior Work

vs. ConceptFusion: ConceptGraphs is object-based (nodes) rather than point-based (dense features), enabling structural reasoning
vs. Traditional 3DSGs: ConceptGraphs is open-vocabulary via foundation models, not limited to training classes
vs. OVSG: Uses LLMs for edge prediction instead of training a GNN relationship predictor

Limitations

Node captioning errors: LLaVA-7B sometimes hallucinates or misidentifies small objects (e.g., mislabeling them as toothbrushes)
Redundant detections: Occasionally generates duplicate nodes for the same object
Computational cost: High latency and cost due to multiple calls to large proprietary models (GPT-4)

Reproducibility

Code: https://concept-graphs.github.io/

Code publicly available at project website. Uses off-the-shelf pretrained models (SAM, CLIP, LLaVA, GPT-4) without fine-tuning. Prompts for GPT-4 provided in Appendix. Specific versions: gpt-4-0613, LLaVA-7B.

📊 Experiments & Results

Evaluation Setup

3D semantic segmentation and Object Retrieval on simulated and real-world scans

Benchmarks:

Replica Dataset (simulated indoor 3D scene understanding)
REAL Lab (Real-world robot navigation and manipulation) [New]

Metrics:

mAcc (mean Accuracy)
mIoU (mean Intersection over Union)
Node/Edge Precision (Human Eval)
Recall@k (for object retrieval)
Statistical methodology: Not explicitly reported in the paper
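
For reference, the automatic metrics listed above can be computed as follows. This is a generic sketch of the standard definitions over flat label lists, not the paper's evaluation script; averaging over only the classes present in the ground truth is an assumption.

```python
def mean_accuracy(pred, gt):
    """mAcc: per-class accuracy, averaged over classes present in the ground truth."""
    classes = sorted(set(gt))
    per_class = []
    for c in classes:
        total = sum(1 for g in gt if g == c)
        correct = sum(1 for p, g in zip(pred, gt) if g == c and p == c)
        per_class.append(correct / total)
    return sum(per_class) / len(per_class)


def mean_iou(pred, gt):
    """mIoU: per-class intersection over union, averaged over ground-truth classes."""
    classes = sorted(set(gt))
    per_class = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        per_class.append(inter / union)
    return sum(per_class) / len(per_class)


def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k retrieved items."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)


gt = [0, 0, 0, 1]
pred = [0, 0, 1, 1]
macc = mean_accuracy(pred, gt)   # (2/3 + 1/1) / 2
miou = mean_iou(pred, gt)        # (2/3 + 1/2) / 2
r1 = recall_at_k([[3, 1], [2, 5]], [3, 5], k=1)
```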

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Semantic segmentation on Replica: ConceptGraphs gains 9.10 mAcc over one baseline and trails another by 0.56.
Replica	mAcc	31.53	40.63	+9.10
Replica	mAcc	41.19	40.63	-0.56
Object retrieval on Replica: LLM-based reasoning outperforms direct CLIP embedding retrieval on complex queries.
Replica (Negation Queries)	Recall@1	0.26	0.80	+0.54
Replica (Affordance Queries)	Recall@1	0.43	0.57	+0.14
Human evaluation of the constructed scene graphs: edge precision averages 0.88 across scenes.
Replica (Average across scenes)	Edge Precision	Not reported in the paper	0.88	Not reported in the paper

Experiment Figures

Qualitative demonstration of a Jackal robot performing object search with LLM reasoning

Traversability estimation where the robot must push objects to reach a goal

Main Takeaways

ConceptGraphs enables robots to handle complex queries (affordance, negation) significantly better than raw visual embedding search (CLIP) by leveraging LLM reasoning over structured data.
The object-centric approach yields higher semantic segmentation accuracy (40.63 mAcc) compared to dense feature fusion methods like ConceptFusion (24.16 mAcc).
LLM-based edge inference allows the discovery of rich semantic relationships without training specialized graph neural networks, though node captioning quality relies heavily on the underlying LVLM.
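
The raw embedding search used as the comparison point in the first takeaway can be sketched as ranking object nodes by cosine similarity between a query embedding and each node's feature. The three-dimensional vectors below are made-up toy values standing in for real CLIP embeddings, and `retrieve` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, node_features, k=1):
    """Return the ids of the k nodes whose features are most similar to the query."""
    ranked = sorted(
        node_features,
        key=lambda node_id: cosine(query_vec, node_features[node_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy features; real ones would come from an image encoder.
node_features = {
    "mug": [1.0, 0.0, 0.0],
    "sofa": [0.0, 1.0, 0.0],
    "lamp": [0.6, 0.8, 0.0],
}
query = [0.1, 0.9, 0.0]  # toy embedding of a text query
top2 = retrieve(query, node_features, k=2)
```

A search like this scores each node against the query independently, whereas the LLM-based approach reasons over the node captions as text.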

📚 Prerequisite Knowledge

Prerequisites

3D Reconstruction (RGB-D mapping)
Vision-Language Models (CLIP, LLaVA)
Graph-based representations
Basic semantic segmentation

Key Terms

3D Scene Graph (3DSG): A structured representation where nodes represent objects in 3D space and edges represent spatial or semantic relationships between them

Open-vocabulary: The ability to recognize and reason about objects or concepts not explicitly defined or seen during the model's training phase

LVLM: Large Vision-Language Model—a model capable of understanding images and generating text descriptions (e.g., LLaVA)

SAM: Segment Anything Model—a foundation model for generating segmentation masks for any object in an image

CLIP: Contrastive Language-Image Pre-Training—a model that learns to associate images with text descriptions in a shared embedding space

IoU: Intersection over Union—a metric measuring the overlap between two bounding boxes or masks

Affordance: The actionable properties of an object (e.g., a chair 'affords' sitting)

DBSCAN: Density-Based Spatial Clustering of Applications with Noise—a clustering algorithm used here to clean 3D point clouds
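
To illustrate how DBSCAN can clean an object's point cloud, here is a small pure-Python version of the algorithm followed by a filter that keeps the largest cluster. Keeping only the largest cluster is an assumed denoising rule for this sketch, and the `eps` and `min_samples` values are arbitrary; a practical pipeline would use an optimized library implementation.

```python
import math
from collections import Counter


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster index >= 0, or -1 for noise."""
    UNVISITED, NOISE = -2, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        # Neighbourhood includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region(i)
        if len(neighbors) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels


def keep_largest_cluster(points, eps, min_samples):
    """Drop noise and small clusters, keeping the points of the largest cluster."""
    labels = dbscan(points, eps, min_samples)
    counts = Counter(label for label in labels if label >= 0)
    if not counts:
        return []
    largest = counts.most_common(1)[0][0]
    return [p for p, label in zip(points, labels) if label == largest]


cloud = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0), (5, 5, 5)]
labels = dbscan(cloud, eps=0.2, min_samples=3)
cleaned = keep_largest_cluster(cloud, eps=0.2, min_samples=3)
```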