ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi, Runpei Dong, Shaochen Zhang, Haoran Geng, Chunrui Han, Zheng Ge, Li Yi, Kaisheng Ma
Tsinghua University
European Conference on Computer Vision (2024)
MM Benchmark Pretraining Agent

📝 Paper Summary

3D Multimodal LLMs Embodied AI Point Cloud Representation Learning
ShapeLLM bridges 3D point clouds and language models using an enhanced encoder (ReCon++) that distills multi-view visual features via bipartite matching to enable accurate geometry-aware embodied interaction.
Core Problem
Existing 3D-LLMs struggle with the accurate geometry understanding that embodied tasks require, because they rely either on 2D rendered images (causing hallucinations) or on limited single-view 3D distillation.
Why it matters:
  • Real-world agents need precise spatial information (e.g., 6-DoF pose) to manipulate objects, which 2D-based methods often lose
  • Current methods fail to capture multi-granularity semantics (both whole-part and high-resolution details) needed for complex interactions like opening a specific drawer handle
  • There is a 'data desert' for interactive 3D embodied tasks; existing datasets lack the instruction-following structure needed for agent planning
Concrete Example: When an agent is asked to 'grasp the handle', image-based models might hallucinate the handle's position due to occlusion or viewpoint bias, whereas ShapeLLM uses point clouds to identify the precise 3D coordinates.
Key Novelty
ReCon++ Encoder with Selective Multi-View Distillation
  • Upgrades the ReCon 3D encoder by utilizing multi-view images (RGB + Depth) not just as augmentation, but as distillation targets
  • Uses a DETR-inspired bipartite matching (Hungarian algorithm) to selectively match 3D query tokens with the most relevant 2D view features, implicitly learning pose estimation and handling view disorder
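The selective matching idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the cosine-distance cost, and the toy 3-dimensional features are all assumptions made for illustration. For brevity it finds the minimum-cost one-to-one assignment by brute-force search over permutations rather than the Hungarian algorithm; both return the same optimal matching, but brute force is only practical for a handful of views.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_queries_to_views(queries, view_feats):
    """Find the one-to-one assignment of 3D query tokens to 2D view features
    that minimizes total cosine distance (hypothetical cost choice).

    Returns (assignment, loss) where assignment[i] is the index of the view
    matched to query i, and loss is the mean cost over matched pairs.
    Brute force stands in for the Hungarian algorithm here.
    """
    n = len(queries)
    # cost[i][j] = 1 - cos(query_i, view_j); lower means a better match.
    cost = [[1.0 - cosine_similarity(q, v) for v in view_feats] for q in queries]

    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost / n


# Toy example: the view features arrive in an arbitrary (shuffled) order.
queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
view_feats = [[0.0, 0.0, 2.0], [3.0, 0.0, 0.0], [0.0, 1.5, 0.0]]

assignment, loss = match_queries_to_views(queries, view_feats)
print(assignment)  # index of the view matched to each query
```

Because the loss is computed only over the matched pairs, the order in which the views are supplied does not matter, which is one way to read the summary's remark about handling view disorder.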
Evaluation Highlights
  • ReCon++ improves accuracy on the ScanObjectNN benchmark by +1.85% over the previous best result
  • ReCon++ achieves 53.7% zero-shot accuracy on Objaverse-LVIS, surpassing Uni3D-L by +0.6%
  • ShapeLLM-13B achieves 49.3% total accuracy on the new 3D MM-Vet benchmark, outperforming PointLLM by +5.1%
Breakthrough Assessment
8/10
Strong contribution in unifying 3D point cloud processing with LLMs for embodied tasks. The ReCon++ encoder sets a new state of the art on recognition, and the construction of 3D MM-Vet addresses a critical evaluation gap.