
TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang
Xi’an Jiaotong University, University of Illinois Urbana-Champaign
Computer Vision and Pattern Recognition (2024)
MM Pretraining

📝 Paper Summary

3D Shape Understanding · Multi-modal Representation Learning · Cross-modal Alignment (3D-2D-Text)
TAMM improves 3D shape understanding by adapting CLIP's image features to the synthetic domain and decoupling 3D representations into separate visual and semantic sub-spaces.
Core Problem
Existing methods fail to fully leverage 2D images when pre-training 3D models because rendered images differ from CLIP's natural training images, and image/text features focus on conflicting attributes (visual vs. semantic).
Why it matters:
  • 3D datasets are small and expensive to annotate; transferring knowledge from abundant 2D/text data is crucial for scaling 3D learning
  • Aligning 3D shapes simultaneously with misaligned image and text features (e.g., color vs. function) confuses the model, leading to suboptimal representations
  • Applying off-the-shelf CLIP (Contrastive Language-Image Pre-training) features directly to synthetic 3D renderings incurs a significant domain shift, which degrades performance
Concrete Example: A 3D rendering of a chair might lack background textures seen in real photos, causing CLIP to misinterpret it. Meanwhile, an image feature might capture 'red color' while the text description only says 'office chair' (function), forcing the 3D encoder to align with contradictory signals if not decoupled.
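The conflict described above can be illustrated with a toy contrastive objective. The sketch below assumes the symmetric InfoNCE-style loss commonly used in CLIP-style alignment; the function names, the temperature value, and the 2-D embeddings are hypothetical and are not taken from the paper. It shows that a single shape embedding aligned with the image features pays a large penalty against text features that emphasize different attributes.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys.

    Inputs are assumed to be L2-normalized; temperature is an illustrative value.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def symmetric_contrastive(a, b, temperature=0.07):
    """Average of the a->b and b->a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

def coupled_loss(shape_emb, image_emb, text_emb):
    """One shared shape embedding aligned with both modalities (no decoupling)."""
    return (symmetric_contrastive(shape_emb, image_emb)
            + symmetric_contrastive(shape_emb, text_emb))

# Toy batch of two objects. The image and text features disagree on purpose
# (an exaggerated stand-in for "color" vs. "function").
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.0, 1.0], [1.0, 0.0]]
shape_emb = [[1.0, 0.0], [0.0, 1.0]]  # perfectly aligned with the images

print("image term:", symmetric_contrastive(shape_emb, image_emb))
print("text term:", symmetric_contrastive(shape_emb, text_emb))
print("coupled:", coupled_loss(shape_emb, image_emb, text_emb))
```

In this exaggerated setup no single embedding can drive both terms toward zero, which is the situation the decoupled sub-spaces are meant to avoid.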
Key Novelty
TriAdapter Multi-Modal Learning (TAMM)
  • First, a CLIP Image Adapter adapts the visual encoder's image features to close the domain gap between synthetic 3D renderings and the natural images used in CLIP training
  • Second, Dual Adapters split the 3D feature space into two: a 'visual' sub-space aligned with images and a 'semantic' sub-space aligned with text, preventing conflict between modalities
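The two components above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the residual blend with ratio `alpha`, the single-layer adapters, the random weights, and all class and parameter names are hypothetical. It only shows the data flow: CLIP image features pass through an adapter, and one 3D feature is projected into two separate sub-spaces.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # guard against zero vectors
    return [v / norm for v in x]

def random_matrix(rows, cols, rng):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class ResidualImageAdapter:
    """Hypothetical adapter applied to CLIP image features (assumed precomputed).

    Blends an adapted feature with the original one; alpha=0 keeps CLIP as-is.
    """
    def __init__(self, dim, alpha, rng):
        self.W = random_matrix(dim, dim, rng)
        self.alpha = alpha

    def __call__(self, clip_feat):
        adapted = [max(0.0, v) for v in matvec(self.W, clip_feat)]  # linear + ReLU
        mixed = [self.alpha * a + (1.0 - self.alpha) * c
                 for a, c in zip(adapted, clip_feat)]
        return l2_normalize(mixed)

class DualAdapters:
    """Hypothetical pair of projections splitting one 3D feature into a
    'visual' embedding (to align with images) and a 'semantic' embedding
    (to align with text)."""
    def __init__(self, in_dim, out_dim, rng):
        self.W_visual = random_matrix(out_dim, in_dim, rng)
        self.W_semantic = random_matrix(out_dim, in_dim, rng)

    def __call__(self, shape_feat):
        visual = l2_normalize(matvec(self.W_visual, shape_feat))
        semantic = l2_normalize(matvec(self.W_semantic, shape_feat))
        return visual, semantic

rng = random.Random(0)
image_adapter = ResidualImageAdapter(dim=4, alpha=0.5, rng=rng)
dual = DualAdapters(in_dim=6, out_dim=4, rng=rng)

adapted_image = image_adapter([0.5, 0.5, 0.5, 0.5])       # stand-in CLIP image feature
visual, semantic = dual([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # stand-in 3D encoder output
print(adapted_image, visual, semantic)
```

In training, the visual embedding would be contrasted against the adapted image features and the semantic embedding against the text features, so each alignment term gets its own sub-space.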
Evaluation Highlights
  • Boosts zero-shot classification accuracy on Objaverse-LVIS from 46.8% (OpenShape baseline) to 50.7% using OpenShape's ensemble dataset
  • Raises 5-way 10-shot linear probing accuracy on ModelNet40 from 96.1% (ULIP baseline) to 99.0%
  • Consistently enhances performance across diverse 3D architectures (Point-BERT, SparseConv) and pre-training datasets (ShapeNet, Objaverse)
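For readers unfamiliar with the zero-shot protocol behind numbers like these, the sketch below shows the general idea: each 3D embedding is assigned the class whose text embedding is most similar by cosine similarity. The toy 3-D vectors, the class names, and the function names are hypothetical, and which sub-space (or combination of sub-spaces) TAMM feeds into this step is not specified here, so treat the input as an assumption.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def zero_shot_classify(shape_emb, class_text_embs):
    """Return the class name whose text embedding is closest to shape_emb."""
    scores = {name: cosine(shape_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get)

def top1_accuracy(shape_embs, labels, class_text_embs):
    """Percentage of shapes whose predicted class matches the label."""
    correct = sum(zero_shot_classify(e, class_text_embs) == y
                  for e, y in zip(shape_embs, labels))
    return 100.0 * correct / len(labels)

# Toy stand-ins for text embeddings of prompts such as "a 3D model of a chair".
class_text_embs = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
shape_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.5, 0.0]]
labels = ["chair", "table", "table"]

print(top1_accuracy(shape_embs, labels, class_text_embs))
```

No classifier is trained on the target categories in this setup; accuracy depends entirely on how well the 3D embeddings line up with the text embeddings.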
Breakthrough Assessment
7/10
Solid architectural improvement for multi-modal 3D learning. The decoupling strategy (visual vs. semantic) effectively addresses a specific modality conflict ignored by prior work like ULIP.