LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

📝 Paper Summary

Multi-modal Instruction Tuning | 3D Vision-Language Models | Evaluation Benchmarks

LAMM introduces a comprehensive open-source ecosystem including a multi-modal instruction-tuning dataset covering 2D images and 3D point clouds, a training framework, and a benchmark for evaluating MLLMs on vision tasks.

Core Problem

Existing Multi-modal Large Language Models (MLLMs) are either closed-source (GPT-4V) or lack comprehensive benchmarks and datasets, particularly for 3D modalities and fine-grained localization tasks.

Why it matters:

Current open-source MLLMs struggle with fine-grained tasks like object detection and counting because their training data lacks dense visual annotations converted into language instructions
There is a lack of standardized benchmarks to quantitatively evaluate MLLMs on traditional computer vision tasks (detection, OCR, etc.) beyond simple conversation
3D vision-language research is lagging behind 2D due to the scarcity of high-quality 3D instruction-following datasets

Concrete Example: In object counting tasks (e.g., 'How many seashells are there?'), existing MLLMs like MiniGPT-4 often fail to output a number or hallucinate, whereas LAMM's dataset includes specific counting instruction templates to teach this capability.
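A counting evaluation needs a way to turn a free-form response into a number before any error can be computed. The following is a minimal sketch of that step, not the paper's evaluation code: the helper names `parse_count` and `counting_mae` are hypothetical, and taking the first integer in the response is an assumed parsing rule.

```python
import re


def parse_count(response):
    """Return the first integer in a model response, or None if there is none."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None


def counting_mae(responses, ground_truth):
    """Mean absolute error over parseable responses, plus the number of failures.

    A response that contains no digits is counted as a failure rather than
    being assigned an arbitrary error (an assumption of this sketch).
    """
    errors = []
    failures = 0
    for text, target in zip(responses, ground_truth):
        predicted = parse_count(text)
        if predicted is None:
            failures += 1
        else:
            errors.append(abs(predicted - target))
    mae = sum(errors) / len(errors) if errors else float("nan")
    return mae, failures
```

Reporting the failure count separately keeps "no number was produced" distinct from "a wrong number was produced", which is the failure mode described in the example above.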

Key Novelty

Language-Assisted Multi-Modal (LAMM) Ecosystem

First open-source instruction tuning dataset to include 3D point clouds alongside images, enabling MLLMs to understand and reason about 3D environments
Constructs 'Visual Task Dialogues' by converting dense vision annotations (bounding boxes, keypoints) into instruction-response pairs using GPT-API, enhancing model grounding capabilities
Proposes two new evaluation strategies for MLLMs: a Binary Locating Metric for object grounding and a GPT-based scoring metric for caption quality
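To make the idea of turning dense annotations into instruction-response pairs concrete, here is a minimal template-based sketch. It is a stand-in for illustration only: the paper generates its dialogues with GPT-API, whereas this code uses a fixed template, and the annotation fields (`label`, `bbox`), the instruction wording, and the function name `boxes_to_dialogue` are all assumptions.

```python
def boxes_to_dialogue(annotations):
    """Convert box annotations into one instruction-response pair.

    annotations: list of dicts such as {"label": "dog", "bbox": [x1, y1, x2, y2]}
    (a hypothetical schema; coordinates are assumed to be normalized to [0, 1]).
    """
    instruction = "Detect all objects in the image and give their bounding boxes."
    if not annotations:
        return {"instruction": instruction, "response": "No objects are annotated."}

    parts = []
    for item in annotations:
        coords = ", ".join("{:.2f}".format(v) for v in item["bbox"])
        parts.append("{} at [{}]".format(item["label"], coords))
    response = "The image contains " + "; ".join(parts) + "."
    return {"instruction": instruction, "response": response}
```

A language-model-generated dialogue would vary the phrasing far more than a fixed template does; the sketch only shows how box coordinates end up as text that an LLM can be trained to produce.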

Architecture

The LAMM training framework supporting multiple modalities (Images and Point Clouds) with a shared LLM.

Evaluation Highlights

LAMM baseline improves on LLaVA by roughly +112% relative (14.73 -> 31.2, +16.47 points) on the proposed Binary Locating Metric, demonstrating superior grounding ability
Achieves 49.88% accuracy on ScienceQA (Image), outperforming MiniGPT-4 (43.43%) and mPLUG-owl (36.39%) in zero-shot settings
Successfully extends MLLM capabilities to 3D tasks, achieving 26.54% accuracy on ScanQA (3D VQA) in zero-shot and 99.89% after fine-tuning (though likely overfitting due to small dataset size)

Breakthrough Assessment

8/10

Significant contribution as one of the first comprehensive benchmarks and datasets covering both 2D and 3D modalities. The framework is standard, but the dataset construction methodology and 3D extension are highly valuable.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal instruction tuning where an LLM is fine-tuned to generate text responses given text instructions and visual inputs (2D images or 3D point clouds)

Inputs: Natural language instruction $X_q$ and visual input $X_v$ (image or point cloud)

Outputs: Natural language response $Y$

Pipeline Flow

Visual Encoder (Image or Point Cloud)
Trainable Projector
LLM Backbone (with LoRA)
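The three stages above can be sketched as plain matrix operations. This is a toy illustration under stated assumptions, not the LAMM implementation: the projector is modeled as a single bias-free linear map, visual tokens are placed before text tokens, and the function names and tiny dimensions are hypothetical.

```python
def linear(x, weight):
    """Multiply row vectors x (n x d_in) by weight (d_in x d_out)."""
    return [
        [sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def build_llm_input(visual_features, projector_weight, text_embeddings):
    """Project encoder features to the LLM width and prepend them to the text.

    visual_features:  n_visual x d_visual (output of the frozen visual encoder)
    projector_weight: d_visual x d_llm    (the trainable projector)
    text_embeddings:  n_text x d_llm      (embedded instruction tokens)
    """
    visual_tokens = linear(visual_features, projector_weight)
    return visual_tokens + text_embeddings  # list concatenation along the sequence


# Toy sizes: 4 visual tokens of width 3, an LLM width of 2, 5 text tokens.
visual = [[1.0, 0.0, 0.0]] * 4
projector = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
text = [[0.0, 0.0]] * 5
sequence = build_llm_input(visual, projector, text)
print(len(sequence), len(sequence[0]))  # prints: 9 2
```

The LLM then consumes the combined sequence exactly as it would a sequence of ordinary token embeddings.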

System Modules

Image Encoder (Input Processing)

Extract visual features from 2D images

Model or implementation: CLIP ViT-L/14 (Pre-trained)

Point Cloud Encoder (Input Processing)

Extract visual features from 3D point clouds

Model or implementation: PointNet++ tokenizer + CLIP ViT-L/14 (FrozenCLIP style)

Projector

Project visual features into the text embedding space of the LLM

Model or implementation: Trainable Projection Layer

LLM Backbone

Generate text response based on concatenated visual and text tokens

Model or implementation: Vicuna-13B

Novel Architectural Elements

Integration of a 3D point cloud encoder (PointNet++ -> CLIP) into the instruction tuning framework alongside the image encoder
Modality-specific LoRA modules (parameters for different vision modalities are not shared) within the shared LLM
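One way to picture modality-specific LoRA inside a shared LLM is a frozen weight matrix with one pair of low-rank matrices per modality, selected at call time. The sketch below is an assumption-laden illustration, not the paper's code: the class name `ModalityLoRALinear`, the modality keys, the initialization, and the omission of the usual LoRA scaling factor are all choices made here for brevity.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class ModalityLoRALinear:
    """Frozen weight W plus a separate low-rank update A @ B for each modality."""

    def __init__(self, weight, rank, modalities, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(weight), len(weight[0])
        self.weight = weight  # shared and frozen
        self.adapters = {
            name: {
                # A starts small and random, B starts at zero, so the initial
                # update is zero and the layer behaves like the frozen weight.
                "A": [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)],
                "B": [[0.0] * d_out for _ in range(rank)],
            }
            for name in modalities
        }

    def forward(self, x, modality):
        adapter = self.adapters[modality]
        delta = matmul(matmul(x, adapter["A"]), adapter["B"])
        return add(matmul(x, self.weight), delta)
```

Because each modality owns its own `A` and `B`, updating the image adapter leaves the point cloud path untouched while both still share the frozen weight.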

Modeling

Base Model: Vicuna-13B

Training Method: Supervised Fine-Tuning (Instruction Tuning)

Objective Functions:

Purpose: Minimize the difference between generated text and ground truth response.

Formally: Standard auto-regressive language modeling loss (Cross-Entropy) on the prediction tokens.
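Written in the notation of the Problem Definition, this loss takes the usual form below. The token index $t$, the response length $T$, and the symbol $\theta$ for the trainable parameters are notational assumptions introduced here rather than symbols taken from the paper.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid X_v, X_q, y_{<t}\right),
\qquad Y = (y_1, \dots, y_T)
```

Only the tokens of the response $Y$ contribute to the sum; the instruction $X_q$ and the visual input $X_v$ appear solely as conditioning context.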

Adaptation: LoRA (rank=32) applied to projection layers in self-attention; separate LoRA params for different modalities

Trainable Parameters: Projection layers and LoRA parameters only
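A quick count shows why training only LoRA parameters is cheap. The sketch below assumes a square attention projection of width 5120 (the hidden size of the LLaMA-13B family on which Vicuna-13B is based; this figure is not stated in the summary) and the rank of 32 given above.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is d_in x rank, B is rank x d_out."""
    return rank * (d_in + d_out)


hidden = 5120  # assumed hidden size, not taken from the summary
rank = 32      # LoRA rank reported above

full = hidden * hidden
lora = lora_param_count(hidden, hidden, rank)
print(full, lora, f"{lora / full:.2%}")  # prints: 26214400 327680 1.25%
```

Under these assumptions each adapted projection trains about 1.25% of the parameters of the frozen matrix it modifies.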

Training Data:

Image Dataset: 186,098 pairs (Daily Dialogue, Factual Knowledge, Detailed Description, Visual Task Dialogue)
Point Cloud Dataset: 10,262 pairs
Sources: COCO, S3DIS, Bamboo, CLEVR3D, etc., processed via GPT-API

Key Hyperparameters:

lora_rank: 32
image_resolution: 224x224
image_patches: 256

Compute: 4 A100 GPUs for ~24 hours (baseline model training)

Comparison to Prior Work

vs. LLaVA: LAMM includes 3D point cloud data and specific 'Visual Task Dialogue' data (bounding boxes -> text) to improve localization.
vs. MiniGPT-4: LAMM provides a comprehensive benchmark (11 image datasets + 3 point cloud datasets) rather than qualitative demos.
vs. GPT-4V [not cited in paper - concurrent]: LAMM is open-source and provides transparent training data/code.

Limitations

Dataset generated by GPT-API relies on text-only inputs (captions/boxes), lacking direct visual access, leading to potential hallucinations or missing details.
Metric values fluctuate because of the diversity of language model outputs.
Performance on fine-grained classification (e.g., CIFAR10) and OCR (SVT) is lower than specialized baselines or LLaVA in some zero-shot settings.
3D dataset scale is relatively small (10k samples), leading to potential overfitting in fine-tuning experiments.

Reproducibility

Code: https://openlamm.github.io/

Code, data, and models are publicly available at the project page above. The baseline was trained on 4 A100s; the framework also supports V100 and RTX3090 GPUs.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on unseen datasets and fine-tuning evaluation on specific tasks.

Benchmarks:

LAMM-Benchmark (2D) (9 tasks including Detection, VQA, OCR, Counting) [New]
LAMM-Benchmark (3D) (3 tasks including 3D Detection, Visual Grounding, 3D VQA) [New]

Metrics:

Traditional Metrics (Acc, mAP, MAE)
Binary Locating Metric (accuracy of predicted position vs GT box)
GPT Metric (score 1-10 evaluated by GPT)
Statistical methodology: Not explicitly reported in the paper
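The Binary Locating Metric can be sketched as a point-in-box test averaged over samples. This is a minimal illustration, not the benchmark's code: it assumes one predicted point and one ground-truth box per sample, boxes given as (x1, y1, x2, y2), and inclusive boundaries; the function names are hypothetical.

```python
def point_in_box(point, box):
    """True if (x, y) lies inside or on the edge of box (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def binary_locating_accuracy(predicted_points, gt_boxes):
    """Fraction of samples whose predicted point falls in the ground-truth box."""
    if len(predicted_points) != len(gt_boxes):
        raise ValueError("need exactly one predicted point per ground-truth box")
    hits = sum(point_in_box(p, b) for p, b in zip(predicted_points, gt_boxes))
    return hits / len(gt_boxes)


points = [(5, 5), (20, 20)]
boxes = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(binary_locating_accuracy(points, boxes))  # prints: 0.5
```

Unlike mAP, this test gives credit for a roughly correct location even when the model cannot produce a tight box.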

Key Results

Zero-shot performance on 2D vision tasks shows LAMM excels at VQA and Localization compared to other open-source MLLMs.
Benchmark	Metric	Baseline	This Paper	Δ
VOC2012 (Object Detection)	mAP	1.42	7.20	+5.78
SQAimage (VQA)	Accuracy	43.43	49.88	+6.45
Binary Locating Metric	Accuracy	14.73	31.2	+16.47
GPT Metric	Score	50.16	48.44	-1.72
CIFAR10 (Classification)	Accuracy	60.83	37.9	-22.93

3D task performance establishes a baseline for future research.
Benchmark	Metric	Baseline	This Paper	Δ
ScanQA (3D VQA)	Accuracy	Not reported in the paper	26.54	Not reported in the paper
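The Δ column is the absolute difference "This Paper" minus "Baseline", in the metric's own units. A small consistency check over the 2D rows, with the values copied from the table and a hypothetical helper name:

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table above
ROWS = [
    ("VOC2012 (Object Detection)", 1.42, 7.20, 5.78),
    ("SQAimage (VQA)", 43.43, 49.88, 6.45),
    ("Binary Locating Metric", 14.73, 31.2, 16.47),
    ("GPT Metric", 50.16, 48.44, -1.72),
    ("CIFAR10 (Classification)", 60.83, 37.9, -22.93),
]


def mismatched_deltas(rows):
    """Return the benchmarks whose reported delta differs from ours - baseline."""
    return [
        name
        for name, baseline, ours, delta in rows
        if round(ours - baseline, 2) != delta
    ]


print(mismatched_deltas(ROWS))  # prints: []
```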

Experiment Figures

Overview of the dataset construction pipeline using GPT-API.

Analysis of counting performance and training data ablation.

Main Takeaways

LAMM demonstrates stronger localization abilities (higher mAP and Binary Locating scores) compared to LLaVA and MiniGPT-4, attributed to the inclusion of explicit detection-oriented instruction data.
The model struggles with counting tasks (high MAE) and fine-grained classification (lower CIFAR10 accuracy) in zero-shot settings, indicating limitations in current MLLM visual reasoning.
Introduction of 3D point cloud modality allows the model to perform 3D VQA and grounding, establishing a new baseline for multi-modal agents in 3D environments.
GPT-based evaluation metrics correlate better with human judgment for captioning than traditional BLEU scores: LAMM produces more detailed captions that nonetheless score lower on BLEU.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture and Vision Transformers (ViT)
Large Language Models (LLMs) and Instruction Tuning
CLIP (Contrastive Language-Image Pre-training)
LoRA (Low-Rank Adaptation)

Key Terms

MLLM: Multi-modal Large Language Model—an LLM adapted to accept inputs from other modalities like images or audio

Instruction Tuning: Fine-tuning a pre-trained LLM on a dataset of (instruction, output) pairs to improve its ability to follow user commands

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices

CLIP: Contrastive Language-Image Pre-training—a model trained to align images and text in a shared embedding space

Point Cloud: A set of data points in a 3D coordinate system, commonly used to represent 3D shapes or scenes

PointNet++: A deep neural network architecture that directly processes point clouds by learning hierarchical features

Zero-shot: Evaluating a model on tasks or classes it has not explicitly seen during training

Binary Locating Metric: A metric proposed in this paper where the model's output location is considered correct if it falls within the ground-truth bounding box