Segment Anything with Multiple Modalities

📝 Paper Summary

Visual Foundation Models, Multi-Modal Sensor Fusion

MM-SAM adapts the pre-trained Segment Anything Model (SAM) to handle non-RGB sensors and fuse multi-modal data using lightweight, label-efficient modules without requiring ground-truth mask annotations.

Core Problem

The Segment Anything Model (SAM) is trained on RGB images and struggles with non-RGB sensor data (like LiDAR or depth) or multi-modal sensor suites, limiting its use in robotics and remote sensing.

Why it matters:

Robotics and autonomous vehicles rely on diverse sensor suites (LiDAR, thermal, depth) for robust perception, not just RGB cameras.
Existing methods for adapting SAM often require labor-intensive mask annotations for each new modality, or rely on suboptimal data transformations (e.g., rendering sensor data as false-color images) that lose information.
Re-training SAM from scratch on new modalities is computationally prohibitive and limited by the scarcity of large-scale non-RGB datasets.

Concrete Example: When a standard SAM model is prompted with a point on a thermal image of a pedestrian at night, it may fail to segment the person because its features were learned from RGB imagery. MM-SAM aligns the thermal features to SAM's RGB latent space, allowing successful segmentation without retraining the core model.

Key Novelty

Unsupervised Cross-Modal Transfer (UCMT) and Weakly-supervised Multi-Modal Fusion (WMMF)

Adapts SAM's image encoder to non-RGB data by forcing the new sensor's embeddings to statistically align with SAM's original RGB embedding space (Unsupervised Cross-Modal Transfer).
Introduces a Selective Fusion Gate that learns to weight and combine features from multiple sensors (e.g., RGB + Depth) based on confidence, trained only using geometric prompts rather than full masks (Weakly-supervised Multi-Modal Fusion).

Architecture

The overall architecture of MM-SAM, illustrating the two-stage pipeline: Unsupervised Cross-Modal Transfer (UCMT) and Weakly-supervised Multi-Modal Fusion (WMMF).

Evaluation Highlights

Outperforms standard SAM by 17.5 mIoU points on RGB-Thermal segmentation (VT5000 dataset) using the proposed fusion.
Improves over vanilla SAM on the depth (SUN-RGBD) and LiDAR (KITTI) modalities by 6.9 and 28.3 mIoU points, respectively.
Requires only ~0.05% additional trainable parameters compared to the original SAM model, demonstrating extreme parameter efficiency.

Breakthrough Assessment

8/10

Significantly expands SAM's applicability to robotics and remote sensing without requiring expensive mask annotations, addressing a major bottleneck in deploying foundation models to real-world sensor suites.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot or few-shot transfer of a pre-trained RGB segmentation model to non-RGB (X) modalities and multi-modal (I, X) inputs.

Inputs: RGB image I, paired non-RGB sensor data X (e.g., depth, thermal, LiDAR), and geometric prompts (points/boxes).

Outputs: Binary segmentation mask M for the prompted object.

Pipeline Flow

Modality-Specific Encoding (RGB + X)
Unsupervised Cross-Modal Transfer (UCMT) Alignment
Weakly-supervised Multi-Modal Fusion (WMMF) / Selective Fusion Gate
Mask Decoding (SAM Decoder)

System Modules

Image Encoder (RGB) (Encoding)

Encodes standard RGB images using frozen pre-trained SAM ViT.

Model or implementation: ViT-B (SAM backbone, frozen)

Image Encoder (X-modality) (Encoding)

Encodes non-RGB data. Includes a trainable patch embedding layer and LoRA adapters within the frozen ViT blocks.

Model or implementation: ViT-B with LoRA (rank=4)
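
To make the adapter idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer at rank 4. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the `alpha` scaling, and the zero initialisation of `B` follow common LoRA practice rather than anything stated in this summary, and which projections inside each ViT block are adapted is not specified here. The width 768 is the ViT-B embedding dimension.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen pre-trained weight (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))                     # trainable up-projection, starts at zero
        self.scale = alpha / rank                            # assumed LoRA scaling convention

    def __call__(self, x):
        # x: (n_tokens, d_in). Frozen path plus the scaled low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        # Only A and B would receive gradients; the base weight stays frozen.
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(768, 768))      # stand-in for one frozen ViT-B projection
layer = LoRALinear(W, rank=4)
tokens = rng.normal(size=(5, 768))
out = layer(tokens)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so training begins from SAM's original behaviour. Each adapted 768-by-768 projection adds 4 × (768 + 768) = 6,144 trainable values.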

Selective Fusion Gate (SFG)

Computes spatial weight maps to fuse RGB and X embeddings.

Model or implementation: 2-layer CNN + Softmax
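
The gating idea can be sketched in a few lines of NumPy. This is a hedged illustration only: the summary says "2-layer CNN + Softmax" without giving kernel sizes or widths, so the sketch assumes two 1x1 convolutions (per-patch linear layers) with a ReLU in between, and the function name `selective_fusion_gate` and the weight shapes are hypothetical.

```python
import numpy as np


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def selective_fusion_gate(e_rgb, e_x, w1, w2):
    """Fuse two (C, H, W) embedding maps with per-patch softmax weights.

    w1: (hidden, 2C) and w2: (2, hidden) act as two 1x1 convolutions (assumed).
    Returns the fused map (C, H, W) and the weight maps (2, H, W).
    """
    stacked = np.concatenate([e_rgb, e_x], axis=0)                      # (2C, H, W)
    hidden = np.maximum(np.einsum("hc,cyx->hyx", w1, stacked), 0.0)     # 1x1 conv + ReLU
    logits = np.einsum("mh,hyx->myx", w2, hidden)                       # (2, H, W)
    weights = softmax(logits, axis=0)                                   # sums to 1 over modalities
    fused = weights[0] * e_rgb + weights[1] * e_x                       # broadcast over channels
    return fused, weights


rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
e_rgb = rng.normal(size=(C, H, W))
e_x = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(16, 2 * C))
w2 = rng.normal(size=(2, 16))
fused, weights = selective_fusion_gate(e_rgb, e_x, w1, w2)
```

Since the weights form a convex combination at every patch, the fused embedding always lies between the two inputs, which keeps it in a range the frozen mask decoder has already seen.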

Mask Decoder

Generates segmentation masks from embeddings and prompts.

Model or implementation: SAM Mask Decoder (frozen)

Novel Architectural Elements

Selective Fusion Gate (SFG): A lightweight CNN module operating in the latent space to dynamically weight and fuse multi-modal feature maps patch-by-patch.
Separated Patch Embedding: A dedicated trainable patch embedding layer for non-RGB inputs that aligns dimensions before entering the shared (LoRA-adapted) ViT backbone.
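
As a rough illustration of a dedicated patch embedding for a non-RGB input, the sketch below splits a single-channel map into 16x16 patches and projects each patch to an embedding vector, which is equivalent to a convolution whose kernel size and stride both equal the patch size. The single-channel input, the small embedding width, and the name `patch_embed` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np


def patch_embed(x, proj, patch=16):
    """Split a (C, H, W) input into non-overlapping patches and project each one.

    proj: (embed_dim, C * patch * patch).
    Returns an array of shape (H // patch, W // patch, embed_dim).
    """
    c, h, w = x.shape
    if h % patch or w % patch:
        raise ValueError("input height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (gh, gw, C, patch, patch) -> (gh, gw, C*patch*patch)
    patches = x.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4).reshape(gh, gw, -1)
    return patches @ proj.T


rng = np.random.default_rng(0)
depth = rng.normal(size=(1, 64, 64))            # hypothetical single-channel depth map
proj = rng.normal(size=(32, 1 * 16 * 16))       # trainable projection (toy embedding width of 32)
grid = patch_embed(depth, proj)
```

Only `proj` changes with the modality: a one-channel depth map and a three-channel RGB image need projections of different input widths, but both produce tokens of the same embedding size for the shared backbone.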

Modeling

Base Model: Segment Anything Model (SAM) with ViT-B backbone

Training Method: Two-stage adaptation: (1) Unsupervised Cross-Modal Transfer (UCMT), (2) Weakly-supervised Multi-Modal Fusion (WMMF).

Objective Functions:

Purpose: Align non-RGB embeddings with RGB embeddings in the latent space (UCMT).

Formally: L_U = ||e_I - e_X||_2^2 (Mean Squared Error).

Purpose: Train the fusion gate using pseudo-labels generated from single-modality predictions (WMMF).

Formally: L_W = L_bce(M_pred, M_pseudo) + L_dice(M_pred, M_pseudo).
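
The two objectives can be written out as a short NumPy sketch. It is illustrative rather than the authors' code: `ucmt_loss` averages the squared difference over all elements (the mean-squared-error reading of L_U), the BCE term is computed on logits in a numerically stable form, and the Dice smoothing constant `eps` and equal weighting of the two WMMF terms are assumptions.

```python
import numpy as np


def ucmt_loss(e_rgb, e_x):
    """L_U: mean squared error between RGB and X-modality embeddings."""
    return float(np.mean((e_rgb - e_x) ** 2))


def bce_loss(logits, target):
    """Binary cross-entropy on logits: max(z, 0) - z*t + log(1 + exp(-|z|))."""
    z = logits
    return float(np.mean(np.maximum(z, 0.0) - z * target + np.log1p(np.exp(-np.abs(z)))))


def dice_loss(logits, target, eps=1.0):
    """Soft Dice loss; eps is an assumed smoothing constant."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    inter = np.sum(probs * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(target) + eps))


def wmmf_loss(logits, pseudo_mask):
    """L_W: BCE plus Dice between the fused prediction and the pseudo-label."""
    return bce_loss(logits, pseudo_mask) + dice_loss(logits, pseudo_mask)


pseudo = np.zeros((4, 4))
pseudo[:2, :2] = 1.0                                  # toy pseudo-label mask
agree = np.where(pseudo == 1.0, 10.0, -10.0)          # logits that match the pseudo-label
loss_agree = wmmf_loss(agree, pseudo)
loss_disagree = wmmf_loss(-agree, pseudo)
```

In this toy case the loss is near zero when the prediction matches the pseudo-label and large when it contradicts it, which is the signal that would drive the fusion gate during WMMF training.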

Adaptation: LoRA (Low-Rank Adaptation) injected into attention blocks of the image encoder; trainable patch embeddings for new modalities.

Trainable Parameters: 0.05M parameters (approx 0.05% of SAM's total parameters)

Key Hyperparameters:

lora_rank: 4
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper
optimizer: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Vanilla SAM: MM-SAM includes explicit alignment adapters (LoRA) and fusion modules, whereas Vanilla SAM is unadapted.
vs. SAM-DA: MM-SAM supports multi-modal fusion, whereas SAM-DA focuses on single-modal domain adaptation.
vs. Adapters [not cited in paper]: Unlike standard adapter methods that require ground truth masks, MM-SAM is unsupervised/weakly-supervised.

Limitations

Relies on the quality of the pre-trained RGB encoder; if the RGB encoder fails (e.g., extreme darkness), the alignment target for the non-RGB modality might be poor.
The fusion mechanism (SFG) is relatively simple (spatial weighting) and might not capture complex cross-modal interactions.
Evaluation is limited to specific sensor pairs (RGB-Depth, RGB-Thermal, RGB-LiDAR); generalization to arbitrary sensors (e.g., hyperspectral) is not extensively tested.

Reproducibility

Code: https://xiaoaoran.github.io/projects/MM-SAM

The paper uses standard, publicly available datasets (SUN-RGBD, KITTI, VT5000). Exact training hyperparameters (learning rate, batch size, epochs) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluation on cross-modal (single non-RGB sensor) and multi-modal (sensor fusion) segmentation tasks.

Benchmarks:

SUN-RGBD (RGB-Depth Segmentation)
KITTI (RGB-LiDAR Segmentation)
VT5000 (RGB-Thermal Segmentation)

Metrics:

mIoU (Mean Intersection over Union)
Statistical methodology: Not explicitly reported in the paper
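
For reference, the metric can be computed as in the sketch below. It assumes mIoU is the average of per-mask IoU over the evaluated prompts, which is a common convention for promptable segmentation but is not confirmed by this summary; the function names are illustrative.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over Union between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def mean_iou(preds, gts):
    """Average IoU over paired predicted and ground-truth masks."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))


gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True                       # 4 ground-truth pixels
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :3] = True                     # 6 predicted pixels, 4 of them correct
single = iou(pred, gt)                  # intersection 4, union 6
average = mean_iou([pred, gt], [gt, gt])
```

Here the over-segmented prediction scores 4/6, and averaging it with a perfect mask gives 5/6.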

Key Results

Cross-modal adaptation results: performance of MM-SAM on single non-RGB modalities compared to Vanilla SAM.

Benchmark	Metric	Baseline	This Paper	Δ
KITTI (LiDAR only)	mIoU	39.1	67.4	+28.3
VT5000 (Thermal only)	mIoU	64.1	78.4	+14.3

Multi-modal fusion results: performance improvements when combining sensors (RGB + X) using the proposed Weakly-supervised Multi-Modal Fusion (WMMF).

Benchmark	Metric	Baseline	This Paper	Δ
VT5000 (RGB + Thermal)	mIoU	66.5	84.0	+17.5

Experiment Figures

Visual comparison of segmentation results between SAM and MM-SAM on RGB-Depth, RGB-LiDAR, and RGB-Thermal samples.

Main Takeaways

MM-SAM consistently improves segmentation performance on non-RGB modalities (Depth, LiDAR, Thermal) without requiring ground-truth mask annotations.
The Unsupervised Cross-Modal Transfer (UCMT) strategy effectively aligns heterogeneous sensor embeddings to the powerful RGB latent space of SAM.
The Weakly-supervised Multi-Modal Fusion (WMMF) significantly boosts performance over single-modality baselines, especially in challenging scenarios (e.g., thermal imaging where visual RGB cues are weak).
Parameter efficiency is high: adapting to new modalities requires training only a fraction (~0.05%) of the total model parameters.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision Transformers (ViT)
Familiarity with the Segment Anything Model (SAM) architecture
Basics of Low-Rank Adaptation (LoRA) for efficient tuning

Key Terms

SAM: Segment Anything Model—a foundation model for image segmentation capable of zero-shot generalization via prompting.

ViT: Vision Transformer—a model architecture based on self-attention mechanisms used as the image encoder in SAM.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that injects low-rank trainable matrices into frozen model layers.

IoU: Intersection over Union—a standard metric for evaluating segmentation accuracy measuring overlap between predicted and ground truth masks.

RGB-D: RGB plus Depth—multimodal data combining color images with depth information.

LiDAR: Light Detection and Ranging—a sensor method that measures distance to a target by illuminating the target with laser light.

Pseudo-labeling: A process where a model's high-confidence predictions on unlabeled data are used as ground truth for further training.