Customize Segment Anything Model for Multi-Modal Semantic Segmentation with Mixture of LoRA Experts

📝 Paper Summary

Multi-modal semantic segmentation · Parameter-Efficient Fine-Tuning (PEFT) · Foundation Model Adaptation

The paper adapts the Segment Anything Model (SAM) for multi-modal semantic segmentation by using modality-specific LoRA experts and a dynamic routing mechanism to fuse features from diverse sensors like depth and LiDAR.

Core Problem

SAM is trained primarily on RGB images and performs sub-optimally when applied to diverse modalities (depth, thermal, event data) or when modalities are missing/noisy.

Why it matters:

Robotics and autonomous driving rely on multi-modal sensors (LiDAR, thermal) for robustness, but current foundation models like SAM are RGB-centric.
Existing methods struggle with cross-modal inconsistencies (different noise levels, resolutions) and lack mechanisms to handle missing modalities gracefully.
Full fine-tuning of large models like SAM for every new modality is computationally prohibitive.

Concrete Example: When modalities are missing (e.g., a camera failure leaves only depth data), standard multi-modal models often fail catastrophically because they expect complete inputs. The proposed method improves performance by 32.15% in such missing-modality settings on the MUSES dataset.

Key Novelty

Mixture of LoRA Experts (MLE) with Dynamic Routing

Instead of fine-tuning the whole model, it freezes SAM's backbone and trains lightweight LoRA modules specifically for each modality (RGB, Depth, etc.).
A 'Mixture of LoRA Experts' (MLE) router dynamically assigns weights to features from different modalities based on their relevance, allowing the model to ignore noisy or missing inputs.
A dual-pathway decoder combines SAM's original mask decoder with a new auxiliary head to fuse multi-scale features for better semantic accuracy.

Architecture

Figure: Overall architecture of MLE-SAM, showing the modality-specific LoRA encoders, the feature pyramid network, the MLE routing mechanism, and the dual-pathway decoder.

Evaluation Highlights

+28.14% mIoU on the MUSES dataset (3 modalities) over state-of-the-art methods.
+32.15% gain on the MUSES dataset under missing-modality conditions over existing approaches.
+4.90 mIoU on the DELIVER dataset (4 modalities) over state-of-the-art methods (from 55.45 to 60.35).

Breakthrough Assessment

8/10

Large gains on MUSES (+28.14% in the full multi-modal setting and +32.15% with missing modalities) indicate a strong architectural fit for the problem. Adapting a major foundation model (SAM) to multi-modal tasks in a parameter-efficient way is a valuable contribution.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal semantic segmentation where input X contains M modalities (e.g., RGB, Depth, Event) and the goal is pixel-wise class labels.

Inputs: Set of registered images from M modalities X = {X^m | m ∈ [1, M]}

Outputs: Semantic segmentation mask Y assigning a class label to every pixel

Pipeline Flow

Modality-Specific Encoders (Frozen SAM backbone + Modality-specific LoRA)
Feature Pyramid Network (FPN) extraction (SFM, FFP, IFP)
Mixture of LoRA Experts (MLE) Routing (Dynamic weighting & Fusion)
Dual-Pathway Decoding (SAM Mask Decoder + Auxiliary Head)

System Modules

Image Encoder with LoRA

Extract features from each modality independently while adapting to non-RGB data

Model or implementation: Hiera backbone (from SAM2) with frozen weights + trainable LoRA layers

Feature Pyramid Network (FPN)

Refine hierarchical features into three scales: Semantic Feature Map (SFM), Fine-grained Feature Pyramid (FFP), and Intermediate-resolution Feature Pyramid (IFP)

Model or implementation: Convolutional layers with lateral and top-down connections
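
The lateral and top-down connections mentioned above can be illustrated with a minimal NumPy sketch of a generic feature pyramid. This is not the paper's implementation: the three-level layout, the channel counts, the nearest-neighbour upsampling, and the helper names (`conv1x1`, `upsample2x`, `top_down_pyramid`) are all assumptions made for illustration.

```python
import numpy as np


def conv1x1(x, w):
    """Apply a 1x1 convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def upsample2x(x):
    """Nearest-neighbour upsampling of a (C, H, W) map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def top_down_pyramid(features, lateral_weights):
    """Merge features ordered fine -> coarse, each level half the previous size."""
    # The coarsest level has no coarser map to merge in.
    merged = [conv1x1(features[-1], lateral_weights[-1])]
    for feat, w in zip(reversed(features[:-1]), reversed(lateral_weights[:-1])):
        # Lateral connection plus the upsampled coarser map (top-down connection).
        merged.append(conv1x1(feat, w) + upsample2x(merged[-1]))
    return merged[::-1]  # back to fine -> coarse order


# Hypothetical sizes: three encoder stages with 8, 16, and 32 channels.
rng = np.random.default_rng(0)
channels, sizes, out_channels = [8, 16, 32], [16, 8, 4], 4
features = [rng.normal(size=(c, s, s)) for c, s in zip(channels, sizes)]
lateral_weights = [rng.normal(size=(out_channels, c)) for c in channels]
pyramid = top_down_pyramid(features, lateral_weights)
print([p.shape for p in pyramid])
```

Every output level ends up with the same channel count, which is what lets later stages weight and sum maps from different modalities at a given scale.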

MoE Router

Calculate dynamic weights for each modality to prioritize useful information and suppress noise

Model or implementation: Linear layer + Softmax on spatially averaged features
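
A minimal NumPy sketch of "linear layer + softmax on spatially averaged features" follows. The summary does not specify the router's exact input or output shapes, so this version assumes a single scoring vector `w` shared across modalities that yields one scalar logit per modality; `route_and_fuse` and its parameters are hypothetical names.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def route_and_fuse(features, w, b=0.0):
    """features: (M, C, H, W) stack of per-modality maps; w: (C,) scoring weights."""
    pooled = features.mean(axis=(2, 3))   # spatial average -> (M, C)
    logits = pooled @ w + b               # one relevance score per modality
    weights = softmax(logits)             # (M,), non-negative, sums to 1
    # Weighted sum of the modality feature maps.
    fused = np.einsum("m,mchw->chw", weights, features)
    return fused, weights


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 16, 8, 8))  # assumed: 3 modalities, 16 channels
w = rng.normal(size=16)                    # would be learned in practice
fused, weights = route_and_fuse(features, w)
print(fused.shape, weights)
```

With an all-zero `w` the weights are uniform and the fusion reduces to a plain average, which makes the learned scores the only source of modality preference in this sketch.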

Dual-Pathway Decoder

Generate segmentation masks using both SAM's prompt-based decoder and a standard auxiliary segmentation head

Model or implementation: Modified SAM Mask Decoder + MLP-based Auxiliary Head

Novel Architectural Elements

Mixture of LoRA Experts (MLE) routing strategy that adaptively weights features across modalities based on spatially averaged feature embeddings
Dual-pathway segmentation head combining SAM's mask decoder with an auxiliary MLP head for multi-scale fusion

Modeling

Base Model: SAM2 (Segment Anything Model 2) with Hiera backbone

Training Method: Supervised fine-tuning with LoRA and MoE routing

Adaptation: LoRA (Low-Rank Adaptation) applied to query/value projections in attention layers
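
To make the adaptation concrete, here is a minimal NumPy sketch of a LoRA-wrapped projection with one adapter per modality sharing the same frozen weight. It follows the standard LoRA formulation (frozen `W` plus a scaled low-rank update `B @ A`, with `B` initialised to zero); the rank, `alpha`, dimensions, and the `LoRALinear` class name are illustrative assumptions, not values from the paper.

```python
import numpy as np


class LoRALinear:
    """Frozen linear projection W plus a trainable low-rank update B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen, shared
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, in_dim) -> (tokens, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
dim = 64
w_q = rng.normal(size=(dim, dim))  # stand-in for a frozen query projection

# One adapter per modality; every adapter wraps the same frozen weight.
adapters = {m: LoRALinear(w_q, rank=4, seed=i) for i, m in enumerate(["rgb", "depth"])}

x = rng.normal(size=(10, dim))
y = adapters["depth"](x)
print(y.shape, adapters["depth"].trainable_params(), w_q.size)
```

Because `B` starts at zero, each adapter initially reproduces the frozen projection exactly, and only `A` and `B` (512 values here, against 4096 in the frozen weight) would receive gradients.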

Trainable Parameters: LoRA layers, MoE router, FPN neck, and segmentation heads (backbone is frozen)

Training Data:

DELIVER dataset (Depth, Event, LiDAR, RGB)
MUSES dataset (RGB, Depth, Thermal)
MCubeS dataset (RGB, Depth, Normal, Thermal)

Compute: Not reported in the paper

Comparison to Prior Work

vs. CMX: Uses a foundation model (SAM) backbone with LoRA rather than fully training a standard backbone (CMX is not cited in the paper; it represents the implied comparator class)
vs. CWSAM: Uses Mixture of Experts routing for multi-modal fusion rather than single-modality adaptation
vs. Standard SAM: Adds semantic segmentation capability and handles non-RGB modalities explicitly via LoRA

Limitations

No specific computational cost or inference latency analysis provided.
Relies on the quality of the pre-trained SAM backbone; limitations of SAM (e.g., regarding specific domain artifacts) may propagate.
The paper does not explicitly detail the training loss functions used.

Reproducibility

The paper does not state whether code is available. The method relies on standard benchmarks (DELIVER, MUSES, MCubeS) and the open-source SAM2 model.

📊 Experiments & Results

Evaluation Setup

Semantic segmentation on multi-modal datasets including challenging conditions (adverse weather, sensor failure).

Benchmarks:

DELIVER: multi-modal segmentation (RGB, Depth, Event, LiDAR)
MUSES: multi-modal segmentation (RGB, Depth, Thermal)
MCubeS: multi-modal segmentation (RGB, Depth, Normal, Thermal)

Metrics:

mIoU (mean Intersection over Union)
Statistical methodology: Not explicitly reported in the paper
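
For reference, mIoU can be computed from a confusion matrix as in the sketch below. The `ignore_index=255` convention and the choice to average only over classes that appear in the prediction or the ground truth are common defaults assumed here; the summary does not say which conventions the paper follows.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union between two integer label maps."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    # confusion[i, j] = number of pixels with ground truth i predicted as j
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes**2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())


pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.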

Key Results

The proposed MLE-SAM method significantly outperforms state-of-the-art methods on standard multi-modal segmentation benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
DELIVER	mIoU	55.45	60.35	+4.90
MUSES	mIoU	Not reported in the paper	Not reported in the paper	+28.14

The method demonstrates exceptional robustness under challenging conditions, specifically missing modalities.

Benchmark	Metric	Baseline	This Paper	Δ
MUSES	mIoU	Not reported in the paper	Not reported in the paper	+32.15
DELIVER	mIoU	Not reported in the paper	Not reported in the paper	+14.13

Experiment Figures

Performance comparison bar charts (b-e) and a high-level concept diagram (a).

Main Takeaways

The MLE architecture effectively bridges the gap between SAM's RGB-centric pre-training and multi-modal requirements.
The dynamic routing mechanism is particularly effective for missing modalities, showing large gains (+32.15%), likely because it routes around empty inputs.
Dual-pathway decoding successfully leverages both SAM's promptable features and standard semantic features for better accuracy.
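
One way a router could "route around" an empty input is to mask the logits of unavailable modalities before the softmax, so their weights become exactly zero and the rest are renormalised. The sketch below shows that idea only; the summary does not confirm that the paper uses explicit masking, and `masked_routing_weights` is a hypothetical helper.

```python
import numpy as np


def masked_routing_weights(logits, available):
    """Softmax over modality logits, with missing modalities forced to weight 0."""
    logits = np.asarray(logits, dtype=float)
    available = np.asarray(available, dtype=bool)
    if not available.any():
        raise ValueError("at least one modality must be available")
    masked = np.where(available, logits, -np.inf)  # exp(-inf) == 0
    masked = masked - masked.max()                 # numerical stability
    e = np.exp(masked)
    return e / e.sum()


# Assumed scenario: three modalities, the second one is missing.
weights = masked_routing_weights([1.0, 2.0, 0.5], [True, False, True])
print(weights)
```

The missing modality contributes nothing to the fused feature regardless of its raw score, while the remaining weights still sum to 1.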

📚 Prerequisite Knowledge

Prerequisites

Understanding of the Segment Anything Model (SAM) architecture (image encoder, prompt encoder, mask decoder), including the Hiera backbone used in SAM2
Knowledge of Low-Rank Adaptation (LoRA) for fine-tuning
Familiarity with Mixture of Experts (MoE) concepts

Key Terms

SAM: Segment Anything Model—a foundation model for image segmentation trained on 1 billion masks

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by training small rank-decomposition matrices while freezing original weights

MoE: Mixture of Experts—an architecture where different sub-models ('experts') are selectively activated for different inputs

mIoU: mean Intersection over Union—a standard metric for semantic segmentation accuracy measuring overlap between predicted and ground truth masks

LiDAR: Light Detection and Ranging—a remote sensing method that uses light in the form of a pulsed laser to measure variable distances

Event Camera: A bio-inspired sensor that measures changes in brightness at each pixel asynchronously, rather than capturing standard frames

SFM: Semantic Feature Map—high-level features extracted by the encoder

FFP: Fine-grained Feature Pyramid—high-resolution features for detailed segmentation

IFP: Intermediate-resolution Feature Pyramid—mid-level features for segmentation