Prompting Multi-Modal Image Segmentation with Semantic Grouping

📝 Paper Summary

Multi-modal learning Parameter-efficient fine-tuning Semantic segmentation

GoPT introduces semantic grouping into visual prompt tuning, enabling efficient adaptation of frozen RGB pre-trained models to multi-modal segmentation tasks by training less than 1% of parameters.

Core Problem

Existing multi-modal segmentation methods rely on full fine-tuning of RGB pre-trained models, which is parameter-inefficient and struggles to balance retaining pre-trained knowledge with learning modality-specific patterns.

Why it matters:

Full fine-tuning creates a large storage burden by requiring separate full model copies for every downstream task.
Limited labeled data in downstream multi-modal tasks (e.g., RGB-Thermal) leads to overfitting when updating all parameters.
Current fusion methods often fail to bridge cross-modal gaps (heterogeneity) or handle information imbalances between modalities.

Concrete Example: In RGB-Thermal segmentation, a standard alignment fusion might fail to align a pedestrian visible in thermal but obscured in RGB because it enforces global distribution matching rather than grouping specific semantic objects.

Key Novelty

Grouping Prompt Tuning (GoPT)

Inserts prompts into a frozen RGB backbone; instead of remaining static learnable vectors, the prompts are generated dynamically by a semantic grouping mechanism.
Uses a Class-Aware Uni-Modal Prompter (CUP) to group auxiliary modality pixels into semantic clusters, enhancing intra-modal feature learning.
Uses an Alignment-Induced Cross-Modal Prompter (ACP) to inject these grouped auxiliary features into the RGB stream, guiding the frozen backbone to process multi-modal data.

Architecture

Overview of the GoPT framework showing the frozen foundation model and the inserted trainable prompters.

Evaluation Highlights

Achieves state-of-the-art 54.1% mIoU on NYUDv2 (RGB-D) using only 0.97M trainable parameters, outperforming full fine-tuning methods with >100M parameters.
Outperforms the specialized RSFNet on MFNet (RGB-T) with 57.4% mIoU vs 56.2%, despite freezing the backbone.
Surpasses previous SOTA on WHU-OS (RGB-SAR) by 2.8 points of Average Accuracy (69.3 vs 66.5) while training <1% of the model parameters.

Breakthrough Assessment

8/10

Significantly reduces parameter costs (trains <1%) while matching or beating SOTA full fine-tuning methods across three distinct multi-modal tasks (Depth, Thermal, SAR). A strong proof-of-concept for prompt tuning in dense prediction.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal semantic segmentation where a model pre-trained on RGB data is adapted to handle pairs of inputs {x_rgb, x_m}

Inputs: An RGB image x_rgb and a spatially aligned auxiliary modality image x_m (e.g., Depth, Thermal, SAR)

Outputs: A pixel-wise semantic segmentation mask S

Pipeline Flow

Input Processing (Patch Embedding for RGB and Aux)
Frozen RGB Backbone (L-layer ViT)
Grouping Prompters (inserted at layers 0 to L-1)
Segmentation Head (prediction)
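
The flow above can be read as a loop over frozen layers with a prompter applied before selected layers. The following is a minimal sketch under stated assumptions, not the paper's implementation: `run_prompted_backbone`, the `prompter` callable, and the `interval` argument are hypothetical names, and tokens are reduced to flat lists of floats so the control flow stays visible.

```python
def run_prompted_backbone(rgb_tokens, aux_tokens, frozen_blocks, prompter, interval=1):
    """Apply a prompter before every `interval`-th frozen block (hypothetical layout).

    frozen_blocks: callables standing in for frozen ViT layers (never updated).
    prompter: callable (layer_index, rgb_tokens, aux_tokens) -> prompted rgb_tokens.
    """
    prompted_layers = []
    for index, block in enumerate(frozen_blocks):
        if index % interval == 0:
            # Only the prompter would carry trainable parameters.
            rgb_tokens = prompter(index, rgb_tokens, aux_tokens)
            prompted_layers.append(index)
        rgb_tokens = block(rgb_tokens)
    return rgb_tokens, prompted_layers


# Toy stand-ins: each "block" doubles the tokens, the "prompter" adds the auxiliary tokens.
toy_blocks = [lambda tokens: [2.0 * t for t in tokens]] * 3
toy_prompter = lambda index, rgb, aux: [r + a for r, a in zip(rgb, aux)]
features, prompted = run_prompted_backbone([1.0], [1.0], toy_blocks, toy_prompter, interval=2)
```

With `interval=1` a prompter runs before every layer, which matches the "layers 0 to L-1" description; larger intervals correspond to inserting fewer prompters.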

System Modules

Patch Embedding

Converts RGB and auxiliary images into initial token sequences

Model or implementation: Linear Projection

Class-Aware Uni-Modal Prompter (CUP) (Prompt Generation)

Groups auxiliary tokens into semantic clusters to extract explicit class-aware features

Model or implementation: Learnable class tokens + Gumbel-Softmax Grouping
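
To make the grouping step concrete, here is a minimal forward-pass sketch of hard Gumbel-Softmax assignment in plain Python. It is an illustration under assumptions, not the paper's code: the function names, the dot-product similarity, and the mean pooling of each group are choices made for this example, and a trainable version would also need a straight-through gradient, which is omitted here.

```python
import math
import random


def gumbel_softmax_hard(logits, tau=1.0, rng=random):
    """Return (one_hot, soft) assignment for one token's group logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1); clamp U away from 0 and 1.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    winner = soft.index(max(soft))
    one_hot = [1.0 if k == winner else 0.0 for k in range(len(soft))]
    return one_hot, soft


def group_tokens(tokens, class_tokens, tau=1.0, rng=random):
    """Assign every token to exactly one class token, then average each group."""
    dim = len(class_tokens[0])
    sums = [[0.0] * dim for _ in class_tokens]
    counts = [0] * len(class_tokens)
    assignments = []
    for token in tokens:
        # Similarity of this token to every class token (dot product is an assumption).
        logits = [sum(t * c for t, c in zip(token, cls)) for cls in class_tokens]
        one_hot, _ = gumbel_softmax_hard(logits, tau, rng)
        group = one_hot.index(1.0)
        assignments.append(group)
        counts[group] += 1
        for d in range(dim):
            sums[group][d] += token[d]
    # Empty groups fall back to their class token so the output shape stays fixed.
    groups = [
        [s / counts[g] for s in sums[g]] if counts[g] else list(class_tokens[g])
        for g in range(len(class_tokens))
    ]
    return assignments, groups
```

Lower values of `tau` make the soft distribution sharper, so the hard and soft assignments agree more closely.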

Alignment-Induced Cross-Modal Prompter (ACP) (Prompt Generation)

Aggregates uni-modal prompts and aligns them with RGB features to create the final visual prompts

Model or implementation: Cross-attention + Grouping mechanism
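
As a rough illustration of how cross-attention can inject auxiliary prompts into the RGB stream, the sketch below lets each RGB token attend over a small set of prompt vectors and adds the result back residually. The single head, the missing query/key/value projections, and the residual addition are simplifying assumptions for this example, and `cross_attention_prompts` is a hypothetical name rather than a function from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_prompts(rgb_tokens, aux_prompts):
    """RGB tokens act as queries; auxiliary prompts act as keys and values."""
    dim = len(rgb_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    prompted = []
    for query in rgb_tokens:
        # Scaled dot-product score between this RGB token and every prompt.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in aux_prompts]
        weights = softmax(scores)
        # Weighted sum of the prompt vectors.
        mixed = [sum(w * p[d] for w, p in zip(weights, aux_prompts)) for d in range(dim)]
        # Residual connection: the frozen RGB feature is kept and the prompt is added.
        prompted.append([q + m for q, m in zip(query, mixed)])
    return prompted
```

The output has the same shape as `rgb_tokens`, so it can be fed to the next frozen layer unchanged.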

Vision Transformer Backbone

Processes RGB tokens modified by prompts to extract deep features

Model or implementation: ViT-Base or ViT-Large (pre-trained with MAE)

Segmentation Head

Maps final features to semantic class masks

Model or implementation: Task-specific prediction layer

Novel Architectural Elements

Hierarchical insertion of Grouping Prompters (CUP + ACP) into frozen ViT layers.
Coupled grouping mechanism where CUP learns uni-modal clusters and ACP projects them into cross-modal space.
Inheritance mechanism where class tokens for layer l+1 are initialized from the prompts of layer l.

Modeling

Base Model: ViT-Base or ViT-Large pre-trained via MAE (Masked Autoencoder) on ImageNet

Training Method: Prompt Tuning (Visual Prompt Learning) with Gumbel-Softmax Grouping

Objective Functions:

Purpose: Minimize the difference between predicted masks and ground truth.

Formally: Standard Cross-Entropy Loss (implied for segmentation).
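
For reference, the usual pixel-wise cross-entropy takes the form below. Since the summary only says the loss is implied, the notation is an assumption introduced here: N is the number of labeled pixels, C the number of classes, y_{i,c} the one-hot ground truth, z_{i,c} the predicted logit, and p_{i,c} the softmax probability.

```latex
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_{i,c},
\qquad
p_{i,c} = \frac{\exp(z_{i,c})}{\sum_{k=1}^{C} \exp(z_{i,k})}
```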

Adaptation: Prompt Tuning (updating only prompt parameters and segmentation head)

Trainable Parameters: 0.97M parameters (less than 1% of total model size)

Training Data:

NYUDv2: 795 train / 654 test
SUN RGB-D: 5,285 train / 5,050 test
MFNet: 784 train / 392 val / 393 test
WHU-OS: 60 train / 20 val / 20 test

Key Hyperparameters:

batch_size: 64
epochs: 60
optimizer: AdamW
learning_rate: 4e-5
learning_rate_schedule: polynomial annealing
initialization: xavier uniform
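
As an illustration of the schedule listed above, the sketch below implements a common form of polynomial learning-rate decay. Only the base rate of 4e-5, the batch size, and the epoch count come from the summary; the exponent `power=0.9`, the final rate of zero, and the assumption that the last partial batch is kept are choices made for this example.

```python
import math


def poly_lr(step, total_steps, base_lr=4e-5, power=0.9, end_lr=0.0):
    """Polynomial decay from base_lr to end_lr (power and end_lr are assumed values)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


# Step count for NYUDv2 under the listed settings, assuming the last partial batch is kept.
steps_per_epoch = math.ceil(795 / 64)
total_steps = steps_per_epoch * 60
schedule = [poly_lr(step, total_steps) for step in range(0, total_steps + 1, 78)]
```

The rate starts at `base_lr` on step 0 and reaches `end_lr` on the final step, decreasing monotonically in between.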

Compute: 1 NVIDIA Tesla A100 GPU

Comparison to Prior Work

vs. VPT: VPT uses static learnable vectors; GoPT uses dynamic prompts generated by grouping auxiliary modality content.
vs. Full Fine-Tuning (FFT): FFT updates all parameters (high cost, overfitting risk); GoPT freezes backbone and updates <1% parameters.
vs. EVP: EVP focuses on low-level structure prompts; GoPT focuses on semantic grouping and cross-modal alignment.
vs. RSFNet: RSFNet designs complex dual-stream architectures; GoPT uses a single frozen stream with lightweight prompts.

Limitations

Relies on a strong RGB pre-trained backbone (MAE-ViT); performance depends on the quality of this frozen foundation.
Hard assignment in grouping (Gumbel-Softmax) can be harder to tune than simple soft attention.
Requires spatially aligned multi-modal pairs; does not explicitly address severe misalignment or parallax issues beyond standard datasets.

Reproducibility

No code URL provided in the paper. The paper uses standard public datasets (NYUDv2, SUN RGB-D, MFNet, PST900, WHU-OS). Implementation details (learning rate, batch size, GPU) are provided.

📊 Experiments & Results

Evaluation Setup

Semantic segmentation on multi-modal datasets (RGB+Depth, RGB+Thermal, RGB+SAR).

Benchmarks:

NYUDv2 (Indoor RGB-D Semantic Segmentation)
SUN RGB-D (Indoor RGB-D Semantic Segmentation)
MFNet (Urban RGB-T Semantic Segmentation)
PST900 (Underground RGB-T Semantic Segmentation)
WHU-OS (Remote Sensing RGB-SAR Classification)

Metrics:

mIoU (mean Intersection over Union)
Pixel Accuracy (PAcc)
Mean Accuracy (mAcc)
Kappa Coefficient
Statistical methodology: Not explicitly reported in the paper
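
The four metrics listed above can all be derived from a single confusion matrix. The sketch below uses the standard textbook definitions; it is not taken from the paper's evaluation code, and the function name and the toy matrix are illustrative.

```python
def segmentation_metrics(confusion):
    """confusion[i][j] = number of pixels with true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    true_counts = [sum(row) for row in confusion]
    pred_counts = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    correct = [confusion[k][k] for k in range(n)]

    pixel_acc = sum(correct) / total
    # Per-class accuracy and IoU, skipping classes that never occur.
    class_acc = [correct[k] / true_counts[k] for k in range(n) if true_counts[k] > 0]
    unions = [true_counts[k] + pred_counts[k] - correct[k] for k in range(n)]
    ious = [correct[k] / unions[k] for k in range(n) if unions[k] > 0]
    # Cohen's kappa: agreement beyond what the class marginals would give by chance.
    expected = sum(t * p for t, p in zip(true_counts, pred_counts)) / (total * total)
    kappa = (pixel_acc - expected) / (1.0 - expected) if expected < 1.0 else 0.0

    return {
        "PAcc": pixel_acc,
        "mAcc": sum(class_acc) / len(class_acc),
        "mIoU": sum(ious) / len(ious),
        "Kappa": kappa,
    }


# Toy two-class example: 3 of 4 pixels are correct in each class.
metrics = segmentation_metrics([[3, 1], [1, 3]])
```

In the toy example each class has IoU 3 / (4 + 4 - 3) = 0.6, so mIoU is lower than pixel accuracy (0.75) because every error counts against two classes at once.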

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GoPT outperforms existing state-of-the-art methods on RGB-D datasets while using significantly fewer trainable parameters.
NYUDv2	mIoU	53.3	54.1	+0.8
SUN RGB-D	mIoU	49.8	52.3	+2.5
In RGB-Thermal segmentation, GoPT surpasses specialized architectures designed specifically for this task.
MFNet	mIoU	56.2	57.4	+1.2
PST900	mIoU	80.5	81.5	+1.0
GoPT shows strong generalization to remote sensing data (RGB-SAR).
WHU-OS	Average Accuracy (AA)	66.5	69.3	+2.8
Ablation studies demonstrate the efficiency and effectiveness of the proposed components compared to standard visual prompting.
NYUDv2	mIoU	52.8	54.1	+1.3
NYUDv2	mIoU	53.1	54.1	+1.0

Experiment Figures

Detailed structure of the Uni-modal Grouping Module (CUP).

Line graph showing the impact of the number of grouping prompters on performance.

Main Takeaways

GoPT consistently outperforms Full Fine-Tuning (FFT) across all datasets while modifying less than 1% of the parameters, validating the efficacy of prompt tuning for dense prediction.
Hard assignment (one-hot) in the grouping mechanism works significantly better than soft assignment, likely by reducing ambiguity in feature clustering.
The method is modality-agnostic, showing improvements in RGB-D, RGB-T, and RGB-SAR tasks without specific architectural changes for each.
Segmentation performance improves as more prompters are inserted (that is, as the insertion interval shrinks).

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT) and Patch Embeddings
Visual Prompt Tuning (VPT) concepts
Multi-modal fusion strategies (alignment vs. aggregation)
Gumbel-Softmax for differentiable categorical sampling

Key Terms

GoPT: Grouping Prompt Tuning—the proposed framework that uses semantic grouping to generate visual prompts for multi-modal adaptation.

CUP: Class-Aware Uni-Modal Prompter—a module that groups auxiliary modality tokens into semantic clusters to learn modality-specific prompts.

ACP: Alignment-Induced Cross-Modal Prompter—a module that aggregates the learned uni-modal prompts and injects them into the RGB backbone via cross-attention.

MAE: Masked Autoencoder—a self-supervised pre-training method for Vision Transformers used here as the foundation model initialization.

RGB-D: Red-Green-Blue plus Depth—a multi-modal format combining color images with distance information.

RGB-T: Red-Green-Blue plus Thermal—a multi-modal format combining color images with infrared thermal data.

RGB-SAR: Red-Green-Blue plus Synthetic Aperture Radar—a multi-modal format used in remote sensing.

Gumbel-Softmax: A reparameterization trick that allows differentiable sampling from a categorical distribution, used here to assign pixels to semantic groups.

mIoU: Mean Intersection over Union—a standard metric for semantic segmentation accuracy.

hard assignment: Assigning a token to exactly one group (one-hot) rather than a probability distribution, found to be more effective in this paper.