Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

📝 Paper Summary

Multi-modal learning, Semantic segmentation, State Space Models (SSMs)

Sigma applies the Mamba state space model to multi-modal semantic segmentation, using a Siamese architecture to fuse RGB and supplementary modalities with linear complexity instead of the quadratic cost of Transformers.

Core Problem

Existing multi-modal segmentation models either have limited, local receptive fields (CNNs) or incur the quadratic cost of self-attention (ViTs), which makes them inefficient on high-resolution multi-modal data.

Why it matters:

Autonomous agents need robust perception in adverse conditions (low light, glare) where RGB fails but Thermal/Depth succeed.
Current ViT-based solutions scale poorly with image resolution due to self-attention, limiting real-time applicability.
CNN-based solutions lack global context, leading to misclassifications in complex scenes.

Concrete Example: In a scene with a round chair next to a sofa, shadows cause baseline models to fragment the chair into several incorrect segments. Sigma uses the depth information to recognize the chair as a single object.

Key Novelty

Siamese Mamba Network (Sigma)

Replaces the standard Transformer or CNN backbone with a Siamese Mamba encoder that processes RGB and X-modality (Thermal/Depth) streams in parallel with linear complexity.
Introduces a Mamba-based fusion mechanism that acts like cross-attention but uses Selective Scan operations to exchange and concatenate features efficiently.
Employs a Channel-Aware Decoder that enhances the standard Mamba block with channel attention (pooling) to better select vital feature channels during upsampling.

Architecture

The overall architecture of Sigma, including the Siamese Mamba Encoder, Fusion Module, and Channel-Aware Decoder.

Evaluation Highlights

Outperforms the state-of-the-art CMNeXt by more than 2 points of mIoU on the PST900 RGB-Thermal dataset.
Achieves higher accuracy than CMNeXt on NYU Depth V2 while using 49.8M fewer parameters (Sigma-Small vs CMNeXt-B2).
Surpasses current methods on the MFNet dataset with fewer FLOPs and parameters in the tiny model variant.

Breakthrough Assessment

8/10

First application of Mamba (SSM) to multi-modal semantic segmentation. It removes the quadratic complexity bottleneck of Transformers while keeping a global receptive field, and it shows strong empirical results.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal semantic segmentation assigning a class label to every pixel

Inputs: Paired images: RGB image I_rgb and X-modality image I_x (Thermal or Depth)

Outputs: Pixel-wise segmentation map

Pipeline Flow

Siamese Encoder (RGB branch + X branch) → Multi-level Feature Extraction
Fusion Module (CroMB + ConMB) at each scale
Channel-Aware Decoder (CAVSSB + Upsampling) → Prediction
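
The three stages above can be traced as a shape-only sketch. This is a minimal illustration, not the authors' implementation: `Feat`, `STRIDES`, `WIDTHS`, and every function name are hypothetical, and the stride and channel values are placeholders chosen for illustration rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feat:
    channels: int
    height: int
    width: int


# Hypothetical stage layout (illustrative values, not from the paper).
STRIDES = (4, 8, 16, 32)
WIDTHS = (96, 192, 384, 768)


def siamese_encoder(h, w):
    """One encoder applied to either modality; returns one feature map per scale."""
    return [Feat(c, h // s, w // s) for c, s in zip(WIDTHS, STRIDES)]


def fuse(rgb, x):
    """Stand-in for CroMB followed by ConMB: two same-shape maps in, one map out."""
    assert rgb == x, "both branches must produce matching shapes"
    return Feat(rgb.channels, rgb.height, rgb.width)


def decode(fused, h, w, num_classes):
    """Stand-in for the Channel-Aware Decoder: upsample back to input resolution."""
    assert len(fused) == len(STRIDES)
    return Feat(num_classes, h, w)


def sigma_forward_shapes(h, w, num_classes):
    rgb_feats = siamese_encoder(h, w)  # RGB branch
    x_feats = siamese_encoder(h, w)    # X branch: same function, i.e. shared weights
    fused = [fuse(r, x) for r, x in zip(rgb_feats, x_feats)]  # fusion at each scale
    return decode(fused, h, w, num_classes)
```

Calling `sigma_forward_shapes(480, 640, 40)` returns a `Feat` with 40 channels at 480 x 640, which mirrors the pixel-wise segmentation map described under Outputs.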

System Modules

Siamese Mamba Encoder

Extract multi-scale features from RGB and X modalities using shared weights

Model or implementation: Cascaded Visual State Space Blocks (VSSB) initialized with VMamba weights

Cross Mamba Block (CroMB) (Fusion)

Facilitate interaction between modalities by using the selective scan mechanism as a cross-attention proxy

Model or implementation: Cross Selective Scan (Cross SS)
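
A minimal single-channel sketch of the cross selective scan idea, assuming a diagonal state matrix and the common shortcut of discretizing B as Delta times B. Following the CroMB description in this summary, B, C, and Delta are all generated from the other modality; exactly which parameters the paper exchanges, and whether the two directions share weights, are assumptions here. All function and parameter names are hypothetical.

```python
import math


def softplus(v):
    return math.log1p(math.exp(v))


def selective_scan(x_seq, param_seq, a_diag, w_b, w_c, w_delta):
    """Scan x_seq with a diagonal SSM whose B, C and Delta come from param_seq.

    x_seq, param_seq: equal-length lists of floats (one channel each).
    a_diag, w_b, w_c: lists of length N (state size); w_delta: float.
    """
    assert len(x_seq) == len(param_seq)
    h = [0.0] * len(a_diag)
    out = []
    for x_t, p_t in zip(x_seq, param_seq):
        delta = softplus(w_delta * p_t)   # input-dependent step size
        b_t = [w * p_t for w in w_b]      # input-dependent B
        c_t = [w * p_t for w in w_c]      # input-dependent C
        # State update: h <- exp(delta * A) * h + delta * B * x
        h = [math.exp(delta * a) * h_n + delta * b_n * x_t
             for a, h_n, b_n in zip(a_diag, h, b_t)]
        out.append(sum(c_n * h_n for c_n, h_n in zip(c_t, h)))
    return out


def cross_selective_scan(rgb_seq, x_seq, **ssm):
    """Each modality is scanned while the other one supplies the scan parameters."""
    y_rgb = selective_scan(rgb_seq, x_seq, **ssm)  # RGB scanned, X steers
    y_x = selective_scan(x_seq, rgb_seq, **ssm)    # X scanned, RGB steers
    return y_rgb, y_x
```

Each call is a single pass over the sequence, so the cost grows linearly with its length, which is the property that lets the scan stand in for cross-attention.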

Concat Mamba Block (ConMB) (Fusion)

Integrate enhanced features by concatenating and scanning them as a single long sequence

Model or implementation: Concat Selective Scan (Concat SS) with inverse scanning
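
The concatenate-then-scan idea can be sketched with a toy recurrence standing in for the selective scan. The simple decaying state, the plain end-to-end concatenation order, and merging the forward and inverse passes by summation are all assumptions made for illustration; the function names are hypothetical.

```python
def linear_scan(seq, decay=0.5):
    """Toy causal scan: h_t = decay * h_{t-1} + x_t (stand-in for a selective scan)."""
    h, out = 0.0, []
    for v in seq:
        h = decay * h + v
        out.append(h)
    return out


def concat_scan(rgb_seq, x_seq, decay=0.5):
    """Scan both modalities as one sequence, forward and inversely, then split."""
    n = len(rgb_seq)
    joint = list(rgb_seq) + list(x_seq)           # one sequence of length 2n
    fwd = linear_scan(joint, decay)               # RGB tokens influence X tokens
    bwd = linear_scan(joint[::-1], decay)[::-1]   # inverse scan, re-aligned
    merged = [f + b for f, b in zip(fwd, bwd)]
    return merged[:n], merged[n:]                 # split back per modality
```

With a forward pass alone, the RGB half would never see the X half because the scan is causal; the inverse pass is what carries information in the other direction.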

Channel-Aware Decoder

Recover spatial resolution while enhancing channel selection

Model or implementation: Channel-Aware Visual State Space Blocks (CAVSSB)
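
As a rough illustration of pooling-based channel attention, the sketch below gates each channel by a sigmoid of its global average. This is a generic squeeze-and-gate pattern under stated assumptions, not the CAVSSB design itself: the exact pooling operators and any learned layers used in the paper are not given in this summary, and `channel_gate` is a hypothetical name.

```python
import math


def channel_gate(feat):
    """Rescale each channel by a sigmoid of its global average.

    feat: list of C channels, each an H x W list of lists of floats.
    """
    gated = []
    for channel in feat:
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)         # global average pooling
        weight = 1.0 / (1.0 + math.exp(-pooled))   # sigmoid gate in (0, 1)
        gated.append([[v * weight for v in row] for row in channel])
    return gated
```

Channels with a larger pooled response receive a weight closer to 1 and are passed through more strongly, which is the "channel selection" effect the decoder is described as adding.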

Novel Architectural Elements

Siamese Mamba Encoder: Weight-sharing VSSB branches for multi-modal input
Mamba-based Fusion (CroMB & ConMB): Replaces cross-attention with cross-selective-scan and concatenated-scan operations
Channel-Aware Decoder: Integrates channel attention into the Mamba block to compensate for VSSB's weakness in inter-channel modeling

Modeling

Base Model: VMamba (initialized with ImageNet-1K pretrained weights)

Training Method: Supervised learning with cross-entropy loss (not stated explicitly; assumed as the standard choice for segmentation)

Adaptation: Fine-tuning on RGB-T/RGB-D datasets

Trainable Parameters: Three variants (Sigma-Tiny, Sigma-Small, Sigma-Base); parameter counts vary by variant, e.g., Sigma-Small has 69.8M

Training Data:

MFNet: 820 day / 749 night images (RGB-T)
PST900: 597 train / 288 test (RGB-T)
NYU Depth V2: 795 train / 654 test (RGB-D)
SUN RGB-D: 5285 train / 5050 test (RGB-D)

Key Hyperparameters:

optimizer: AdamW
learning_rate: 6e-5
weight_decay: 0.01
batch_size: 8
epochs: 500
scheduler: Not reported in the paper
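
The reported values can be collected in a small configuration object. This is only a convenience sketch: `TrainConfig` and its field names are hypothetical, and the scheduler is left unset because it is not reported.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    optimizer: str = "AdamW"
    learning_rate: float = 6e-5
    weight_decay: float = 0.01
    batch_size: int = 8
    epochs: int = 500
    scheduler: Optional[str] = None  # not reported in the paper


config = TrainConfig()
```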

Compute: GPU hardware and training time are not reported in the paper

Comparison to Prior Work

vs. CMX/CMNeXt: Sigma uses Mamba (linear complexity) instead of ViT/SegFormer (quadratic complexity) for the backbone.
vs. CNN methods (EGFNet, MTANet): Sigma offers global receptive fields which CNNs lack.
vs. ViT-based Fusion: Sigma replaces heavy cross-attention mechanisms with efficient selective scan operations.
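
The complexity contrast in these comparisons can be made concrete with a back-of-the-envelope count. The sketch below is a rough proxy under simplifying assumptions: it counts token pairs for full self-attention and scan steps for a four-direction scan, and it ignores channel width, state size, and constant factors. The function names are hypothetical.

```python
def attention_pair_count(height, width):
    """Full self-attention compares every token with every token: N^2 pairs."""
    n = height * width
    return n * n


def scan_step_count(height, width, directions=4):
    """A directional scan visits every token once per direction: directions * N steps."""
    return directions * height * width


for side in (32, 64, 128):
    print(side, attention_pair_count(side, side), scan_step_count(side, side))
```

Doubling the side length of the feature map multiplies the attention count by 16 but the scan count by only 4, which is the scaling gap the comparisons above refer to.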

Limitations

No detailed ablation of the specific computational cost (latency/memory) of the fusion module alone compared to cross-attention.
The paper does not explicitly report statistical significance tests (p-values) for the improvements.
Relies on ImageNet pre-training; behavior from scratch is not explored.

Reproducibility

Code: https://github.com/zifuwan/Sigma

Pretrained VMamba weights are used for initialization. Dataset splits follow the standard protocols cited in the paper.

📊 Experiments & Results

Evaluation Setup

Semantic segmentation on RGB-Thermal and RGB-Depth datasets

Benchmarks:

MFNet (RGB-Thermal Semantic Segmentation)
PST900 (RGB-Thermal Semantic Segmentation)
NYU Depth V2 (RGB-Depth Semantic Segmentation)
SUN RGB-D (RGB-Depth Semantic Segmentation)

Metrics:

mIoU (mean Intersection over Union)
FLOPs (Floating Point Operations)
Parameters (Model size)
Statistical methodology: Not explicitly reported in the paper
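
For reference, mIoU can be computed as below. This is a minimal per-sample sketch with a hypothetical `mean_iou` helper; benchmark toolkits typically accumulate intersections and unions over the whole dataset and handle ignore labels, both of which are omitted here.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat lists of integer class ids."""
    assert len(pred) == len(target)
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.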

Key Results

RGB-Thermal segmentation results show Sigma consistently outperforming baselines on MFNet and PST900.

Benchmark	Metric	Baseline	This Paper	Δ
MFNet	mIoU	59.7	60.1	+0.4
PST900	mIoU	70.68	72.82	+2.14

RGB-Depth segmentation results demonstrate Sigma's strong generalization and efficiency.

Benchmark	Metric	Baseline	This Paper	Δ
NYU Depth V2	mIoU	54.6	55.0	+0.4
SUN RGB-D	mIoU	50.4	51.1	+0.7

Ablation studies confirm the contribution of individual fusion and decoding components.

Benchmark	Metric	Baseline	This Paper	Δ
MFNet	mIoU	54.1	59.0	+4.9

Experiment Figures

Accuracy (mIoU) vs. Efficiency (FLOPs/Params) comparison on MFNet dataset.

Main Takeaways

Sigma achieves state-of-the-art performance across RGB-T and RGB-D tasks, validating the efficacy of Mamba for multi-modal fusion.
The model is significantly more parameter-efficient than Transformer baselines (e.g., Sigma-Small beats CMNeXt-B2 with ~42% fewer parameters on NYU Depth V2).
Qualitative results show better handling of shadows and complex objects (e.g., tactile paving, bollards) compared to baselines.
The specialized Mamba fusion modules (CroMB and ConMB) provide substantial gains over simple summation fusion.
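
The "~42% fewer parameters" figure can be checked against the other numbers quoted in this summary. The baseline parameter count below is implied by adding the two quoted values; it is not itself quoted here.

```python
sigma_small_params_m = 69.8  # Sigma-Small parameter count, from this summary
params_saved_m = 49.8        # "49.8M fewer parameters", from this summary

# Implied baseline size: derived from the two values above, not quoted directly.
baseline_params_m = sigma_small_params_m + params_saved_m
reduction = params_saved_m / baseline_params_m

print(f"implied baseline: {baseline_params_m:.1f}M, reduction: {reduction:.1%}")
# prints "implied baseline: 119.6M, reduction: 41.6%"
```

The result rounds to the roughly 42% quoted above, so the two ways of stating the saving are consistent with each other.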

📚 Prerequisite Knowledge

Prerequisites

Semantic Segmentation architectures (Encoder-Decoder)
State Space Models (SSM) and discretization
Attention mechanisms (Self-attention, Cross-attention)

Key Terms

Mamba: A selective structured state space model (S6) that models sequences with linear complexity while maintaining a global receptive field

Siamese network: An architecture with two identical subnetworks (branches) that share weights, used here to process two different modalities (RGB and X)

SS2D: Selective Scan 2D—a module that scans 2D feature maps in four directions (corners to opposite corners) to model spatial dependencies using 1D SSMs
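
One common reading of the four scan directions is row-major, column-major, and the reverse of each. The sketch below generates those visiting orders for a small grid; treating this as the exact ordering used in the encoder is an assumption, and `four_way_scan_orders` is a hypothetical name.

```python
def four_way_scan_orders(height, width):
    """Return four visiting orders over an H x W grid as lists of (row, col)."""
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    # The reversed orders start at the opposite corner and walk back.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]
```

Each order flattens the 2D map into a 1D sequence that a 1D SSM can scan; combining the four outputs gives every position context from all sides of the image.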

VSSB: Visual State Space Block—the basic building block of the encoder, containing SS2D modules for spatial modeling

CroMB: Cross Mamba Block—a fusion module where selective scan parameters (matrices B, C, Delta) are generated from one modality to modulate the other, enabling cross-modal interaction

ConMB: Concat Mamba Block—a fusion module that concatenates features from two modalities and scans them jointly (and inversely) to integrate information

CAVSSB: Channel-Aware Visual State Space Block—a decoder block that adds channel attention (pooling) to the standard VSSB to enhance channel-specific feature selection

RGB-T: RGB-Thermal imaging

RGB-D: RGB-Depth imaging

mIoU: Mean Intersection over Union—the standard metric for semantic segmentation accuracy