SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

📝 Paper Summary

Remote Sensing Object Detection, Multi-Task Learning, Multi-Modal Learning

SM3Det is a unified remote sensing detection model that uses a grid-level sparse Mixture of Experts backbone and dynamic submodule optimization to learn jointly from diverse modalities and tasks while reducing interference between them.

Core Problem

Remote sensing detectors are traditionally trained on a single dataset or modality. Naive unified training degrades because of representation bottlenecks (a crowded shared feature space) and optimization inconsistencies (tasks that differ in learning difficulty) across disparate modalities such as SAR, RGB, and Infrared.

Why it matters:

Airborne platforms (UAVs, satellites) carry multiple sensors, requiring simultaneous processing of diverse data streams rather than maintaining separate models for each.
Existing multi-source methods rely on strictly paired, spatially aligned images, which are scarce and inflexible for real-world applications.
Jointly training a single dense model on conflicting modalities often degrades performance compared to specialized models due to task interference.

Concrete Example: When a single dense model tries to learn both SAR (radar) and Optical (RGB) object detection simultaneously, the distinct pattern concepts (scattering vs. visual) cause feature interference, leading the unified model to underperform compared to two separate models trained individually.

Key Novelty

Grid-level Sparse MoE + Dynamic Submodule Optimization (DSO)

Integrates sparse Mixture of Experts (MoE) into the backbone at the spatial grid level, allowing the model to dynamically route local image patches to experts specialized for specific modalities or shared features.
Introduces Dynamic Submodule Optimization (DSO) to synchronize learning speeds by adjusting learning rates based on task loss history and preventing the shared backbone from overfitting to the hardest task at the expense of others.

Architecture

The overall architecture of SM3Det, illustrating the shared backbone with MoE layers and the separate task heads.

Evaluation Highlights

Outperforms the Single-Task (specialized) baseline by 1.45 mAP points (72.85 to 74.30) on the unified M2Det benchmark, which consists of the DOTA, SARDet-100K, and DroneVehicle datasets.
Achieves 74.30% mAP on the combined benchmark using the Swin-T backbone, surpassing the dense Multi-Task baseline of 71.95%.
The lightweight version reduces parameters significantly while maintaining high performance, demonstrating better parameter efficiency than maintaining separate models.

Breakthrough Assessment

7/10

Strong engineering contribution defining a new practical task (M2Det) and effectively solving the modality-interference problem in unified models using MoE and optimization tricks. While MoE is known, the grid-level application to remote sensing backbones is novel.

⚙️ Technical Details

Problem Definition

Setting: Multi-Modal Datasets and Multi-Task Object Detection (M2Det)

Inputs: Remote sensing images from arbitrary modalities (Optical, SAR, Infrared) without requiring spatial alignment or pairing.

Outputs: Object detections in specified formats (Horizontal Bounding Box or Oriented Bounding Box) depending on the dataset/task.
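The difference between the two output formats can be illustrated with a small sketch. It assumes an OBB is parameterized as (cx, cy, w, h, angle in radians), which is a common convention but not necessarily the one used by the paper or by the individual datasets; the function names are hypothetical.

```python
import math

def obb_to_corners(cx, cy, w, h, angle_rad):
    """Return the four corners (x, y) of a box rotated by angle_rad around its center."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in half_extents
    ]

def corners_to_hbb(corners):
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))

# A 20 x 10 box rotated by 90 degrees occupies a 10 x 20 axis-aligned extent.
rotated_hbb = corners_to_hbb(obb_to_corners(50.0, 40.0, 20.0, 10.0, math.pi / 2))
upright_hbb = corners_to_hbb(obb_to_corners(50.0, 40.0, 20.0, 10.0, 0.0))
```

An HBB can always be derived from an OBB this way, but not the reverse, which is one reason the two box formats are usually handled by separate heads.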

Pipeline Flow

Input Image (Any Modality)
Sparse MoE Backbone (Grid-level routing)
Neck (FPN)
Task-Specific Heads (HBB/OBB)
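The flow above can be sketched as a shared feature extractor followed by a per-task head lookup. This is only a structural illustration: the backbone, neck, and heads are placeholder stubs, every function name is hypothetical, and only the dataset-to-box-format mapping comes from the benchmark description.

```python
# Box format per dataset/task, as listed in the benchmark description.
TASK_BOX_FORMAT = {
    "SARDet-100K": "HBB",
    "DOTA-v1.0": "OBB",
    "DroneVehicle": "OBB",
}

def shared_backbone(image):
    """Placeholder for the sparse MoE backbone; shared by every modality."""
    return {"features": image}

def neck(features):
    """Placeholder for the FPN neck; also shared."""
    return features

def make_head(box_format):
    """Build a stub task head that reports which box format it would predict."""
    def head(features):
        return {"box_format": box_format, "detections": []}
    return head

# One head per dataset/task; the backbone and neck are reused across all of them.
HEADS = {task: make_head(fmt) for task, fmt in TASK_BOX_FORMAT.items()}

def detect(image, task):
    """Run the shared stages, then dispatch to the head that matches the task."""
    return HEADS[task](neck(shared_backbone(image)))

sar_result = detect([[0.0]], "SARDet-100K")
optical_result = detect([[0.0]], "DOTA-v1.0")
```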

System Modules

Sparse MoE Backbone

Extracts features using dynamic experts to handle modality differences.

Model or implementation: Modified Swin-T or LSKNet with MoE layers
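A minimal pure-Python sketch of grid-level sparse routing follows. It is not the authors' implementation: experts are modeled as per-position linear maps (the role a 1x1 convolution plays), the gate is a single linear layer with softmax, and the gate weights of the selected experts are renormalized to sum to one, all of which are assumptions. The expert count (6) and top-k (1) follow the reported hyperparameters.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def grid_moe(feature_map, gate_w, expert_ws, top_k=1):
    """Route every grid cell of an H x W x C map to its own top-k experts."""
    out, routes = [], []
    for row in feature_map:
        out_row, route_row = [], []
        for x in row:
            probs = softmax(matvec(gate_w, x))  # one gate decision per cell
            top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:top_k]
            norm = sum(probs[e] for e in top)   # assumed renormalization
            y = [0.0] * len(expert_ws[0])
            for e in top:
                expert_out = matvec(expert_ws[e], x)
                y = [a + (probs[e] / norm) * b for a, b in zip(y, expert_out)]
            out_row.append(y)
            route_row.append(top)
        out.append(out_row)
        routes.append(route_row)
    return out, routes

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
channels, num_experts = 4, 6
gate_w = rand_matrix(num_experts, channels)
expert_ws = [rand_matrix(channels, channels) for _ in range(num_experts)]
feature_map = [rand_matrix(3, channels) for _ in range(2)]  # a 2 x 3 grid
out, routes = grid_moe(feature_map, gate_w, expert_ws, top_k=1)
```

Because the gate is evaluated per cell, two neighboring cells of the same image may be sent to different experts, which is what distinguishes this from image-level routing.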

Dynamic Submodule Optimization (DSO)

Adjusts learning rates for backbone and heads dynamically.

Model or implementation: Algorithm based on loss history (EMA) and KL Divergence

Task Heads

Predicts bounding boxes and classes specific to the dataset/task.

Model or implementation: Standard detection heads (e.g., from RetinaNet or Rotated RetinaNet)

Novel Architectural Elements

Grid-level MoE routing in backbone: Experts are selected per spatial location in the feature map rather than per image, allowing fine-grained modality handling.
Dynamic Submodule Optimization (DSO) integration: A mechanism that dynamically scales learning rates of specific network modules (backbone vs. heads) based on real-time training stability metrics.

Modeling

Base Model: Swin-Transformer (Swin-T) or LSKNet (Backbones)

Training Method: Joint training on combined datasets using DSO

Objective Functions:

Purpose: Balance task convergence speeds.

Formally: Reweight each head's LR by the ratio of its smoothed historical loss to its current loss.

Purpose: Ensure stable backbone updates.

Formally: Reweight the backbone LR by a consistency score (KL divergence between the current and historical loss distributions).
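The two rules can be sketched as follows. The summary gives only the ingredients (EMA-smoothed loss history, a loss ratio, a KL-based consistency score), so the exact functional forms here are assumptions: the EMA momentum, the normalization of per-task losses into a distribution, and the exp(-KL / temperature) mapping are all hypothetical choices, and the paper's bias hyperparameter is not modeled.

```python
import math

def ema_update(previous, current, momentum=0.9):
    """Exponential moving average of a loss value (momentum is an assumed value)."""
    return momentum * previous + (1.0 - momentum) * current

def head_lr_scale(ema_loss, current_loss, eps=1e-8):
    """Ratio of smoothed historical loss to current loss for one task head."""
    return ema_loss / (current_loss + eps)

def normalize(losses):
    """Turn per-task losses into a distribution that sums to one (assumed form)."""
    total = sum(losses)
    return [v / total for v in losses]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def backbone_lr_scale(current_losses, ema_losses, temperature=1.0):
    """Assumed mapping: consistency = exp(-KL / temperature), equal to 1 when stable."""
    kl = kl_divergence(normalize(current_losses), normalize(ema_losses))
    return math.exp(-kl / temperature)

# Toy per-task losses for three tasks; the first task's loss has just dropped.
ema_losses = [1.0, 2.0, 1.5]
current_losses = [0.8, 2.0, 1.5]

head_scales = [head_lr_scale(h, c) for h, c in zip(ema_losses, current_losses)]
backbone_scale = backbone_lr_scale(current_losses, ema_losses)
stable_scale = backbone_lr_scale(ema_losses, ema_losses)
next_ema = [ema_update(h, c) for h, c in zip(ema_losses, current_losses)]
```

In a training loop, each scale would multiply the base learning rate of the corresponding parameter group before the optimizer step.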

Training Data:

Benchmark constructed by merging: SARDet-100K (SAR, HBB), DOTA-v1.0 (Optical, OBB), DroneVehicle (Infrared/Optical, OBB).

Key Hyperparameters:

top_k_experts: 1
number_of_experts: 6
base_learning_rate: Not explicitly reported in the paper
optimizer: AdamW

Compute: Not reported in the paper

Comparison to Prior Work

vs. Single-Task: SM3Det uses one model for all, saving parameters, and outperforms single-task models by leveraging shared knowledge.
vs. Multi-Task (Dense): SM3Det uses Sparse MoE to prevent feature interference/crowding, achieving higher accuracy.
vs. GradNorm/DWA: SM3Det adjusts Learning Rates (LR) of submodules directly rather than just loss weights or gradients, offering finer control over optimization consistency.

Limitations

Complexity of implementation compared to simple joint training.
Requires careful tuning of the DSO hyperparameters (temperature, bias).
Evaluation is limited to three specific remote sensing datasets.

Reproducibility

Code: https://github.com/zcablii/SM3Det

The benchmark dataset is available at www.kaggle.com/datasets/greatbird/soi-det. The paper also describes how the MoE experts are initialized from pretrained weights (by duplicating the pretrained 1x1 conv weights).
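The duplication-based expert initialization can be pictured with a toy sketch. It assumes each expert is a plain weight matrix standing in for a 1x1 conv kernel; the function name and the toy values are hypothetical, and this is not taken from the released code.

```python
import copy

def init_experts_from_pretrained(pretrained_weight, num_experts):
    """Give every expert its own independent copy of one pretrained weight matrix."""
    return [copy.deepcopy(pretrained_weight) for _ in range(num_experts)]

# Toy 2 x 2 stand-in for a pretrained 1x1 conv kernel.
pretrained = [[0.5, -0.2], [0.1, 0.3]]
experts = init_experts_from_pretrained(pretrained, num_experts=6)

# Copies are independent: changing one expert leaves the rest untouched.
experts[0][0][0] = 9.9
```

Since all experts start identical, the routing decision does not change the layer's output at initialization (provided the selected gate weights sum to one), so training can start from the pretrained dense behavior and let experts diverge gradually.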

📊 Experiments & Results

Evaluation Setup

Object detection across three diverse remote sensing datasets simultaneously.

Benchmarks:

DOTA-v1.0 (Optical Oriented Object Detection)
SARDet-100K (SAR Horizontal Object Detection)
DroneVehicle (Infrared/RGB Oriented Object Detection)

Metrics:

mAP (mean Average Precision)
Parameter count

Key Results

Comparison against Single-Task (Separate Models) and Multi-Task (Dense) baselines showing SM3Det's superiority.

Benchmark	Metric	Baseline	This Paper	Δ
Combined M2Det Benchmark	mAP	72.85 (Single-Task)	74.30	+1.45
Combined M2Det Benchmark	mAP	71.95 (Multi-Task, Dense)	74.30	+2.35
SARDet-100K	mAP	74.56	78.43	+3.87
DOTA-v1.0	mAP	72.48	73.20	+0.72

Ablation studies validating the contributions of MoE and DSO.

Benchmark	Metric	Baseline	This Paper	Δ
Combined M2Det Benchmark	mAP	71.95	73.57	+1.62
Combined M2Det Benchmark	mAP	73.57	74.30	+0.73

Experiment Figures

Visualization of the reweighting curves for the backbone learning rate in the DSO mechanism.

Main Takeaways

Unified training with SM3Det consistently outperforms specialized single-task models, challenging the assumption that diverse modalities must be trained separately.
Grid-level MoE is effective for remote sensing backbones, likely because it allows the model to handle different land-cover or sensor features at a local level.
The DSO mechanism successfully mitigates the 'negative transfer' often seen in multi-task learning by dynamically balancing learning rates, rather than just loss weights.

📚 Prerequisite Knowledge

Prerequisites

Object Detection architectures (Backbone, Neck, Head)
Mixture of Experts (MoE) concepts (gating, experts)
Multi-task learning challenges (gradient conflict, negative transfer)
Remote sensing modalities (SAR, Infrared, RGB)

Key Terms

M2Det: Multi-Modal Datasets and Multi-Task Object Detection—a task definition where a single model detects objects across unconnected datasets of different modalities.

SAR: Synthetic Aperture Radar—an imaging technique using radar waves, effective at night or through clouds but visually distinct from optical images.

MoE: Mixture of Experts, a neural network architecture in which different parts (experts) are activated for different inputs, increasing capacity without a proportional increase in inference cost.

Sparse MoE: A variant of MoE where only a small subset (top-k) of experts is activated for any given input, keeping computation low.

HBB: Horizontal Bounding Box—standard axis-aligned detection box.

OBB: Oriented Bounding Box—rotated detection box, crucial for aerial objects like ships or vehicles that aren't axis-aligned.

DSO: Dynamic Submodule Optimization—the proposed method to adjust learning rates per module to balance convergence speeds and directions.

Grid-level Experts: MoE experts applied to individual spatial positions in the feature map, rather than routing the whole image to one expert.

EMA: Exponential Moving Average—a statistical method used here to smooth historical loss values for stability.

KL Divergence: Kullback-Leibler divergence, an asymmetric measure of how one probability distribution differs from another. DSO uses it to compare the current loss distribution with the historical one and detect optimization instability.