
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection

Yuxuan Li, Xiang Li, Yunheng Li, Yicheng Zhang, Yimian Dai, Qibin Hou, Ming-Ming Cheng, Jian Yang
Proceedings of the AAAI Conference on Artificial Intelligence (2024)
MM Benchmark

📝 Paper Summary

Remote Sensing Object Detection · Multi-Task Learning · Multi-Modal Learning
SM3Det is a unified remote sensing detection model that uses a grid-level sparse Mixture of Experts backbone and dynamic submodule optimization to jointly learn from diverse modalities and tasks without conflict.
Core Problem
Traditional remote sensing models are each trained on a single dataset or modality. Naive unified training fails because of representation bottlenecks (a crowded shared feature space) and optimization inconsistencies (tasks that differ in learning difficulty) across disparate modalities such as SAR, RGB, and Infrared.
Why it matters:
  • Airborne platforms (UAVs, satellites) carry multiple sensors, requiring simultaneous processing of diverse data streams rather than maintaining separate models for each.
  • Existing multi-source methods rely on strictly paired, spatially aligned images, which are scarce and inflexible for real-world applications.
  • Jointly training a single dense model on conflicting modalities often degrades performance compared to specialized models due to task interference.
Concrete Example: When a single dense model learns SAR (radar) and optical (RGB) object detection simultaneously, the two modalities describe objects through different cues (radar scattering signatures vs. visual appearance). The resulting feature interference causes the unified model to underperform two separate models trained individually.
Key Novelty
Grid-level Sparse MoE + Dynamic Submodule Optimization (DSO)
  • Integrates sparse Mixture of Experts (MoE) into the backbone at the spatial grid level, allowing the model to dynamically route local image patches to experts specialized for specific modalities or shared features.
  • Introduces Dynamic Submodule Optimization (DSO), which synchronizes learning speeds by adjusting learning rates based on each task's loss history, preventing the shared backbone from overfitting to the hardest task at the expense of the others.
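The loss-history idea behind DSO can be illustrated with a minimal sketch. The summary does not give the paper's actual update rule, so everything here is an assumption made for illustration: the class name `LossHistoryLRScaler`, the window size, the clamping bounds, and the rule itself (scale each task's learning rate by how slowly its loss is falling relative to the average task). It is not the authors' formula.

```python
from collections import deque


class LossHistoryLRScaler:
    """Hypothetical helper: derives per-task learning rates from loss history."""

    def __init__(self, tasks, base_lr=1e-4, window=3, min_scale=0.5, max_scale=2.0):
        self.base_lr = base_lr
        self.window = window
        self.min_scale = min_scale
        self.max_scale = max_scale
        # Keep the last 2 * window losses for every task.
        self.history = {t: deque(maxlen=2 * window) for t in tasks}

    def update(self, losses):
        """Record the latest loss of each task (dict: task name -> float)."""
        for task, loss in losses.items():
            self.history[task].append(float(loss))

    def _progress(self, task):
        """Recent mean loss divided by older mean loss; below 1 means the loss is dropping."""
        h = list(self.history[task])
        if len(h) < 2 * self.window:
            return 1.0  # not enough history yet: treat the task as neutral
        old = sum(h[:self.window]) / self.window
        new = sum(h[self.window:]) / self.window
        return new / max(old, 1e-12)

    def learning_rates(self):
        """Assumed rule: slower-than-average tasks get a larger rate, faster ones a smaller rate."""
        progress = {t: self._progress(t) for t in self.history}
        mean_progress = sum(progress.values()) / len(progress)
        rates = {}
        for task, p in progress.items():
            scale = p / max(mean_progress, 1e-12)
            scale = min(self.max_scale, max(self.min_scale, scale))
            rates[task] = self.base_lr * scale
        return rates


# Usage: the SAR loss falls quickly while the RGB loss stagnates.
scaler = LossHistoryLRScaler(["sar", "rgb"], base_lr=1e-4, window=2)
for sar_loss, rgb_loss in [(4.0, 2.0), (3.0, 2.0), (2.0, 2.0), (1.0, 2.0)]:
    scaler.update({"sar": sar_loss, "rgb": rgb_loss})
rates = scaler.learning_rates()  # the stagnating RGB task receives the larger rate
```

In a real training loop the returned rates would be written into the optimizer's parameter groups for the corresponding submodules. Which submodules are rescaled, and in which direction, is a design choice the summary leaves open.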
Architecture
Figure 2: The overall architecture of SM3Det, illustrating the shared backbone with MoE layers and the separate task heads.
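Grid-level routing inside a sparse MoE layer can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation: the linear gate, the top-k selection with a softmax over the selected logits, the toy expert functions, and all names (`route_grid_cell`, `sparse_moe_layer`, `gate_weights`) are assumptions. The point it shows is that each spatial grid cell is routed independently, and only the selected experts are evaluated for that cell.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_grid_cell(feature, gate_weights, experts, top_k=2):
    """Route one grid cell's feature vector to its top_k experts.

    Returns the mixed output vector and the indices of the chosen experts.
    """
    # Router logits: one dot product per expert (assumed linear gate).
    logits = [sum(w * x for w, x in zip(row, feature)) for row in gate_weights]
    # Keep only the top_k experts; the others are never evaluated (sparsity).
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in chosen])
    output = [0.0] * len(feature)
    for weight, idx in zip(weights, chosen):
        expert_out = experts[idx](feature)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen


def sparse_moe_layer(feature_map, gate_weights, experts, top_k=2):
    """Apply routing independently to every cell of an H x W grid of feature vectors."""
    outputs, assignments = [], []
    for row in feature_map:
        routed = [route_grid_cell(cell, gate_weights, experts, top_k) for cell in row]
        outputs.append([out for out, _ in routed])
        assignments.append([chosen for _, chosen in routed])
    return outputs, assignments


# Usage: a 1 x 2 grid, three toy experts, one expert per cell.
experts = [
    lambda v: [2.0 * x for x in v],  # toy expert 0
    lambda v: [-x for x in v],       # toy expert 1
    lambda v: list(v),               # toy expert 2 (identity)
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
feature_map = [[[1.0, 0.0], [0.0, 1.0]]]
outputs, assignments = sparse_moe_layer(feature_map, gate_weights, experts, top_k=1)
```

Here the two cells are sent to different experts even though they belong to the same image, which is the behavior that distinguishes grid-level routing from routing a whole image or a whole modality to one expert.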
Evaluation Highlights
  • Outperforms the Single-Task (specialized) baseline by +1.45% mAP on the unified M2Det benchmark consisting of DOTA, SARDet-100K, and DroneVehicle datasets.
  • Achieves 74.30% mAP on the combined benchmark using the Swin-T backbone, surpassing the dense Multi-Task baseline of 71.95%.
  • The lightweight version reduces parameters significantly while maintaining high performance, demonstrating better parameter efficiency than maintaining separate models.
Breakthrough Assessment
7/10
Strong engineering contribution defining a new practical task (M2Det) and effectively solving the modality-interference problem in unified models using MoE and optimization tricks. While MoE is known, the grid-level application to remote sensing backbones is novel.