RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

Ziying Song, Guoxin Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Caiyan Jia, Feiyang Jia, Li Wang
Beijing Jiaotong University, Hebei University of Science and Technology, Tsinghua University, University of Macau, Beijing Institute of Technology
International Joint Conference on Artificial Intelligence (2024)
MM Benchmark Pretraining

📝 Paper Summary

Multi-modal 3D Object Detection | Robustness in Autonomous Driving | Visual Foundation Models (VFMs)
RoboFusion leverages the generalization of the Segment Anything Model (SAM) to robustify multi-modal 3D object detection against severe weather and sensor noise without relying on domain adaptation.
Core Problem
State-of-the-art multi-modal 3D detectors trained on 'clean' datasets fail to generalize to real-world out-of-distribution (OOD) scenarios involving severe weather (snow, fog) and sensor noise.
Why it matters:
  • Current methods achieve high performance on sunny benchmark datasets but degrade significantly in harsh environmental conditions common in real-world driving.
  • Existing domain adaptation techniques are limited by the size of the domain shift they can bridge and risk overfitting, so they struggle when the target domain differs significantly from the training domain.
Concrete Example: A detector trained on sunny KITTI data may miss cars entirely in heavy snow or fog because the image features are corrupted. RoboFusion uses SAM's robust features to maintain detection capability even when visual inputs are degraded by weather noise.
Key Novelty
SAM-driven Robust Adaptation (RoboFusion)
  • Adapts the Segment Anything Model (SAM) for autonomous driving (SAM-AD) and uses its image encoder to extract robust visual features that generalize better to noise.
  • Employs a Depth-Guided Wavelet Attention (DGWA) module to decompose features into frequency subbands, allowing the system to filter out high-frequency noise while preserving structure.
  • Uses an Adaptive Fusion mechanism that re-weights features based on self-attention to dynamically suppress modalities that are more heavily corrupted by noise.
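
The two ideas in the last two bullets can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (haar_dwt2, haar_idwt2, suppress_high_freq, fuse) are hypothetical, a fixed gain stands in for the learned, depth-guided attention over subbands, and a softmax over two scalar scores stands in for the self-attention re-weighting. It only shows the mechanics, assuming a one-level Haar wavelet on a single 2D feature map with even height and width.

```python
import math


def haar_dwt2(x):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four subbands (LL, LH, HL, HH), each half the size of x.
    LL holds the low-frequency content; the others hold detail.
    """
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, len(x), 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # local average (scaled)
            r_lh.append((a + b - c - d) / 2)  # change across rows
            r_hl.append((a - b + c - d) / 2)  # change across columns
            r_hh.append((a - b - c + d) / 2)  # diagonal change
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution 2D list."""
    out = []
    for i in range(len(ll)):
        top, bottom = [], []
        for j in range(len(ll[0])):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            top += [(s + p + q + r) / 2, (s + p - q - r) / 2]
            bottom += [(s - p + q - r) / 2, (s - p - q + r) / 2]
        out += [top, bottom]
    return out


def suppress_high_freq(x, gain=0.0):
    """Scale the three detail subbands by `gain` and reconstruct.

    Assumption: a fixed gain replaces the learned attention weights.
    gain=1.0 returns x unchanged; gain=0.0 keeps only the LL subband.
    """
    ll, lh, hl, hh = haar_dwt2(x)

    def scale(band):
        return [[gain * v for v in row] for row in band]

    return haar_idwt2(ll, scale(lh), scale(hl), scale(hh))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(lidar_feat, image_feat, lidar_score, image_score):
    """Weighted sum of two feature vectors of equal length.

    Assumption: the scalar scores are hypothetical reliability
    estimates; a lower score shrinks that modality's contribution.
    """
    w_lidar, w_image = softmax([lidar_score, image_score])
    return [w_lidar * l + w_image * i for l, i in zip(lidar_feat, image_feat)]
```

With gain=0.0, each 2x2 block of the reconstructed map collapses to its mean, which removes pixel-level noise at the cost of fine detail. In fuse, giving one modality a much lower score drives its weight toward zero, which is the suppression behavior the Adaptive Fusion bullet describes.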
Evaluation Highlights
  • +6.51% mAP over the TransFusion baseline on the KITTI-C (corrupted) benchmark.
  • +5.7% NDS over the TransFusion baseline on the nuScenes-C (corrupted) benchmark.
  • Achieves state-of-the-art performance on noisy datasets while remaining competitive on clean datasets.
Breakthrough Assessment
7/10
Strong application of foundation models to improve robustness in 3D detection. The combination of SAM, wavelet denoising, and adaptive fusion is a novel architectural recipe for OOD resilience.