RoboFusion: Towards Robust Multi-Modal 3D Object Detection via SAM

📝 Paper Summary

Multi-modal 3D Object Detection, Robustness in Autonomous Driving, Visual Foundation Models (VFMs)

RoboFusion leverages the generalization of the Segment Anything Model (SAM) to robustify multi-modal 3D object detection against severe weather and sensor noise without relying on domain adaptation.

Core Problem

State-of-the-art multi-modal 3D detectors trained on 'clean' datasets fail to generalize to real-world out-of-distribution (OOD) scenarios involving severe weather (snow, fog) and sensor noise.

Why it matters:

Current methods achieve high performance on sunny benchmark datasets but degrade significantly in harsh environmental conditions common in real-world driving.
Existing domain adaptation techniques suffer from domain shift limitations and overfitting risks, struggling when target domain differences are significant.

Concrete Example: A detector trained on sunny KITTI data may miss cars entirely in heavy snow or fog because the image features are corrupted. RoboFusion uses SAM's robust features to maintain detection capability even when visual inputs are degraded by weather noise.

Key Novelty

SAM-driven Robust Adaptation (RoboFusion)

Adapts the Segment Anything Model (SAM) for autonomous driving (SAM-AD) and uses its image encoder to extract robust visual features that generalize better to noise.
Employs a Depth-Guided Wavelet Attention (DGWA) module to decompose features into frequency subbands, allowing the system to filter out high-frequency noise while preserving structure.
Uses an Adaptive Fusion mechanism that re-weights features based on self-attention to dynamically suppress modalities that are more heavily corrupted by noise.

Evaluation Highlights

+6.51 mAP points (absolute) on the KITTI-C (corrupted) benchmark compared to the TransFusion baseline.
+5.7 NDS points (absolute) on the nuScenes-C (corrupted) benchmark compared to the TransFusion baseline.
Achieves SOTA performance on noisy datasets while maintaining competitive performance on clean datasets.

Breakthrough Assessment

7/10

Strong application of foundation models to improve robustness in 3D detection. The combination of SAM, wavelet denoising, and adaptive fusion is a novel architectural recipe for OOD resilience.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal 3D object detection under Out-of-Distribution (OOD) noise conditions (e.g., weather corruptions)

Inputs: LiDAR point clouds P and Camera images I (potentially corrupted by noise/weather)

Outputs: 3D Bounding Boxes (location, size, orientation, class)

Pipeline Flow

Image Branch: SAM-AD Encoder → AD-FPN → DGWA Denoising
LiDAR Branch: Voxel/Point Encoder
Fusion: Adaptive Fusion Module → Detection Head

System Modules

SAM-AD Encoder (Image Feature Extraction)

Extract robust visual embeddings from input images using a pre-trained foundation model

Model or implementation: ViT-based Image Encoder from SAM (pre-trained on AD datasets)

AD-FPN (Image Feature Extraction)

Upsample SAM's single-scale embedding into multi-scale features suitable for detection

Model or implementation: Feature Pyramid Network adapted for ViT features
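
As a rough illustration of turning one single-scale feature map into a pyramid, the sketch below resamples a grid to three resolutions. It assumes nearest-neighbour upsampling and 2x2 average pooling on a single channel. The function names are hypothetical, and the learned layers of the actual AD-FPN are not reproduced here.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D grid (list of rows)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def avgpool2x(grid):
    """2x2 average pooling; assumes even height and width."""
    return [
        [(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]


def simple_pyramid(single_scale):
    """Return fine, native, and coarse versions of one feature map."""
    return [upsample2x(single_scale), single_scale, avgpool2x(single_scale)]


# Hypothetical 4x4 single-channel map standing in for a ViT embedding.
feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
levels = simple_pyramid(feature)
shapes = [(len(level), len(level[0])) for level in levels]
print(shapes)  # [(8, 8), (4, 4), (2, 2)]
```

A detection head that expects several strides can then consume each level separately, which is the gap the summary says AD-FPN is meant to bridge.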

DGWA (Depth-Guided Wavelet Attention)

Denoise image features using depth priors and frequency decomposition

Model or implementation: Wavelet Transform + Attention + Depth Encoder
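
To make the frequency-decomposition idea concrete, here is a minimal sketch of a single-level 2D Haar wavelet transform that splits a feature map into one low-frequency subband and three detail subbands, scales the details, and reconstructs. The fixed scalar `gain` is an assumption standing in for the learned attention, and the depth guidance is omitted entirely; all function names are hypothetical.

```python
def haar_dwt2(grid):
    """Single-level 2D Haar transform of an even-sized grid (list of rows)."""
    low, diff_cols, diff_rows, diff_diag = [], [], [], []
    for i in range(0, len(grid), 2):
        lo, dc, dr, dd = [], [], [], []
        for j in range(0, len(grid[0]), 2):
            a, b = grid[i][j], grid[i][j + 1]
            c, d = grid[i + 1][j], grid[i + 1][j + 1]
            lo.append((a + b + c + d) / 2)  # low-frequency (coarse) part
            dc.append((a - b + c - d) / 2)  # left column minus right column
            dr.append((a + b - c - d) / 2)  # top row minus bottom row
            dd.append((a - b - c + d) / 2)  # diagonal difference
        low.append(lo)
        diff_cols.append(dc)
        diff_rows.append(dr)
        diff_diag.append(dd)
    return low, (diff_cols, diff_rows, diff_diag)


def haar_idwt2(low, details):
    """Inverse of haar_dwt2."""
    diff_cols, diff_rows, diff_diag = details
    out = []
    for i in range(len(low)):
        top, bottom = [], []
        for j in range(len(low[0])):
            ll, dc, dr, dd = low[i][j], diff_cols[i][j], diff_rows[i][j], diff_diag[i][j]
            top += [(ll + dc + dr + dd) / 2, (ll - dc + dr - dd) / 2]
            bottom += [(ll + dc - dr - dd) / 2, (ll - dc - dr + dd) / 2]
        out += [top, bottom]
    return out


def attenuate_details(grid, gain):
    """Scale the three high-frequency subbands by `gain`, then reconstruct."""
    low, details = haar_dwt2(grid)
    scaled = tuple([[gain * v for v in row] for row in band] for band in details)
    return haar_idwt2(low, scaled)


noisy = [[1.0, 1.0, 2.0, 2.0],
         [1.0, 9.0, 2.0, 2.0],   # 9.0 is an isolated spike
         [3.0, 3.0, 4.0, 4.0],
         [3.0, 3.0, 4.0, 4.0]]
smoothed = attenuate_details(noisy, gain=0.0)
print(smoothed[1][1])  # 3.0, the mean of the top-left 2x2 block
```

With `gain=1.0` the input is reconstructed unchanged; intermediate values trade off noise suppression against loss of fine structure.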

Adaptive Fusion

Fuse LiDAR and Image features with dynamic re-weighting to suppress noisy modalities

Model or implementation: Self-Attention Mechanism
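
The re-weighting idea can be sketched with plain scaled dot-product self-attention over two modality tokens (one LiDAR feature vector, one image feature vector). This is only a toy under stated assumptions: the query, key, and value projections are taken to be the identity, the function names are hypothetical, and the paper's actual fusion module is not reproduced.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (assumption)."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    attended = [
        [sum(w * tok[c] for w, tok in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]
    return attended, weights


def fuse(lidar_feat, image_feat):
    """Re-weight both modality tokens, then concatenate them."""
    attended, weights = attend([lidar_feat, image_feat])
    return attended[0] + attended[1], weights


# Hypothetical 2-dimensional features for illustration only.
fused, weights = fuse([2.0, 0.0], [0.1, 0.1])
print(len(fused), [round(sum(row), 6) for row in weights])
```

Each row of `weights` sums to one, so a modality whose features contribute little to the attention scores receives a correspondingly smaller share of the fused representation.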

Novel Architectural Elements

AD-FPN: Designed to bridge the gap between SAM's single-scale ViT output and multi-scale detection requirements
DGWA: Integration of discrete wavelet transforms within the feature extraction pipeline to explicitly filter high-frequency noise
Adaptive Fusion: Self-attention based re-weighting specifically designed to handle imbalanced corruption across modalities

Modeling

Base Model: SAM (Segment Anything Model) Image Encoder (ViT-based)

Training Method: Pre-training SAM on AD datasets (SAM-AD) via Masked Auto-Encoding (MAE), followed by detection training

Objective Functions:

Purpose: Pre-train SAM on driving data to adapt it to the autonomous driving domain.

Formally: Reconstruction loss (MAE style) on masked image patches.

Purpose: Train the object detector.

Formally: Standard 3D detection losses (e.g., heatmap loss, regression loss) inherited from base detectors like TransFusion.
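
The MAE-style pre-training objective can be sketched as random patch masking followed by a mean squared error computed only on the masked patches. The 0.75 mask ratio comes from the hyperparameters listed in this summary; the function names, the seed, and the flat list-of-patches layout are assumptions made for illustration.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for pred_patch, target_patch, is_masked in zip(pred, target, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred_patch, target_patch))
        count += len(target_patch)
    if count == 0:
        raise ValueError("no masked patches to score")
    return total / count


# 16 hypothetical patches of 4 pixels each.
target = [[float(i)] * 4 for i in range(16)]
mask = random_mask(len(target))
print(sum(mask))  # 12 of 16 patches are masked at ratio 0.75
```

Scoring only the masked patches forces the encoder to infer hidden content from visible context rather than copy its input.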

Training Data:

Pre-training: KITTI and nuScenes images + generated noise images (rain, snow, fog, sunlight)
Detection Training: Standard KITTI and nuScenes training splits

Key Hyperparameters:

mask_ratio: 0.75 (for SAM-AD pre-training)
pre_training_epochs: 400
gpus: 8 NVIDIA A100

Compute: Pre-training: 400 epochs on 8 NVIDIA A100 GPUs

Comparison to Prior Work

vs. TransFusion/BEVFusion: RoboFusion explicitly targets OOD noise robustness using VFM features and wavelet denoising, whereas baselines optimize for clean data performance.
vs. SAM3D: RoboFusion is multi-modal and integrates SAM features deeply, whereas SAM3D is LiDAR-only and only uses SAM for 2D-to-3D projection [not cited in paper as direct baseline, but mentioned in related work].

Limitations

Using a foundation model (SAM) as the feature extractor likely makes inference more expensive than with standard backbones; the paper does not quantify inference time.
Relies on the quality of synthetic noise generation for the pre-training phase (SAM-AD).

Reproducibility

Code: https://github.com/adept-thu/RoboFusion

The paper describes how SAM-AD is pre-trained: with YOLOv8 for the FastSAM variant and with MAE for standard SAM.

📊 Experiments & Results

Evaluation Setup

Evaluation on corrupted datasets (KITTI-C, nuScenes-C) containing synthetic weather and sensor noise.

Benchmarks:

KITTI-C (3D Object Detection under corruption)
nuScenes-C (3D Object Detection under corruption)

Metrics:

mAP (Mean Average Precision)
NDS (NuScenes Detection Score)
Statistical methodology: Not explicitly reported in the paper
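
For reference, NDS as defined by the nuScenes benchmark weights mAP by five and adds five true-positive scores, each derived from a mean error (translation, scale, orientation, velocity, attribute) clipped at 1. The sketch below computes it; the function name and the example values are hypothetical and are not results from the paper.

```python
def nds(mean_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean TP errors."""
    if len(tp_errors) != 5:
        raise ValueError("expected five errors: ATE, ASE, AOE, AVE, AAE")
    # Each error is clipped at 1 and converted into a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs for illustration only.
print(nds(0.5, [0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.5
```

Because half of the score comes from error terms rather than mAP, NDS and mAP can move by different amounts under the same corruption.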

Key Results

RoboFusion demonstrates superior robustness on the KITTI-C benchmark compared to state-of-the-art baselines.

Benchmark	Metric	Baseline	This Paper	Δ
KITTI-C	mAP (Moderate)	59.39	65.90	+6.51
KITTI-C	mAP (Moderate)	60.42	65.90	+5.48

RoboFusion maintains SOTA performance on the nuScenes-C benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
nuScenes-C	NDS	49.6	55.3	+5.7
nuScenes-C	mAP	26.9	35.8	+8.9
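
The Δ values in the rows above are absolute differences in metric points. A short check, using only the numbers reported in those rows, re-derives them and also computes the relative gains for comparison; the variable names are illustrative.

```python
# (benchmark, metric, baseline, this paper), copied from the Key Results rows.
rows = [
    ("KITTI-C", "mAP (Moderate)", 59.39, 65.90),
    ("KITTI-C", "mAP (Moderate)", 60.42, 65.90),
    ("nuScenes-C", "NDS", 49.6, 55.3),
    ("nuScenes-C", "mAP", 26.9, 35.8),
]

# Absolute improvement in metric points.
deltas = [round(ours - base, 2) for _, _, base, ours in rows]
print(deltas)  # [6.51, 5.48, 5.7, 8.9]

# Relative improvement in percent of the baseline value.
relative = [round(100 * (ours - base) / base, 1) for _, _, base, ours in rows]
print(relative)
```

The relative gains are larger than the point differences because the corrupted-data baselines start from low absolute scores.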

Main Takeaways

Standard SOTA methods (TransFusion, BEVFusion) degrade significantly in noisy/weather conditions.
RoboFusion successfully mitigates this degradation, achieving SOTA on corrupted benchmarks (KITTI-C, nuScenes-C).
The combination of SAM features, wavelet denoising, and adaptive fusion provides a robust solution for OOD scenarios without requiring explicit domain adaptation techniques.

📚 Prerequisite Knowledge

Prerequisites

Multi-modal 3D Object Detection architectures (e.g., TransFusion, BEVFusion)
Visual Foundation Models (ViT, SAM)
Wavelet Transforms (DWT/IDWT) for signal processing
Attention mechanisms

Key Terms

SAM: Segment Anything Model—a visual foundation model trained on 11 million images, known for zero-shot generalization

SAM-AD: A version of SAM pre-trained by the authors on autonomous driving datasets (KITTI, nuScenes) using masked auto-encoding

OOD: Out-of-Distribution—scenarios significantly different from the training data, such as severe weather for a model trained on sunny days

mAP: Mean Average Precision—a key metric for object detection accuracy

NDS: NuScenes Detection Score, a composite 3D detection metric that combines mAP with translation, scale, orientation, velocity, and attribute errors

DWT: Discrete Wavelet Transform—a mathematical tool that decomposes a signal into low-frequency (coarse) and high-frequency (detail/noise) components

FPN: Feature Pyramid Network—a structure that generates multi-scale feature maps from a single input resolution

ViT: Vision Transformer—a transformer-based architecture for computer vision tasks

KITTI-C: A corrupted version of the KITTI dataset with synthetically added weather and sensor noise for robustness testing

nuScenes-C: A corrupted version of the nuScenes dataset with synthetically added weather and sensor noise