DisasterM3: A Remote Sensing Vision-Language Dataset for Disaster Damage Assessment and Response

📝 Paper Summary

Remote Sensing, Vision-Language Models (VLMs), Disaster Response

DisasterM3 is a large-scale, multi-sensor dataset designed to evaluate and train Vision-Language Models on complex disaster assessment tasks, ranging from damage counting to generating actionable rescue reports.

Core Problem

General-purpose and existing remote sensing VLMs fail in disaster scenarios for three reasons: they lack domain-specific training data, they cannot process multi-sensor inputs (SAR + Optical), and they perform poorly on fine-grained counting tasks.

Why it matters:

Disasters require rapid, accurate damage assessment (e.g., counting collapsed buildings) to guide rescue teams, which current generic models cannot reliably provide
Extreme weather often blocks optical sensors, necessitating the use of Synthetic Aperture Radar (SAR), but most VLMs are trained primarily on optical imagery
Current remote sensing datasets focus on general geospatial tasks (classification/captioning) rather than the complex reasoning and reporting needed for emergency response

Concrete Example: In an earthquake scene, generic VLMs like LLaVA-1.5 struggle to distinguish collapsed from intact buildings and fail to reason about safe rescue routes. When given a post-disaster SAR image (which looks like noise to untrained eyes), standard models perform much worse than they do on optical images because of the modality gap.

Key Novelty

Multi-Hazard, Multi-Sensor, Multi-Task Disaster Benchmark

Curates 26,988 bi-temporal image pairs (pre/post-disaster) across 36 disaster events, integrating both Optical and Synthetic Aperture Radar (SAR) data to handle weather obstructions
Defines 9 distinct tasks ranging from basic recognition to complex reasoning (e.g., finding optimal rescue paths) and long-form report generation, creating a full pipeline for disaster response AI

Architecture

The dataset construction and annotation pipeline.

Evaluation Highlights

Fine-tuning Qwen2.5-VL-7B on DisasterM3 improves average accuracy on Question Answering tasks by up to 10.4 percentage points over the base model.
For Referring Segmentation, fine-tuned PSALM gains 40.8 points of mIoU (mean Intersection over Union) on optical data over the baseline.
Integrating SAR imagery remains a challenge: performance on post-disaster SAR images is significantly lower than optical images (e.g., ~38% vs ~64% accuracy for Qwen2.5-VL-7B), highlighting the cross-sensor gap.

Breakthrough Assessment

8/10

Significant contribution to the specialized domain of disaster response. The inclusion of SAR data and complex reasoning tasks (like report generation) pushes the boundary for Remote Sensing VLMs beyond simple classification.

⚙️ Technical Details

Problem Definition

Setting: Multi-task vision-language understanding for disaster scenes using bi-temporal (pre/post) and multi-sensor (Optical/SAR) satellite imagery.

Inputs: Image I (Optical or SAR, pre or post-disaster) and textual instruction T.

Outputs: Text response R (for QA/Captioning) or Segmentation Mask M (for Referring Segmentation).
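
The input/output definition above can be pictured as a single record type. The following is a minimal sketch only: the class name, field names, and example values are hypothetical and are not taken from the released dataset format.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class DisasterSample:
    """One instruction-following example (hypothetical schema)."""
    image_path: str                       # image I
    sensor: Literal["optical", "sar"]     # sensor type of I
    phase: Literal["pre", "post"]         # acquired before or after the disaster
    instruction: str                      # textual instruction T
    response: Optional[str] = None        # text response R (QA, captioning, reports)
    mask_path: Optional[str] = None       # segmentation mask M (referring segmentation)

    def output_kind(self) -> str:
        """Return which of the two output types this sample carries."""
        if (self.response is None) == (self.mask_path is None):
            raise ValueError("expected exactly one of response or mask_path")
        return "text" if self.response is not None else "mask"


# Illustrative values only; not a real sample from the dataset.
sample = DisasterSample(
    image_path="scene_post.tif",
    sensor="sar",
    phase="post",
    instruction="How many buildings are destroyed?",
    response="12",
)
print(sample.output_kind())  # text
```

Keeping the text response and the mask as mutually exclusive optional fields mirrors the two output types listed above.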

Pipeline Flow

Data Collection (Optical/SAR images)
Annotation Pipeline (Experts + GPT-4o)
Fine-tuning (Qwen2.5-VL, InternVL3, LISA, PSALM)
Evaluation (9 distinct tasks)

System Modules

Data Processor

Pre-processes and registers bi-temporal Optical and SAR images to 0.8m resolution

Model or implementation: Standard image processing tools (ArcGIS, etc.)

Instruction Generator

Generates diverse textual instructions for 9 tasks

Model or implementation: GPT-4o (assisted by experts)

VLM Backbone

Processes image-text pairs to generate answers or masks

Model or implementation: Varied (Qwen2.5-VL, InternVL3, LISA, PSALM)

Novel Architectural Elements

Multi-modal input integration (Optical + SAR) within a unified instruction-tuning framework for disaster tasks [data-centric innovation rather than pure architecture]

Modeling

Base Model: Qwen2.5-VL-7B and InternVL3-8B (for QA); LISA and PSALM (for Segmentation)

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: LoRA (Low-Rank Adaptation) is used for InternVL3; whether the other models use LoRA or full fine-tuning is not stated explicitly

Trainable Parameters: Not reported per model; LoRA is mentioned only for InternVL3
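
Although trainable parameter counts are not reported, the reason LoRA keeps them small is easy to show. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors, B (d_out x r) and A (r x d_in). The sketch below counts parameters for one such layer; the layer size and rank are hypothetical, since the paper's LoRA settings are not given here.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix the adapter sits beside."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)

print(lora)                         # 131072
print(full)                         # 16777216
print(f"{100 * lora / full:.2f}%")  # 0.78%
```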

Training Data:

Instruct set: 17,190 Optical, 3,798 SAR images, 92,968 instruction pairs
Bench set: 5,024 Optical, 976 SAR images, 30,042 instruction pairs
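
As a consistency check, the split sizes above can be summed and compared with the 26,988 figure quoted under Key Novelty. This is plain arithmetic on the numbers listed in this summary; the dictionary layout is just a convenience.

```python
# Counts copied from the Instruct and Bench lines above.
splits = {
    "instruct": {"optical": 17_190, "sar": 3_798, "instructions": 92_968},
    "bench": {"optical": 5_024, "sar": 976, "instructions": 30_042},
}


def image_count(split: dict) -> int:
    """Optical plus SAR images in one split."""
    return split["optical"] + split["sar"]


total_images = sum(image_count(s) for s in splits.values())
total_instructions = sum(s["instructions"] for s in splits.values())
sar_share = sum(s["sar"] for s in splits.values()) / total_images

print(total_images)                # 26988
print(total_instructions)          # 123010
print(round(100 * sar_share, 1))   # 17.7
```

The totals match the headline count, and SAR makes up roughly one sixth of the imagery.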

Key Hyperparameters:

learning_rate: 2e-5 (InternVL3), 1e-5 (Qwen2.5-VL)
batch_size: 128 (InternVL3), 16 (Qwen2.5-VL)
epochs: 10 (InternVL3), 2 (Qwen2.5-VL)
max_length: 4096 (InternVL3), 1536 (Qwen2.5-VL)
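
The listed hyperparameters can be collected into a small config object to compare the two recipes. This is a sketch under stated assumptions: the class and function names are hypothetical, batch_size is treated as the global batch size, and each of the 92,968 instruction pairs is treated as one training example. None of these assumptions is confirmed by the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed above (hypothetical container)."""
    learning_rate: float
    batch_size: int
    epochs: int
    max_length: int


CONFIGS = {
    "InternVL3": FineTuneConfig(learning_rate=2e-5, batch_size=128, epochs=10, max_length=4096),
    "Qwen2.5-VL": FineTuneConfig(learning_rate=1e-5, batch_size=16, epochs=2, max_length=1536),
}


def optimizer_steps(cfg: FineTuneConfig, num_examples: int) -> int:
    """Approximate optimizer steps, assuming batch_size is the global batch size."""
    steps_per_epoch = -(-num_examples // cfg.batch_size)  # ceiling division
    return steps_per_epoch * cfg.epochs


for name, cfg in CONFIGS.items():
    print(name, optimizer_steps(cfg, 92_968))
# InternVL3 7270
# Qwen2.5-VL 11622
```

Under these assumptions the Qwen2.5-VL recipe takes more, smaller steps over fewer epochs, while the InternVL3 recipe takes fewer, larger steps over more epochs.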

Compute: 8x NVIDIA A100 GPUs (InternVL3), 8x NVIDIA H800 GPUs (Qwen2.5-VL)

Comparison to Prior Work

vs. GeoChat: DisasterM3 focuses on disaster-specific tasks (damage counting, rescue reasoning) and includes SAR data, whereas GeoChat is for general RS tasks.
vs. FloodNet: DisasterM3 covers 10 disaster types and 9 tasks, while FloodNet is limited to floods and simple VQA.
vs. xBD: DisasterM3 adds language instructions and reasoning tasks to the visual damage classification of xBD.

Limitations

Significant performance drop on SAR imagery compared to Optical, indicating a persistent modality gap.
Fine-tuned models (e.g., InternVL3) show signs of overfitting on counting tasks for sparse/dense extremes.
Imbalanced sample distribution for certain damage types (e.g., destroyed vs. intact).

Reproducibility

Code: https://github.com/Junjue-Wang/DisasterM3

Dataset and code are publicly available at https://github.com/Junjue-Wang/DisasterM3. The paper provides detailed data composition and fine-tuning hyperparameters.

📊 Experiments & Results

Evaluation Setup

Zero-shot and Fine-tuned evaluation on the DisasterM3 Bench set.

Benchmarks:

DisasterM3 Bench: multi-task evaluation covering Recognition, Counting, Reasoning, and Reporting [New]

Metrics:

Accuracy (%) for multiple-choice QA
GPT-4 score (1-5) for open-ended reports
mIoU and cIoU for Referring Segmentation
Statistical methodology: Not explicitly reported in the paper
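
For the multiple-choice metric, a minimal sketch of per-task accuracy and its average is shown below. The record format and function names are hypothetical, and the summary does not say whether "Average Accuracy" is averaged over tasks or over individual questions, so both are computed for comparison.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """records: iterable of (task, predicted_option, correct_option) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for task, pred, gold in records:
        totals[task] += 1
        # Compare option letters case-insensitively, ignoring surrounding spaces.
        hits[task] += int(pred.strip().upper() == gold.strip().upper())
    return {task: 100.0 * hits[task] / totals[task] for task in totals}


def macro_average(acc_by_task):
    """Unweighted mean of per-task accuracies."""
    return sum(acc_by_task.values()) / len(acc_by_task)


# Toy predictions, for illustration only.
records = [
    ("counting", "A", "A"),
    ("counting", "b", "C"),
    ("recognition", "B", "B"),
    ("recognition", " d", "D"),
    ("recognition", "A", "A"),
]
acc = per_task_accuracy(records)
micro = 100.0 * sum(p.strip().upper() == g.strip().upper() for _, p, g in records) / len(records)

print(acc)                 # {'counting': 50.0, 'recognition': 100.0}
print(macro_average(acc))  # 75.0
print(micro)               # 80.0
```

The two averages differ whenever tasks have unequal numbers of questions, which is why the averaging convention matters when comparing reported numbers.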

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Fine-tuning generic VLMs on DisasterM3 leads to substantial gains across QA tasks.
DisasterM3 Bench	Average Accuracy	58.1	68.5	+10.4
DisasterM3 Bench	Average Accuracy	60.4	68.9	+8.5
Referring segmentation models also see massive improvements after fine-tuning.
DisasterM3 Bench	mIoU	11.1	51.9	+40.8
DisasterM3 Bench	mIoU	4.4	42.0	+37.6
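The gap between Optical and SAR performance remains large, even after fine-tuning (in the row below, the two score columns compare optical against SAR rather than base against fine-tuned).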
DisasterM3 Bench	Average Accuracy	68.5	38.6	-29.9
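
The Δ column holds absolute differences in points, not relative percentages. The short check below recomputes each Δ from the table values and also prints the relative gain, which is much larger when the baseline is low. Row labels are omitted because the table does not name the model behind each row.

```python
# (metric, baseline, this paper) copied from the fine-tuning rows above.
rows = [
    ("Average Accuracy", 58.1, 68.5),
    ("Average Accuracy", 60.4, 68.9),
    ("mIoU", 11.1, 51.9),
    ("mIoU", 4.4, 42.0),
]

deltas = [round(tuned - base, 1) for _, base, tuned in rows]
relative = [100 * (tuned - base) / base for _, base, tuned in rows]

print(deltas)  # [10.4, 8.5, 40.8, 37.6]
for (metric, base, tuned), rel in zip(rows, relative):
    print(f"{metric}: {base} -> {tuned} is a {rel:.0f}% relative gain")

# Optical versus SAR row: the second score is lower, so the difference is negative.
sar_gap = round(38.6 - 68.5, 1)
print(sar_gap)  # -29.9
```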

Experiment Figures

Impact of prompt variation and building density on model accuracy.

Main Takeaways

Current state-of-the-art VLMs (even large ones like GPT-4o) struggle with specialized disaster tasks, particularly counting and reasoning.
Fine-tuning on DisasterM3 provides a strong baseline, significantly improving performance and stability across varied prompts.
There is a critical need for better multi-modal alignment strategies to effectively utilize SAR imagery, which is crucial for all-weather disaster response.
The dataset exposes model biases, such as performance degradation in very dense or very sparse damage scenarios (overfitting risks).

📚 Prerequisite Knowledge

Prerequisites

Basics of Vision-Language Models (VLMs)
Remote Sensing imagery characteristics (Optical vs. SAR)
Standard evaluation metrics for segmentation (IoU) and text generation

Key Terms

SAR: Synthetic Aperture Radar—an active remote sensing technology that creates images by bouncing radar signals off the earth; unlike optical cameras, it sees through clouds and at night

Bi-temporal: Using two images of the same location taken at different times (pre-disaster and post-disaster) to identify changes

mIoU: mean Intersection over Union, the area where prediction and ground truth overlap divided by the area of their union, averaged over samples or classes; 100% means perfect overlap with ground truth

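cIoU: cumulative Intersection over Union, the total intersection area divided by the total union area accumulated over all evaluated samples; used here for referring segmentation, where it weights large objects more heavily than a per-sample average does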

Grounding: The ability of a model to link textual concepts (e.g., 'damaged building') to specific pixels or regions in an image

Referring Segmentation: A task where the model must segment a specific object in an image described by a natural language expression

Optical imagery: Standard satellite photos capturing visible light (like a camera)

VLM: Vision-Language Model—AI models trained to understand and generate content based on both images and text

Instruction Tuning: Fine-tuning a model on pairs of (instruction, output) to improve its ability to follow user commands
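
To make the difference between the mIoU and cIoU entries above concrete, here is a small pure-Python sketch on toy binary masks. It assumes mIoU means the average of per-sample IoUs, which this summary does not spell out, and scoring an empty union as 1.0 is also an assumption. The function names are illustrative.

```python
def iou_terms(pred, gold):
    """Return (intersection, union) pixel counts for two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for pred_row, gold_row in zip(pred, gold):
        for p, g in zip(pred_row, gold_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    return inter, union


def mean_iou(pairs):
    """Average of per-sample IoU; an empty union is scored as 1.0 (assumption)."""
    scores = []
    for pred, gold in pairs:
        inter, union = iou_terms(pred, gold)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def cumulative_iou(pairs):
    """Total intersection over total union, accumulated across all samples."""
    total_inter = total_union = 0
    for pred, gold in pairs:
        inter, union = iou_terms(pred, gold)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


# Toy masks: a missed small object and a perfectly segmented large one.
small_miss = ([[1, 0], [0, 0]], [[0, 1], [0, 0]])
large_hit = ([[1, 1], [1, 1]], [[1, 1], [1, 1]])
pairs = [small_miss, large_hit]

print(mean_iou(pairs))                  # 0.5
print(round(cumulative_iou(pairs), 3))  # 0.667
```

The missed small object halves the per-sample average but costs the cumulative score less, because the large object contributes more pixels to both totals.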