← Back to Paper List

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

Vishal Nedungadi, A. Kariryaa, Stefan Oehmcke, Serge J. Belongie, Christian Igel, Nico Lang
University of Copenhagen, Pioneer Centre for AI
European Conference on Computer Vision (2024)
MM Pretraining Benchmark

📝 Paper Summary

Earth Observation (EO) Remote Sensing Self-Supervised Learning (SSL)
MMEarth introduces a global-scale multi-modal Earth Observation dataset and a Multi-Pretext Masked Autoencoder (MP-MAE) that leverages diverse sensor data during pretraining to learn better representations for optical satellite imagery.
Core Problem
Self-supervised models trained on natural images (like ImageNet) or single-modality satellite data often fail to capture the complex semantic relationships needed for diverse Earth Observation tasks.
Why it matters:
  • Earth Observation applications (e.g., mapping carbon stocks, species abundance) critically lack labeled training data due to the high cost of expert field measurements.
  • Existing pretraining methods typically ignore the vast potential of automatically aligned multi-modal sensor data (radar, elevation, climate) available at global scale.
  • Models specialized for optical imagery struggle to generalize to tasks requiring understanding of physical properties not explicitly visible in RGB (e.g., canopy height or temperature).
Concrete Example: When predicting canopy height from an optical satellite image, a standard MAE trained only on optical data might miss structural cues. In contrast, MP-MAE pretrains by reconstructing hidden modalities like LiDAR-derived canopy height from the optical input, forcing the encoder to learn structural features explicitly.
Key Novelty
Multi-Pretext Masked Autoencoder (MP-MAE) on the MMEarth Dataset
  • Constructs MMEarth, a massive dataset aligning 12 modalities (optical, SAR, DEM, climate, etc.) across 1.2 million global locations, matched by space and time.
  • Proposes MP-MAE, which extends ConvNeXt V2 to simultaneously reconstruct multiple diverse modalities (not just RGB) from a masked optical input during pretraining.
  • Treats non-visual data (climate, location, time) as 'image-level' modalities to be predicted alongside 'pixel-level' maps, enriching the semantic representation of the optical encoder.
Architecture
Architecture Figure Not explicitly labeled as Figure 1 in summary text, but implied architecture description exists.
The MP-MAE framework using a ConvNeXt V2 encoder and multi-head decoder.
Evaluation Highlights
  • +3.4% Top-1 accuracy on Sentinel-2 land cover classification (So2Sat) compared to ImageNet-pretrained baseline.
  • +5.7% mIoU on semantic segmentation (Neon Tree) compared to ImageNet-pretrained baseline.
  • Consistently outperforms domain-specific baselines (e.g., SatMAE, Satlas) on linear probing tasks, demonstrating superior feature generalization.
Breakthrough Assessment
8/10
Significant contribution in releasing a harmonized, ImageNet-scale multi-modal EO dataset. The MP-MAE method effectively demonstrates that predicting unseen modalities is a powerful pretext task for remote sensing.
×