MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

📝 Paper Summary

Earth Observation (EO), Remote Sensing, Self-Supervised Learning (SSL)

MMEarth introduces a global-scale multi-modal Earth Observation dataset and a Multi-Pretext Masked Autoencoder (MP-MAE) that leverages diverse sensor data during pretraining to learn better representations for optical satellite imagery.

Core Problem

Self-supervised models trained on natural images (like ImageNet) or single-modality satellite data often fail to capture the complex semantic relationships needed for diverse Earth Observation tasks.

Why it matters:

Earth Observation applications (e.g., mapping carbon stocks, species abundance) critically lack labeled training data due to the high cost of expert field measurements.
Existing pretraining methods typically ignore the vast potential of automatically aligned multi-modal sensor data (radar, elevation, climate) available at global scale.
Models specialized for optical imagery struggle to generalize to tasks requiring understanding of physical properties not explicitly visible in RGB (e.g., canopy height or temperature).

Concrete Example: When predicting canopy height from an optical satellite image, a standard MAE trained only on optical data might miss structural cues. In contrast, MP-MAE pretrains by reconstructing hidden modalities like LiDAR-derived canopy height from the optical input, forcing the encoder to learn structural features explicitly.

Key Novelty

Multi-Pretext Masked Autoencoder (MP-MAE) on the MMEarth Dataset

Constructs MMEarth, a dataset that aligns 12 modalities (optical, SAR, DEM, climate, etc.) across 1.2 million locations worldwide, matched in space and time.
Proposes MP-MAE, which extends ConvNeXt V2 to simultaneously reconstruct multiple diverse modalities (not just RGB) from a masked optical input during pretraining.
Treats non-visual data (climate, location, time) as 'image-level' modalities to be predicted alongside 'pixel-level' maps, enriching the semantic representation of the optical encoder.

Architecture

Figure: The MP-MAE framework, with a ConvNeXt V2 encoder and a multi-head decoder.

Evaluation Highlights

+3.4 percentage points Top-1 accuracy on Sentinel-2 land cover classification (So2Sat; 57.2 vs. 53.8) over the ImageNet-pretrained baseline.
+5.7 percentage points mIoU on semantic segmentation (Neon Tree; 45.6 vs. 39.9) over the ImageNet-pretrained baseline.
Consistently outperforms domain-specific baselines (e.g., SatMAE, Satlas) on linear probing tasks, demonstrating superior feature generalization.

Breakthrough Assessment

8/10

Significant contribution in releasing a harmonized, ImageNet-scale multi-modal EO dataset. The MP-MAE method effectively demonstrates that predicting unseen modalities is a powerful pretext task for remote sensing.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised pretraining of an optical satellite image encoder using multi-modal pretext tasks

Inputs: Optical satellite image patches (Sentinel-2)

Outputs: Reconstructions of multiple co-located modalities (e.g., SAR, DEM, Climate variables)

Pipeline Flow

Input Processing (Masking & Encoding)
Decoder Processing (Multi-Head Reconstruction)

System Modules

Sparse Encoder

Encodes visible patches of the optical Sentinel-2 image into a latent representation

Model or implementation: ConvNeXt V2 (based on sparse convolutions)

Multi-Head Decoder

Reconstructs different modalities from the shared latent representation

Model or implementation: ConvNeXt V2 Decoder with modality-specific heads

Novel Architectural Elements

Extension of ConvNeXt V2 FCMAE to support multi-modal reconstruction targets (Pixel-level and Image-level)
Integration of scalar regression heads (for climate/location) alongside dense reconstruction heads in an MAE framework
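The split between pixel-level and image-level targets can be sketched as a small registry. This is an illustrative sketch only: the `PretextTarget` class, the `heads_by_level` helper, and the target names are hypothetical, and the list covers only the modalities named in this summary rather than all 12.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PretextTarget:
    name: str   # hypothetical identifier, not taken from the released code
    level: str  # "pixel" = dense map, "image" = one value or label per image
    loss: str   # "mse" or "cross_entropy"

# Illustrative subset based on the modalities mentioned in this summary.
TARGETS = [
    PretextTarget("sentinel2", "pixel", "mse"),
    PretextTarget("sentinel1_sar", "pixel", "mse"),
    PretextTarget("dem", "pixel", "mse"),
    PretextTarget("canopy_height", "pixel", "mse"),
    PretextTarget("temperature", "image", "mse"),
    PretextTarget("precipitation", "image", "mse"),
    PretextTarget("biome", "image", "cross_entropy"),
    PretextTarget("month", "image", "cross_entropy"),
]

def heads_by_level(targets):
    """Group target names by the kind of decoder head they would need."""
    groups = {"pixel": [], "image": []}
    for target in targets:
        groups[target.level].append(target.name)
    return groups

print(heads_by_level(TARGETS))
```

Under this reading, each "pixel" entry would get a dense reconstruction head and each "image" entry a scalar regression or classification head, all fed by the same encoder output.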

Modeling

Base Model: ConvNeXt V2 (Base, Large, Huge variants explored)

Training Method: Multi-Pretext Masked Autoencoder (MP-MAE) pretraining followed by Fine-tuning or Linear Probing

Objective Functions:

Purpose: Reconstruct pixel-level modalities.

Formally: Mean Squared Error (MSE) between predicted and ground-truth maps, computed only on masked patches for Sentinel-2 and on the full image for all other pixel-level modalities.

Purpose: Predict image-level modalities.

Formally: MSE for continuous variables (temperature, precipitation) and cross-entropy for categorical variables (biome, month).
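The two objective types can be illustrated with a few lines of plain Python. This is a minimal sketch on toy numbers, not the authors' implementation: the function names are hypothetical, and the unweighted sum at the end is a placeholder assumption, since this summary does not state how the per-modality losses are combined.

```python
import math

def masked_mse(pred, target, mask=None):
    """MSE over flat value lists; with a mask, only positions marked True count."""
    if mask is None:
        mask = [True] * len(pred)
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(errors) / len(errors)

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

# Sentinel-2: loss restricted to masked positions (toy values).
s2_loss = masked_mse([0.5, 0.2, 0.9, 0.1], [0.4, 0.2, 0.5, 0.3],
                     mask=[True, False, True, False])
# Another pixel-level modality (e.g. DEM): loss over the full image.
dem_loss = masked_mse([1.0, 2.0], [1.5, 2.0])
# A categorical image-level modality (e.g. month), here with 4 toy classes.
month_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)

# Assumption: plain unweighted sum, used here only for illustration.
total_loss = s2_loss + dem_loss + month_loss
print(s2_loss, dem_loss, month_loss, total_loss)
```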

Adaptation: Fine-tuning or Linear Probing on downstream tasks
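Linear probing can be sketched as follows: features are computed once by an encoder that is never updated, and only a linear classifier is trained on top. The `frozen_encoder` function below is a hypothetical stand-in for the pretrained ConvNeXt V2 encoder, and the data, learning rate, and epoch count are toy assumptions.

```python
import math

def frozen_encoder(x):
    """Hypothetical stand-in for a pretrained encoder; it has no trainable state."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression head with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for i in range(dim):
                grad_w[i] += (p - y) * f[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, f):
    return int(sum(wi * fi for wi, fi in zip(w, f)) + b > 0)

inputs = [(0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8)]
labels = [0, 0, 1, 1]
features = [frozen_encoder(x) for x in inputs]  # computed once, encoder untouched
w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, f) == y for f, y in zip(features, labels)) / len(labels)
print(accuracy)
```

Fine-tuning differs in that the encoder weights would also receive gradient updates, which this sketch deliberately leaves out.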

Training Data:

MMEarth dataset: 1.2M locations, 128x128 pixels
Train/val/test splits are not detailed in this summary; the smaller subsets MMEarth100k and MMEarth64 are also provided.

Key Hyperparameters:

masking_ratio: 0.6
input_size: 128x128
patch_size: 32
optimizer: AdamW
learning_rate: Not explicitly reported in the paper summary
batch_size: Not explicitly reported in the paper summary
epochs: 100
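To make the masking hyperparameters concrete, the sketch below derives the patch grid from the values listed above and draws a random patch mask. The `make_patch_mask` helper is hypothetical, and rounding the number of masked patches down is an assumption; the actual implementation may round differently.

```python
import random

def make_patch_mask(input_size=128, patch_size=32, masking_ratio=0.6, seed=0):
    """Return a square grid of booleans; True marks a masked (hidden) patch."""
    if input_size % patch_size != 0:
        raise ValueError("input_size must be divisible by patch_size")
    grid = input_size // patch_size                 # patches per side
    num_patches = grid * grid
    num_masked = int(num_patches * masking_ratio)   # assumption: round down
    rng = random.Random(seed)
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [[(r * grid + c) in masked_ids for c in range(grid)]
            for r in range(grid)]

mask = make_patch_mask()
num_masked = sum(cell for row in mask for cell in row)
print(f"{num_masked} of {len(mask) ** 2} patches masked")
```

With the listed values this gives a 4x4 grid of 16 patches, of which 9 are hidden from the encoder under the round-down assumption.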

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. SatMAE: MP-MAE uses a Fully Convolutional architecture (ConvNeXt) instead of ViT, and reconstructs cross-modal targets (SAR, DEM) rather than just optical time-series.
vs. ImageNet-MAE: MP-MAE leverages domain-specific multi-modal data rather than natural images.
vs. Satlas: MP-MAE relies on fully automated alignment of modalities (including non-visible ones like climate) without human annotation.
vs. 4M [not cited in paper]: MP-MAE focuses specifically on EO modalities and uses a convolutional backbone suited for variable input sizes, whereas 4M uses a token-based transformer approach for general modalities.

Limitations

Focuses only on optical Sentinel-2 images as input for downstream tasks (does not explore multi-modal inference).
Requires aligning massive multi-modal datasets, which can be storage-intensive.
Specifics of computational cost (GPU hours) for training on the full 1.2M dataset are not detailed.

Reproducibility

Code: https://github.com/vishalned/MMEarth-train

Dataset (MMEarth), data collection code, and MP-MAE training code are all publicly available. MMEarth includes subsets (100k, 64x64) to facilitate research with limited compute.

📊 Experiments & Results

Evaluation Setup

Pretraining on MMEarth followed by evaluation on GEO-Bench downstream tasks.

Benchmarks:

M-BigEarthNet (Multi-label classification)
M-So2Sat (Land cover classification)
M-Brick-Kiln (Binary classification; object detection proxy)
M-EuroSAT (Land cover classification)
M-Cashew-Plantation (Semantic segmentation; crop type)
M-SA-Crop-Type (Semantic segmentation; crop type)
M-Neon-Tree (Semantic segmentation; tree detection)

Metrics:

Top-1 Accuracy
mean Average Precision (mAP)
F1 Score
mean Intersection over Union (mIoU)
Statistical methodology: Not explicitly reported in the paper
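For reference, three of the listed metrics can be computed as in the sketch below on toy predictions. The function names are hypothetical and the averaging choices (binary F1 for the positive class, mIoU skipping classes absent from both masks) are assumptions; GEO-Bench's exact conventions are not stated in this summary. mAP is omitted because it additionally needs ranked per-class scores.

```python
def top1_accuracy(preds, labels):
    """Fraction of examples whose predicted class equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def f1_score(preds, labels, positive=1):
    """Binary F1 for the chosen positive class."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mean_iou(pred_mask, true_mask, num_classes):
    """Mean IoU over flat per-pixel class ids; absent classes are skipped."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred_mask, true_mask))
        union = sum(p == c or t == c for p, t in zip(pred_mask, true_mask))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 1, 0]
true = [0, 1, 1, 1, 0, 0]
acc = top1_accuracy(pred, true)
f1 = f1_score(pred, true)
miou = mean_iou(pred, true, num_classes=2)
print(acc, f1, miou)
```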

Key Results

Linear probing results, demonstrating the quality of learned representations without fine-tuning:

Benchmark	Metric	Baseline	This Paper	Δ
M-So2Sat	Top-1 Accuracy	53.8	57.2	+3.4
M-EuroSAT	Top-1 Accuracy	89.2	93.3	+4.1
M-Neon-Tree	mIoU	39.9	45.6	+5.7

Comparison against domain-specific baselines (SatMAE, Satlas, Scale-MAE):

Benchmark	Metric	Baseline	This Paper	Δ
M-BigEarthNet	mAP	73.6	78.4	+4.8
M-EuroSAT	Top-1 Accuracy	90.7	93.3	+2.6

Experiment Figures

Figure: Geographic and temporal distribution of the MMEarth dataset.

Main Takeaways

MP-MAE consistently outperforms ImageNet-pretrained baselines across all evaluated downstream tasks, validating the benefit of domain-specific pretraining.
Multi-modal pretext tasks are particularly effective for linear probing, suggesting that they yield more robust and semantically richer frozen representations than single-modality pretraining.
Reconstructing 'invisible' modalities (like Canopy Height or SAR) forces the optical encoder to learn structural and textural features that are critical for tasks like tree segmentation and land cover mapping.

📚 Prerequisite Knowledge

Prerequisites

Masked Autoencoders (MAE)
Convolutional Neural Networks (ConvNeXt)
Remote Sensing fundamentals (Optical vs. SAR, DEM)
Self-supervised learning concepts (pretext tasks, linear probing)

Key Terms

MAE: Masked Autoencoder—a self-supervised learning method that masks parts of an input and trains a model to reconstruct the missing parts

Sentinel-2: An optical Earth observation satellite mission providing multi-spectral imaging

Sentinel-1: A satellite mission providing Synthetic Aperture Radar (SAR) imaging, which captures surface texture and roughness independent of cloud cover

DEM: Digital Elevation Model—3D representation of terrain elevation

SAR: Synthetic Aperture Radar—active remote sensing technology using radar waves

FCMAE: Fully Convolutional Masked Autoencoder—an MAE variant using Convolutional Neural Networks (CNNs) instead of Transformers

MP-MAE: Multi-Pretext Masked Autoencoder—the proposed method extending FCMAE to reconstruct multiple modalities

MMEarth: The proposed dataset containing 1.2M locations with 12 aligned modalities

Linear Probing: Evaluating a pretrained encoder by freezing its weights and training a simple linear classifier on top

L2A/L1C: Sentinel-2 processing levels. L1C is top-of-atmosphere reflectance; L2A is bottom-of-atmosphere (atmospherically corrected) reflectance