Self-Supervised Multi-Modal World Model with 4D Space-Time Embedding

📝 Paper Summary

Spatio-temporal representation learning · Earth observation · Ecological forecasting

DeepEarth introduces Earth4D, a planetary-scale 4D hash encoding that efficiently maps space-time coordinates to embeddings, enabling a multi-modal world model to forecast ecological states with high precision.

Core Problem

Existing Earth observation models struggle to efficiently scale high-resolution representations across both vast planetary spaces and long temporal periods while handling multi-modal data.

Why it matters:

Accurate ecological forecasting (e.g., wildfire risk) requires precise modeling of conditions that vary rapidly in both space and time.
Traditional positional encoders fail to handle the collision trade-offs inherent in mapping the entire Earth's surface over centuries at sub-meter resolution.
Current foundation models often require massive pre-training data yet still lack the specialized spatio-temporal resolution needed for localized predictions.

Concrete Example: When predicting Live Fuel Moisture Content (LFMC) for wildfire risk, a model that encodes only location can conflate measurements taken at the same site at different times. DeepEarth hashes the full (lat, lon, elev, time) coordinate, so a wet winter measurement and a dry summer measurement at the same GPS coordinate receive distinct embeddings.

Key Novelty

Earth4D: Learnable 4D Multi-Resolution Hash Encoding

Extends 3D hash encoding to 4D by concatenating features from one spatial grid (xyz) and three spatio-temporal grids (xyt, yzt, xzt), enabling efficient space-time indexing.
Integrates learned hash probing to dynamically resolve hash collisions, allowing the model to learn optimal memory allocation patterns rather than relying on static hashing.
Fuses these 4D embeddings with multi-modal data (vision, language) in a self-supervised autoencoder framework to learn unified Earth representations.
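A minimal sketch of the decomposed-grid idea, assuming coordinates already normalized to [0, 1] and using nearest-vertex lookup instead of interpolation. The class names, level counts, and table sizes are illustrative assumptions, not the paper's implementation (which relies on custom CUDA kernels); only the hash multipliers are taken from Instant-NGP.

```python
import random

# Multipliers from the Instant-NGP spatial hash (the first axis uses 1).
PRIMES = (1, 2654435761, 805459861)


def hash_index(cell, table_size):
    """XOR-hash a non-negative integer 3D grid cell into [0, table_size)."""
    h = 0
    for coord, prime in zip(cell, PRIMES):
        h ^= coord * prime
    return h % table_size


class HashGrid3D:
    """One multi-resolution hash grid over three normalized axes (hypothetical)."""

    def __init__(self, n_levels=4, base_res=4, growth=2.0,
                 log2_table_size=10, n_features=2, seed=0):
        rng = random.Random(seed)
        self.resolutions = [int(base_res * growth ** lvl) for lvl in range(n_levels)]
        self.table_size = 2 ** log2_table_size
        # One table of small random feature vectors per level (learned in practice).
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(n_features)]
             for _ in range(self.table_size)]
            for _ in self.resolutions
        ]

    def encode(self, u, v, w):
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            # Simplification: nearest grid vertex; real encoders interpolate corners.
            cell = (round(u * res), round(v * res), round(w * res))
            feats.extend(table[hash_index(cell, self.table_size)])
        return feats


class Earth4DSketch:
    """Concatenates one spatial grid (xyz) and three spatio-temporal grids."""

    GRID_AXES = ("xyz", "xyt", "yzt", "xzt")

    def __init__(self):
        self.grids = {axes: HashGrid3D(seed=i) for i, axes in enumerate(self.GRID_AXES)}

    def encode(self, x, y, z, t):
        values = {"x": x, "y": y, "z": z, "t": t}
        out = []
        for axes, grid in self.grids.items():
            out.extend(grid.encode(*(values[a] for a in axes)))
        return out


encoder = Earth4DSketch()
winter = encoder.encode(0.3, 0.6, 0.1, 0.10)  # same site, early in the time range
summer = encoder.encode(0.3, 0.6, 0.1, 0.60)  # same site, later in the time range
```

With these settings each embedding has 4 grids × 4 levels × 2 features = 32 values. The two samples share their xyz features but differ in the features drawn from the three time-dependent grids, which is what lets a downstream model separate measurements taken at one site at different times.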

Architecture

[Figure: system architecture of DeepEarth and the internal structure of the Earth4D encoder.]

Evaluation Highlights

+35.0% relative improvement in R² (0.783 vs 0.58) on Live Fuel Moisture Content prediction by adding learned hash probing to standard hash encoding.
Surpasses the Galileo foundation model (MAE 11.7pp vs 12.6pp) using only coordinates and species embeddings, despite Galileo using massive multi-modal remote sensing data.
Achieves a 99.3% parameter reduction (5M vs 800M) while retaining R² 0.668 (vs 0.58 for the baseline), enabling efficient planetary-scale modeling.

Breakthrough Assessment

8/10

Significant architectural innovation in 4D encoding that outperforms larger foundation models on key ecological tasks with far less data/compute. The learned hashing for space-time collisions is a strong technical contribution.

⚙️ Technical Details

Problem Definition

Setting: Regression and reconstruction of ecological variables across continuous space-time coordinates

Inputs: Spatio-temporal coordinates (latitude, longitude, elevation, time) and optional multi-modal context (species type, sensors)

Outputs: Predicted ecological variable (e.g., Live Fuel Moisture Content percentage)

Pipeline Flow

Input Coordinates (x,y,z,t) → Earth4D Encoder
Input Metadata (e.g., Species) → Embedding Layer
Fusion (Concatenation) → MLP Decoder → Prediction
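The three pipeline steps can be sketched as a single forward pass. This is an untrained toy under stated assumptions: the coordinate encoder is a sinusoidal placeholder standing in for Earth4D (not hash encoding), and the species vocabulary, layer sizes, and function names are hypothetical.

```python
import math
import random

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def coord_encoder_placeholder(x, y, z, t):
    """Stand-in for the Earth4D encoder: plain sin/cos features, 8 values."""
    return [f(2 * math.pi * c) for c in (x, y, z, t) for f in (math.sin, math.cos)]


SPECIES = ["chamise", "sagebrush", "ponderosa_pine"]  # hypothetical vocabulary
species_table = rand_matrix(len(SPECIES), 4)           # one embedding row per species

W1, b1 = rand_matrix(16, 12), [0.0] * 16               # 12 = 8 coord + 4 species
W2, b2 = rand_matrix(1, 16), [0.0]


def predict_lfmc(x, y, z, t, species):
    coord_feat = coord_encoder_placeholder(x, y, z, t)       # coordinates -> encoder
    species_feat = species_table[SPECIES.index(species)]     # metadata -> embedding
    fused = coord_feat + species_feat                        # fusion by concatenation
    hidden = [max(0.0, h) for h in linear(W1, b1, fused)]    # MLP hidden layer (ReLU)
    return linear(W2, b2, hidden)[0]                         # scalar prediction


prediction = predict_lfmc(0.3, 0.6, 0.1, 0.5, "chamise")
```

Concatenation keeps the coordinate and species features in separate slots of the fused vector, so the MLP decides how to combine them.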

System Modules

Earth4D Encoder (Input Processing)

Map continuous space-time coordinates to high-dimensional embeddings

Model or implementation: 4D Multi-resolution Hash Encoding with Learned Probing

Species Embedder (Input Processing)

Encode categorical species data

Model or implementation: Learnable Embedding Layer

Predictor MLP

Map fused features to target variable

Model or implementation: Multi-Layer Perceptron

Novel Architectural Elements

Extension of 3D hash encoding to 4D via decomposed grids (xyz, xyt, yzt, xzt) to handle time efficiently
Integration of learned hash probing into the 4D encoder to dynamically resolve space-time collisions
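One way to picture learned hash probing: a fixed hash picks a small block of candidate slots, and a learned score per candidate decides which slot is read. The sketch below is loosely modeled on the learned-probing scheme used for compact neural graphics primitives; whether Earth4D uses exactly this indexing is an assumption, and all sizes, the second hash, and the function names are hypothetical.

```python
import math

N_PROBES = 4          # candidates per lookup (hypothetical value)
TABLE_SIZE = 64       # feature-table entries, a multiple of N_PROBES
CODEBOOK_SIZE = 256   # slots holding learned probe scores

MULT_BASE = (1, 2654435761, 805459861)  # Instant-NGP spatial-hash multipliers
MULT_PROBE = (1, 40503, 69069)          # arbitrary second hash, for illustration


def spatial_hash(cell, multipliers, size):
    h = 0
    for coord, mult in zip(cell, multipliers):
        h ^= coord * mult
    return h % size


# Scalar features keep the sketch short; real tables store small vectors.
feature_table = [float(i) for i in range(TABLE_SIZE)]
# probe_logits[slot][k] scores candidate k; training would update these values.
probe_logits = [[0.0] * N_PROBES for _ in range(CODEBOOK_SIZE)]


def candidate_indices(cell):
    """A fixed hash selects a block of N_PROBES adjacent table slots."""
    base = N_PROBES * spatial_hash(cell, MULT_BASE, TABLE_SIZE // N_PROBES)
    return [base + k for k in range(N_PROBES)]


def probe_slot(cell):
    return spatial_hash(cell, MULT_PROBE, CODEBOOK_SIZE)


def lookup(cell, training=False):
    cand = candidate_indices(cell)
    logits = probe_logits[probe_slot(cell)]
    if training:
        # Soft selection: a softmax-weighted mix keeps the choice differentiable.
        peak = max(logits)
        weights = [math.exp(v - peak) for v in logits]
        total = sum(weights)
        return sum(w / total * feature_table[i] for w, i in zip(weights, cand))
    # Hard selection at inference: read only the best-scoring candidate.
    best = max(range(N_PROBES), key=lambda k: logits[k])
    return feature_table[cand[best]]
```

Two cells that collide under the base hash can still end up in different slots if their probe scores differ, which is how a learned choice can reduce the damage from collisions compared with a static hash.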

Modeling

Base Model: DeepEarth (Custom Architecture with Earth4D Encoder)

Training Method: Supervised Regression (for LFMC task) / Self-Supervised Masked Reconstruction (general pre-training)

Objective Functions:

Purpose: Minimize error between predicted and actual moisture content.

Formally: Mean Absolute Error (MAE) or MSE loss on LFMC values.
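Written out, under the assumption that $\hat{y}_i$ is the predicted and $y_i$ the measured LFMC for sample $i$ out of $N$ (the symbols are introduced here for illustration), the two candidate losses are the standard definitions:

```latex
\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|,
\qquad
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```

Since LFMC is a percentage, MAE is expressed in percentage points (pp), matching the units of the reported results.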

Adaptation: Full model training

Trainable Parameters: Ranging from 5M (compressed) to 800M (dense baseline)

Training Data:

Globe-LFMC 2.0 dataset
Train/Test split: 76,467 / 13,297 samples

Key Hyperparameters:

embedding_dimension: 192
hash_capacity_dense: 2^{22}
hash_capacity_compressed: 2^{14}
learned_probing: Enabled (reduces validation loss by 18% on RGB)
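To make the two hash capacities concrete, the listed hyperparameters can be collected in a small configuration object. The class and field layout below are hypothetical, not the repository's actual configuration API; only the values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Earth4DConfig:
    """Hypothetical container for the hyperparameters listed above."""
    embedding_dimension: int = 192
    log2_hash_capacity: int = 22   # dense setting: 2^22 entries
    learned_probing: bool = True

    @property
    def hash_capacity(self):
        return 2 ** self.log2_hash_capacity


dense = Earth4DConfig()
compressed = Earth4DConfig(log2_hash_capacity=14)  # compressed setting: 2^14 entries

capacity_ratio = dense.hash_capacity // compressed.hash_capacity
print(dense.hash_capacity, compressed.hash_capacity, capacity_ratio)
```

The dense capacity is 4,194,304 entries and the compressed capacity is 16,384, a 256-fold difference in hash capacity.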

Compute: 4x training speedup compared to dense baseline; 93% memory reduction in compressed mode

Comparison to Prior Work

vs. Galileo: DeepEarth uses only coordinates+species vs. Galileo's heavy multi-modal inputs, yet DeepEarth achieves higher accuracy via better spatio-temporal indexing
vs. Instant-NGP: Extends 3D hashing to 4D and adds learned probing to handle collisions in planetary-scale data
vs. SatMAE [not cited in paper]: SatMAE uses masked autoencoding on satellite patches; DeepEarth encodes continuous space-time coordinates directly, avoiding patch-based discretization artifacts

Limitations

Currently evaluated primarily on a single ecological benchmark (LFMC)
Reliance on custom CUDA kernels may complicate deployment on non-NVIDIA hardware
Performance gain relies heavily on learned hash probing; standard hashing performs significantly worse

Reproducibility

Code: https://github.com/legel/deepearth

Pretrained models are available for download. Training relies on the open Globe-LFMC 2.0 dataset. The Earth4D module requires custom CUDA kernels.

📊 Experiments & Results

Evaluation Setup

Ecological forecasting regression task

Benchmarks:

Globe-LFMC 2.0 (Live Fuel Moisture Content prediction, regression)

Metrics:

Mean Absolute Error (MAE)
R-squared (R²)
Root Mean Square Error (RMSE)
Statistical methodology: Not explicitly reported in the paper
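For reference, the three metrics follow their standard definitions. The sketch below uses made-up LFMC values purely for illustration; the function names are not taken from the paper's code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target (pp for LFMC)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Fraction of target variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (LFMC in percent), not data from the paper.
y_true = [80.0, 100.0, 120.0, 140.0]
y_pred = [90.0, 100.0, 110.0, 140.0]

print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

On this toy data the MAE is 5.0 pp, the RMSE is about 7.07 pp, and R² is 0.9.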

Key Results

Main comparison against the Galileo foundation model on the Globe-LFMC 2.0 benchmark:

Benchmark	Metric	Baseline	This Paper	Δ
Globe-LFMC 2.0	MAE (pp)	12.6	11.7	-0.9
Globe-LFMC 2.0	R²	0.72	0.783	+0.063

Ablation study demonstrating the critical impact of learned hash probing and model compression:

Benchmark	Metric	Baseline	This Paper	Δ
Globe-LFMC 2.0	MAE (pp)	16.6	11.7	-4.9
Globe-LFMC 2.0	R²	0.58	0.783	+0.203
Globe-LFMC 2.0	R²	0.58	0.668	+0.088
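The Δ column is the absolute difference (this paper minus baseline), while the +35.0% figure quoted in the highlights is a relative change. A short check of that arithmetic, with row labels added here for readability:

```python
# (label, metric, baseline, this_paper) taken from the results rows above.
rows = [
    ("vs Galileo", "MAE (pp)", 12.6, 11.7),
    ("vs Galileo", "R2", 0.72, 0.783),
    ("ablation", "MAE (pp)", 16.6, 11.7),
    ("ablation", "R2", 0.58, 0.783),
    ("ablation, compressed", "R2", 0.58, 0.668),
]

deltas = []
for label, metric, baseline, ours in rows:
    absolute = round(ours - baseline, 3)                       # the table's delta column
    relative = round(100.0 * (ours - baseline) / baseline, 1)  # percent change
    deltas.append((absolute, relative))
    print(f"{label:22s} {metric:9s} delta={absolute:+.3f} relative={relative:+.1f}%")
```

For MAE a negative Δ is an improvement; for R² a positive Δ is. The 0.58 to 0.783 row gives an absolute gain of 0.203 and a relative gain of 35.0%.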

Main Takeaways

Earth4D outperforms the heavy foundation model Galileo on LFMC prediction despite using significantly less input data (coordinates vs. full remote sensing imagery).
Learned hash probing is critical: it turns a mediocre model (R² 0.58) into a state-of-the-art one (R² 0.783) by resolving hash collisions.
The architecture is highly efficient: a 5M parameter version outperforms an 800M baseline (R² 0.668 vs 0.58), with 4x faster training and a 93% memory reduction.

📚 Prerequisite Knowledge

Prerequisites

Multi-resolution hash encoding (Instant-NGP)
Self-supervised learning (masked autoencoders)
Transformer architectures
Geospatial coordinate systems

Key Terms

Earth4D: A novel 4D positional encoder that maps continuous space-time coordinates (x, y, z, t) to learnable embeddings using multi-resolution hash grids

hash collision: When different space-time coordinates map to the same index in a hash table because the table has fewer entries than there are grid cells; Earth4D mitigates this with learned probing

learned hash probing: A differentiable method that learns to select the optimal index from a set of candidate hash indices, reducing collision errors

LFMC: Live Fuel Moisture Content—the percentage of water in vegetation relative to dry weight, a key metric for wildfire risk

MAE: Mean Absolute Error—measure of prediction error

pp: percentage points—unit difference between two percentages

SAR: Synthetic Aperture Radar—active remote sensing that works day/night and through clouds

ERA-5: A global climate reanalysis dataset providing hourly weather estimates