LEPA: Learning Geometric Equivariance in Satellite Remote Sensing Data with a Predictive Architecture

📝 Paper Summary

Satellite Remote Sensing Foundation Models Geometric Equivariance

LEPA replaces unreliable geometric interpolation of satellite patch embeddings with a learned predictive model that accurately transforms embeddings to match user-defined areas of interest.

Core Problem

Standard interpolation methods (like bilinear interpolation) fail when applied to patch embeddings from foundation models because the embedding manifold is highly non-convex.

Why it matters:

Users often need to align precomputed satellite embeddings to specific geographic areas of interest that do not match the fixed precomputed grid.
Re-encoding large satellite images for every new alignment is computationally expensive and creates data-transfer bottlenecks.
Naive interpolation of embeddings results in unrealistic representations that degrade downstream performance.

Concrete Example: When rotating Prithvi-EO-2.0 patch embeddings by 90 degrees in latent space and reconstructing the image, the patches themselves are visible but their spatial relation is distorted. Naive bilinear interpolation of these vectors results in meaningless embeddings, yielding a Mean Reciprocal Rank (MRR) below 0.2.
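
To make "naive bilinear interpolation" concrete, the sketch below interpolates a grid of patch embeddings at a fractional position by averaging the four neighbouring vectors. It is a minimal illustration, not the paper's code: the function name `bilinear_sample`, the toy grid shape, and the unit-norm assumption are all hypothetical.

```python
import numpy as np

def bilinear_sample(grid, y, x):
    """Bilinearly interpolate a (H, W, D) embedding grid at fractional (y, x)."""
    h, w, _ = grid.shape
    y = min(max(y, 0.0), h - 1.0)          # clamp to the grid
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0                 # fractional offsets
    top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
    bottom = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
    return (1 - wy) * top + wy * bottom

# Toy 4x4 grid of 8-dim embeddings, normalised to unit length (assumption).
rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 4, 8))
grid /= np.linalg.norm(grid, axis=-1, keepdims=True)

mid = bilinear_sample(grid, 0.5, 0.5)       # equal-weight average of four patches
mid_norm = float(np.linalg.norm(mid))
```

In this toy setup the averaged vector has a norm below 1, so it no longer lies on the unit sphere that the inputs occupy. This only illustrates why averaging can leave the set of valid embeddings; it does not reproduce the MRR figures reported in the paper.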

Key Novelty

Learned Equivariance-Predicting Architecture (LEPA)

Instead of averaging embedding vectors to handle geometric transformations (rescaling, rotation, translation), LEPA trains a predictor to generate the correct transformed embedding given the original context and transformation parameters.
The model extends I-JEPA by conditioning the predictor on geometric augmentation parameters, effectively learning a 'world model' of how embeddings change under geometric shifts.

Architecture

The LEPA training architecture illustrating the Context Encoder, Predictor, and Target Encoder workflow.

Evaluation Highlights

Increases Mean Reciprocal Rank (MRR) for geometric adjustment from <0.2 (standard interpolation) to >0.8 (LEPA with fine-tuning).
Achieves competitive semantic segmentation performance on the PANGAEA benchmark using an ImageNet-pretrained I-JEPA, outperforming HLS-trained models on the MADOS dataset.
Demonstrates that standard interpolation methods break down for patch embeddings, while the learned predictor maintains high cosine similarity to target embeddings.

Breakthrough Assessment

7/10

Identifies a critical flaw in the common practice of interpolating patch embeddings and provides a scalable, learned alternative. The jump in MRR (from below 0.2 to above 0.8) is substantial, though downstream segmentation gains are mixed.

⚙️ Technical Details

Problem Definition

Setting: Aligning precomputed patch embeddings from a fixed grid to a user-defined geometric area of interest.

Inputs: Context patch embeddings E(x) and transformation parameters t (translation, rotation, scaling).

Outputs: Predicted transformed embeddings E(T(x)) that approximate the embeddings of the geometrically transformed input image.

Pipeline Flow

Context Encoder (processes unmasked image regions)
Transformation Conditioning (injects geometric parameters)
Predictor (generates transformed target embeddings)
Target Encoder (generates ground truth embeddings for loss calculation)
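
The four stages above can be traced with a toy data-flow sketch. Everything here is a stand-in chosen for illustration: the encoders and predictor are single random linear maps rather than ViTs, the transformation is a 90-degree rotation via `np.rot90`, and the conditioning is a single appended angle value. Only the order of operations mirrors the pipeline described in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
patch, dim = 4, 8

def patchify(image, patch):
    """Split a (H, W) image into flattened non-overlapping patches: (N, patch*patch)."""
    h, w = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return tiles.reshape(-1, patch * patch)

# Hypothetical linear stand-ins for the real modules.
w_enc = rng.normal(scale=0.1, size=(patch * patch, dim))   # context encoder
w_tgt = w_enc.copy()                                        # target encoder (copy of context encoder)
w_pred = rng.normal(scale=0.1, size=(dim + 1, dim))         # predictor

image = rng.normal(size=(8, 8))
k = 1                                                       # number of 90-degree rotations

context = patchify(image, patch) @ w_enc                        # 1. context encoder
cond = np.full((context.shape[0], 1), k * np.pi / 2)            # 2. transformation conditioning
predicted = np.concatenate([context, cond], axis=1) @ w_pred    # 3. predictor
target = patchify(np.rot90(image, k), patch) @ w_tgt            # 4. target encoder on transformed image
```

The predictor never sees the rotated image; it receives only the context embeddings and the transformation parameters, while `target` supplies the supervision signal.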

System Modules

Context Encoder (Encoding)

Encodes the input image context into patch embeddings.

Model or implementation: ViT-base

Predictor

Predicts the embeddings of the target image (which is a geometric transformation of the context) conditioned on the transformation parameters.

Model or implementation: Transformer-based predictor

Target Encoder (Encoding)

Produces the ground truth embeddings from the geometrically transformed image to supervise the predictor.

Model or implementation: ViT-base (Exponential Moving Average of Context Encoder)
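
An exponential moving average (EMA) target encoder can be sketched as follows. The parameter dictionaries and the momentum value of 0.996 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def ema_update(target_params, context_params, momentum=0.996):
    """Move each target parameter a small step toward the matching context parameter."""
    return {
        name: momentum * target_params[name] + (1.0 - momentum) * context_params[name]
        for name in target_params
    }

# Toy parameters: the target starts at zero, the context encoder at one.
context = {"w": np.ones((2, 2)), "b": np.zeros(2)}
target = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
target = ema_update(target, context)
```

With a momentum close to 1, the target encoder changes slowly, which gives the predictor a more stable regression target than the rapidly updated context encoder would.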

Novel Architectural Elements

Conditioned Predictor: The JEPA predictor is conditioned on explicit geometric transformation parameters (rotation, scale, translation) projected via an MLP.
Conditioned Positional Encodings: Height and width positional indices are centered around the image center to better reflect position changes under transformation.
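
Both elements can be sketched in a few lines. The summary only states that the parameters are projected via an MLP and that positional indices are centered, so the details below are assumptions: the parameter layout `(dx, dy, rotation, log-scale)`, the two-layer ReLU MLP, the layer sizes, and the additive injection into the tokens.

```python
import numpy as np

def centered_indices(n):
    """Positional indices centered on the image center, e.g. n=4 -> [-1.5, -0.5, 0.5, 1.5]."""
    return np.arange(n) - (n - 1) / 2.0

def mlp_project(params, w1, b1, w2, b2):
    """Two-layer MLP mapping transformation parameters to a conditioning vector."""
    hidden = np.maximum(params @ w1 + b1, 0.0)   # ReLU activation (assumed)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 32

# Hypothetical parameter layout: (dx, dy, rotation in radians, log-scale).
t = np.array([0.1, -0.2, np.pi / 2, 0.0])
w1 = rng.normal(scale=0.1, size=(4, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
b2 = np.zeros(embed_dim)

cond = mlp_project(t, w1, b1, w2, b2)            # shape (embed_dim,)
tokens = rng.normal(size=(4 * 4, embed_dim))     # 4x4 grid of context tokens
conditioned = tokens + cond                      # additive injection (assumed), broadcast over tokens
rows = centered_indices(4)
```

One reason centering helps: a 180-degree rotation about the image center simply negates centered indices (`-rows` equals `rows` reversed), whereas corner-anchored indices would need an extra offset.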

Modeling

Base Model: ViT-base

Training Method: Self-supervised learning (I-JEPA style) with an auxiliary geometric prediction task

Objective Functions:

Purpose: Minimize distance between predicted and actual embeddings.

Formally: L2 distance in embedding space between Predictor output and Target Encoder output.
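
A minimal sketch of such an objective is shown below. Whether the paper squares the distance or how it averages over patches is not stated in this summary, so the squared form and the mean over patches are assumptions, as is the function name.

```python
import numpy as np

def l2_embedding_loss(predicted, target):
    """Mean over patches of the squared L2 distance between embeddings."""
    diff = predicted - target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

predicted = np.zeros((3, 4))    # 3 patches, 4-dim embeddings
target = np.ones((3, 4))
loss = l2_embedding_loss(predicted, target)   # each patch differs by 1 in 4 dims
```

In JEPA-style training the target embeddings are typically treated as constants, so gradients flow only through the predictor and context encoder.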

Training Data:

ImageNet-1k
HLS (Harmonized Landsat-Sentinel) dataset

Key Hyperparameters:

epochs: 50
augmentations: Translation (x/y), Rotation, Scaling
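
The three augmentation types combine into a single affine transform. The sketch below builds one as a 3x3 homogeneous matrix; the composition order (scale and rotate about the image center, then translate) and the function name are assumptions for illustration, not the paper's parameterisation.

```python
import numpy as np

def affine_about_center(dx, dy, angle_rad, scale, center):
    """3x3 homogeneous matrix: scale and rotate about `center`, then translate by (dx, dy)."""
    cx, cy = center
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=float)
    rot_scale = np.array(
        [[scale * c, -scale * s, 0], [scale * s, scale * c, 0], [0, 0, 1]], dtype=float
    )
    back = np.array([[1, 0, cx + dx], [0, 1, cy + dy], [0, 0, 1]], dtype=float)
    return back @ rot_scale @ to_origin

# Pure 90-degree rotation about the center of a 4x4 patch grid.
m = affine_about_center(dx=0.0, dy=0.0, angle_rad=np.pi / 2, scale=1.0, center=(1.5, 1.5))
corner = m @ np.array([0.0, 0.0, 1.0])      # where grid position (0, 0) lands
center = m @ np.array([1.5, 1.5, 1.0])      # the center is a fixed point
```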

Compute: Not reported in the paper

Comparison to Prior Work

vs. Prithvi-EO-2.0: LEPA uses a predictive architecture to handle geometric transforms, whereas Prithvi relies on standard interpolation (which fails).
vs. I-JEPA: LEPA conditions the predictor on geometric parameters to enforce equivariance, whereas standard I-JEPA focuses on predicting masked regions.
vs. DINOv2 [not cited in paper]: DINOv2 focuses on invariant features via distillation; LEPA explicitly models equivariance to geometric transformations.

Limitations

Bilinear interpolation breaks down for patch embeddings.
Requires training a specific predictor, adding architectural complexity compared to simple interpolation.
HLS models do not benefit from CLS-tokens as much as ImageNet models, likely due to the lack of a central subject.
Fine-tuning positional encodings can sometimes reduce embedding equivariance.

Reproducibility

Code: https://github.com/embed2scale/LEPA

The paper specifies the datasets (ImageNet-1k, HLS) and the base architecture (ViT-base), and describes the HLS preprocessing (temporal sequences, cloud filtering).

📊 Experiments & Results

Evaluation Setup

Evaluation of embedding quality via semantic segmentation benchmarks and equivariance via rank-based metrics.

Benchmarks:

PANGAEA: semantic segmentation (multipurpose EO benchmark)
ImageNet-1k: image classification (used for pretraining/baseline)

Metrics:

Mean Reciprocal Rank (MRR)
Semantic Segmentation Score (Normalized)
Statistical methodology: Not explicitly reported in the paper
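
The MRR metric can be sketched as a retrieval task: for each query embedding, candidates are ranked by cosine similarity and the reciprocal of the correct candidate's rank is averaged. The function name, array shapes, and tie-handling rule below are assumptions; the paper's exact candidate set is not described in this summary.

```python
import numpy as np

def mean_reciprocal_rank(queries, candidates, correct_idx):
    """queries: (N, D); candidates: (N, K, D); correct_idx: (N,) index of the true match."""
    q = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=-1, keepdims=True)
    sims = np.einsum("nd,nkd->nk", q, c)                    # cosine similarities
    correct_sim = sims[np.arange(len(sims)), correct_idx]
    # Rank = 1 + number of candidates scoring strictly higher than the correct one.
    ranks = 1 + np.sum(sims > correct_sim[:, None], axis=1)
    return float(np.mean(1.0 / ranks))

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = np.array([
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],   # correct candidate (index 0) ranks 1st
    [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],    # correct candidate (index 0) ranks 2nd
])
mrr = mean_reciprocal_rank(queries, candidates, np.array([0, 0]))
```

Here the reciprocal ranks are 1 and 1/2, giving an MRR of 0.75. A score near 1 means the correct transformation is almost always the top match.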

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Equivariance Test (HLS)	MRR	0.19	0.68	+0.49
Equivariance Test (HLS)	MRR	0.19	0.80	+0.61
Equivariance Test (ImageNet)	MRR	0.12	0.67	+0.55

Experiment Figures

Visual reconstruction of rotated and downsampled patch embeddings.

Comparison of embedding augmentations: Image Space vs. LEPA Predictor vs. Nearest Neighbor.

Main Takeaways

Standard interpolation (bilinear/nearest neighbor) is unsuitable for patch embeddings, yielding MRR scores < 0.2.
Conditioning a predictor on geometric transformations (LEPA) allows for accurate embedding adjustment without re-encoding, raising MRR to > 0.8.
ImageNet-pretrained I-JEPA models perform surprisingly well on EO tasks, outperforming HLS-trained models on specific datasets like MADOS.
A CLS-token improves equivariance for ImageNet models but can decrease it for HLS models, likely due to the distributed nature of geospatial semantic information.

📚 Prerequisite Knowledge

Prerequisites

ViT (Vision Transformer) architecture and patch embeddings
MAE (Masked Autoencoder) concepts
Self-supervised learning
Geometric transformations (affine transforms)

Key Terms

JEPA: Joint-Embedding Predictive Architecture—a self-supervised framework where a predictor tries to predict the embeddings of masked regions based on context embeddings.

I-JEPA: Image-based JEPA—a specific JEPA variant operating on image patches.

Equivariance: A property where transforming an input image (e.g., rotating it) results in a corresponding predictable transformation in the embedding space.

MRR: Mean Reciprocal Rank—a metric used here to evaluate equivariance by ranking the correct geometric transformation among a set of augmented embeddings based on cosine similarity.

Patch Embeddings: Vector representations of small, fixed-size square regions of an image (patches), produced by Vision Transformers.

HLS: Harmonized Landsat-Sentinel—a dataset combining imagery from Landsat and Sentinel satellites, commonly used for Earth observation.

Prithvi-EO-2.0: A foundation model for Earth observation data based on the MAE architecture.

CLS token: Classification token—a special token prepended to the input sequence in Transformers to aggregate global image information.