Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment

📝 Paper Summary

Remote Sensing Vision-Language Models (VLMs)

GRAFT trains open-vocabulary remote sensing models without text annotations by using geotagged ground-level internet images to bridge satellite imagery with CLIP's pre-trained text-image space.

Core Problem

Training open-vocabulary vision-language models for satellite imagery is difficult because massive paired text-image datasets (like those for internet images) do not exist for remote sensing.

Why it matters:

Existing remote sensing models are specialized for pre-defined concepts, limiting flexibility for analysts who need to query novel objects (e.g., 'farmlands in Massachusetts') without retraining.
Manual annotation of satellite imagery is expensive and requires expertise, resulting in datasets more than four orders of magnitude smaller than internet-scale data (10k vs 400 million pairs).
Current methods relying on small captioned datasets or direct CLIP fine-tuning fail to generalize effectively to the diverse, unannotated nature of global satellite data.

Concrete Example: An analyst wants to find 'baseball fields' in a city. A traditional supervised model trained only on 'buildings' and 'roads' cannot do this. A standard CLIP model fails because satellite viewpoints look nothing like ground photos. GRAFT solves this by learning that the satellite view of a field corresponds to ground photos of baseball games, which CLIP already understands.

Key Novelty

Ground images as a semantic bridge (GRAFT)

Uses geotagged internet images (ground view) as an intermediary to connect satellite images (overhead view) to language, avoiding the need for direct satellite-text pairs.
Aligns a satellite image encoder to the CLIP image space by pulling satellite embeddings closer to the embeddings of ground images taken at the same location.
Extends alignment to the pixel level by mapping specific ground image locations to corresponding patches in the satellite view, enabling localization and segmentation.
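
The pixel-level mapping above can be sketched as a small geometry helper. This is a minimal illustration, not the paper's implementation: it assumes a north-up image centered on a known coordinate, a flat-earth (equirectangular) approximation, and a 16-pixel ViT patch grid. The function name `geotag_to_patch` and the constant `METERS_PER_DEGREE` are hypothetical.

```python
import math

# Approximate length of one degree of latitude in meters (assumption for this sketch).
METERS_PER_DEGREE = 111_320.0


def geotag_to_patch(lat, lon, center_lat, center_lon,
                    image_size=224, meters_per_pixel=1.0, patch_size=16):
    """Return (row, col) of the patch containing a geotag, or None if it falls outside."""
    # Offsets from the image center in meters (east and north are positive).
    east = (lon - center_lon) * METERS_PER_DEGREE * math.cos(math.radians(center_lat))
    north = (lat - center_lat) * METERS_PER_DEGREE
    # Image rows grow southward, so a northward offset reduces the row index.
    col = image_size / 2 + east / meters_per_pixel
    row = image_size / 2 - north / meters_per_pixel
    if not (0 <= row < image_size and 0 <= col < image_size):
        return None
    return int(row // patch_size), int(col // patch_size)


# A photo taken at the image center lands in the middle of the 14x14 patch grid.
center_patch = geotag_to_patch(42.0, -71.0, 42.0, -71.0)
```

With `meters_per_pixel=1.0` this corresponds to NAIP-like resolution; setting it to `10.0` would model Sentinel-2-like resolution under the same assumptions.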

Architecture

The training pipeline for GRAFT showing how satellite images are aligned with ground images.

Evaluation Highlights

Outperforms supervised VLMs by up to 20% on zero-shot image classification tasks.
Achieves >80% relative improvement over baselines on text-to-segmentation benchmarks.
Demonstrates state-of-the-art zero-shot performance on various text-to-image retrieval benchmarks for satellite imagery.

Breakthrough Assessment

8/10

Significantly shifts the paradigm for remote sensing by removing the bottleneck of textual annotations. The large performance gains (over 80% relative improvement on segmentation) and the scalable data collection method suggest high impact.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot open-vocabulary recognition (classification, retrieval, segmentation) on satellite imagery without paired text training data.

Inputs: Satellite image s and a natural language query/concept t.

Outputs: Relevance score, classification label, or segmentation mask indicating where concept t appears in s.

Pipeline Flow

Input Processing: Satellite Image + Ground Images (Training)
Alignment: Image-Level or Pixel-Level Encoder (Training)
Inference: Satellite Encoder + CLIP Text Encoder (Zero-shot)
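
The inference step can be illustrated with a toy zero-shot classifier: the satellite embedding is compared with text embeddings of candidate prompts by cosine similarity, and the best-scoring prompt wins. This is only a sketch. The 3-dimensional vectors stand in for real encoder outputs, and the helper names (`normalize`, `cosine`, `zero_shot_classify`) are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the prompt whose text embedding is most similar to the image embedding.

    text_embeddings maps a prompt string to its embedding vector.
    """
    scores = {name: cosine(image_embedding, emb) for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for satellite-encoder and text-encoder outputs.
sat_embedding = [0.9, 0.1, 0.0]
prompts = {
    "a baseball field": [1.0, 0.0, 0.0],
    "a parking lot": [0.0, 1.0, 0.0],
}
label, scores = zero_shot_classify(sat_embedding, prompts)
```

Retrieval works the same way with the roles swapped: one text embedding is scored against many satellite embeddings and the results are ranked.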

System Modules

Satellite Encoder

Maps satellite images into the CLIP feature space.

Model or implementation: ViT-B/16 (initialized with CLIP weights)

CLIP Text Encoder

Encodes natural language queries into the shared feature space to retrieve or classify satellite images.

Model or implementation: CLIP (frozen)

SAM (Segment Anything Model)

Refines patch-level localizations into precise segmentation masks.

Model or implementation: SAM
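
One plausible way to hand patch-level localizations to a point-promptable segmenter is to convert the highest-scoring patches into pixel coordinates. The sketch below shows only that conversion; it does not call SAM, and the selection rule (top-k patches above a threshold) and the name `top_patch_points` are assumptions made for illustration.

```python
def top_patch_points(score_grid, k=3, patch_size=16, threshold=0.0):
    """Return up to k (x, y) pixel centers of the highest-scoring patches.

    score_grid is a list of rows, each a list of per-patch similarity scores
    between the patch feature and the text query.
    """
    cells = [
        (score, r, c)
        for r, row in enumerate(score_grid)
        for c, score in enumerate(row)
        if score > threshold
    ]
    # Highest similarity first.
    cells.sort(reverse=True)
    half = patch_size // 2
    return [(c * patch_size + half, r * patch_size + half) for _, r, c in cells[:k]]


# 2x2 toy grid: the top-right patch matches the query best.
points = top_patch_points([[0.1, 0.9], [0.5, -0.2]], k=2)
```

The resulting points could then be passed as positive point prompts to a segmentation model that accepts them.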

ViperGPT (Modified)

Parses complex questions into API calls that utilize the GRAFT pixel-level model.

Model or implementation: ViperGPT + GRAFT Detector

Novel Architectural Elements

Intermediary alignment topology: Satellite Encoder is trained against a *set* of ground image embeddings rather than a single paired caption.
Many-to-one contrastive loss structure: Matches one satellite image to multiple ground images (N_i) captured within its footprint.

Modeling

Base Model: ViT-B/16 (initialized from CLIP)

Training Method: Contrastive Learning with Many-to-One matching (Satellite-to-Ground)

Objective Functions:

Purpose: Align satellite embeddings with all co-located ground images while pushing away ground images from other locations.

Formally: InfoNCE-style loss where numerator sums exponentiated similarity over all N_i ground images for satellite image s_i.
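
The description above can be sketched in plain Python. The code follows the wording literally: for each satellite image, the numerator sums exponentiated similarities over its own ground images and the denominator sums over every ground image in the batch. The paper's exact normalization is not reproduced here, the temperature value is an assumption, and the names `many_to_one_loss` and `owner` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def many_to_one_loss(sat_embeddings, ground_embeddings, owner, temperature=0.07):
    """InfoNCE-style loss with several positives per satellite image.

    sat_embeddings:    list of B satellite vectors.
    ground_embeddings: list of all ground vectors in the batch.
    owner:             owner[j] is the index of the satellite image whose
                       footprint contains ground image j.
    Assumes every satellite image has at least one ground image.
    """
    total = 0.0
    for i, s in enumerate(sat_embeddings):
        logits = [dot(s, g) / temperature for g in ground_embeddings]
        positives = [logit for logit, o in zip(logits, owner) if o == i]
        # -log( sum over positives of exp / sum over all of exp )
        total += logsumexp(logits) - logsumexp(positives)
    return total / len(sat_embeddings)


sat = [[1.0, 0.0], [0.0, 1.0]]
ground = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned = many_to_one_loss(sat, ground, owner=[0, 0, 1])
swapped = many_to_one_loss(sat, ground, owner=[1, 1, 0])
```

In this toy batch the correctly paired assignment gives a much lower loss than the swapped one, which is the behavior the objective is meant to reward.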

Adaptation: Full fine-tuning of the Satellite Encoder

Training Data:

NAIP dataset: 10.2 million ground-satellite pairs (1m resolution)
Sentinel-2 dataset: 8.7 million ground-satellite pairs (10m resolution)
Ground images are sourced from Flickr (filtered for outdoor scenes); satellite images are centered on the geotags.

Key Hyperparameters:

satellite_image_size: 224x224 (training), 448x448 (downloaded for augmentation)

Compute: Not reported in the paper

Comparison to Prior Work

vs. RemoteCLIP/GeoRSCLIP: GRAFT uses no text annotations, relying solely on ground images as a bridge.
vs. SatlasPretrain: GRAFT supports open-vocabulary tasks (zero-shot), whereas SatlasPretrain is a vision-only backbone.
vs. SeCo [not cited in paper]: SeCo uses temporal contrastive learning on satellite images only; GRAFT uses cross-view ground-satellite contrastive learning.

Limitations

Relies on the availability and density of geotagged internet images (Flickr), which may be sparse in remote or sparsely populated areas.
Temporal misalignment between ground and satellite images (especially for NAIP with 2-year revisit cycles) can introduce noise.
The pixel-level supervision is sparse (loss computed only at pixels with ground images), potentially limiting dense feature learning.

Reproducibility

The data collection methodology using Flickr and EarthEngine is described in detail. Training hyperparameters (learning rate, batch size) are said to be selected using a collected validation set, but their values are not listed in the main text. No code URL is provided.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on classification, retrieval, segmentation, and VQA.

Benchmarks:

FMoW (Functional Map of the World) (Image Classification / Retrieval)
MillionAID (Image Classification / Retrieval)
PatternNet (Image Classification)
OpenStreetMap (OSM) Features (Semantic Segmentation) [New]

Metrics:

Top-1 Accuracy
Mean Average Precision (mAP)
Recall@K
Statistical methodology: Not explicitly reported in the paper
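
For readers unfamiliar with the retrieval metrics, the sketch below shows common textbook definitions of Recall@K and average precision for a single query (mAP is the mean of average precision over queries). Definitions vary between papers, and the segmentation mAP reported here may be computed differently, so treat this as a generic illustration with hypothetical function names.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear among the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values taken at the rank of each relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


# One query: items "a" and "c" are relevant, retrieved at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = ["a", "c"]
r1 = recall_at_k(ranked, relevant, 1)
ap = average_precision(ranked, relevant)
```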

Key Results

GRAFT outperforms supervised and self-supervised baselines on zero-shot classification and retrieval tasks.

Benchmark	Metric	Baseline	This Paper	Δ
MillionAID	Zero-shot Classification Accuracy	43.5	52.4	+8.9
FMoW-RGB	Zero-shot Classification Accuracy	20.1	24.5	+4.4
PatternNet	Zero-shot Classification Accuracy	57.7	68.3	+10.6

Pixel-level GRAFT combined with SAM shows massive improvements in zero-shot segmentation.

Benchmark	Metric	Baseline	This Paper	Δ
OSM Segmentation (NAIP)	mAP	7.2	13.2	+6.0
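
The absolute deltas in the table can be converted to relative gains to check them against the ">80% relative improvement" claim for segmentation. The short calculation below uses only the numbers reported above; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline, ours):
    """Relative improvement over the baseline, in percent."""
    return (ours - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the results table.
results = {
    "MillionAID": (43.5, 52.4),
    "FMoW-RGB": (20.1, 24.5),
    "PatternNet": (57.7, 68.3),
    "OSM Segmentation (NAIP)": (7.2, 13.2),
}

gains = {name: relative_gain(base, ours) for name, (base, ours) in results.items()}
segmentation_gain = round(gains["OSM Segmentation (NAIP)"], 1)  # about 83.3
```

The segmentation row works out to roughly an 83% relative gain (6.0 / 7.2), consistent with the headline figure.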

Experiment Figures

Qualitative results of GRAFT on classification, retrieval, segmentation, and VQA.

Main Takeaways

GRAFT effectively aligns satellite imagery with language without using any text annotations during training, solely via ground images.
The method scales well to different resolutions (1m NAIP and 10m Sentinel-2).
Pixel-level alignment enables precise localization, significantly boosting segmentation performance when coupled with SAM compared to using standard CLIP + SAM.
The approach generalizes to VQA tasks by plugging into the ViperGPT framework, allowing for reasoning about spatial features.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Contrastive Learning (CLIP)
Basic knowledge of Remote Sensing (satellite imagery, resolutions like NAIP/Sentinel-2)
Familiarity with Vision Transformers (ViT)

Key Terms

GRAFT: Ground Remote Alignment for Training—the proposed method of aligning satellite images to CLIP space via co-located ground images.

CLIP: Contrastive Language-Image Pre-training—a foundation model trained on internet image-text pairs that learns a shared embedding space for images and text.

ViT: Vision Transformer—a neural network architecture that processes images as sequences of patches using self-attention mechanisms.

NAIP: National Agriculture Imagery Program—high-resolution (1m/pixel) aerial imagery covering the continental United States.

Sentinel-2: A satellite mission providing lower-resolution (10m/pixel) global optical imagery.

SAM: Segment Anything Model—a foundational image segmentation model that can generate masks from point prompts.

VQA: Visual Question Answering—the task of answering natural language questions about the visual content of an image.

geotag: Metadata embedded in an image file indicating the precise latitude and longitude where the photo was taken.

ViperGPT: A framework that uses Large Language Models to generate code (programs) that call vision APIs to answer visual queries.

zero-shot: The ability of a model to perform a task (like classifying a 'stadium') without having seen explicit labeled examples of that specific class during training.