SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing

📝 Paper Summary

Remote Sensing Vision-Language Models (VLMs)

SkyScript is a large-scale dataset created by automatically linking satellite imagery with OpenStreetMap tags via geo-coordinates, enabling a remote sensing VLM that outperforms baselines in zero-shot classification.

Core Problem

Remote sensing lacks large, semantically diverse image-text datasets because satellite imagery cannot be crawled from the web like natural images, and manual annotation is expert-intensive and limited in scale.

Why it matters:

Existing remote sensing datasets are small (<1 million images) and cover few classes (<150), limiting the development of versatile foundation models
Standard web-crawling methods (used for LAION) fail because satellite images are proprietary and rarely surrounded by relevant descriptive text on the internet
The lack of diverse training data hinders the application of VLMs to critical tasks like climate change mitigation and infrastructure monitoring

Concrete Example: A 'power pole' might be identifiable in a 0.1m resolution image but not in a 10m image; existing datasets lack the fine-grained metadata to distinguish visually groundable attributes from non-visual ones, leading to noisy or sparse training signals.

Key Novelty

Geo-Coordinate Data Mining from OpenStreetMap (OSM)

Constructs image-text pairs by matching open satellite imagery (Google Earth Engine) with crowdsourced geographic tags (OSM) based on exact location coordinates rather than web text crawling
Implements a 'visual groundability' filter that uses a logistic regression model on CLIP embeddings to determine if a semantic tag (e.g., 'road surface: asphalt') is visible at the image's specific resolution
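
The coordinate-based pairing described above can be illustrated with a minimal sketch. It assumes an image tile is described by a longitude/latitude bounding box and that OSM objects are simple points; the `OsmNode` class, the `tags_in_tile` helper, and the sample coordinates are hypothetical and are not taken from the SkyScript code, whose actual object selection may differ.

```python
from dataclasses import dataclass


@dataclass
class OsmNode:
    """Hypothetical stand-in for an OSM point object with key-value tags."""
    lon: float
    lat: float
    tags: dict


def tags_in_tile(bbox, nodes):
    """Collect tags of nodes whose coordinates fall inside the image tile.

    bbox is (min_lon, min_lat, max_lon, max_lat) in degrees.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    matched = []
    for node in nodes:
        inside = min_lon <= node.lon <= max_lon and min_lat <= node.lat <= max_lat
        if inside:
            matched.extend(f"{key}={value}" for key, value in node.tags.items())
    # De-duplicate and sort so the result is deterministic.
    return sorted(set(matched))


nodes = [
    OsmNode(-122.170, 37.428, {"building": "university"}),
    OsmNode(-122.169, 37.429, {"highway": "residential", "surface": "asphalt"}),
    OsmNode(-121.900, 37.300, {"natural": "water"}),  # lies outside the tile
]
tile_bbox = (-122.175, 37.425, -122.165, 37.430)
tile_tags = tags_in_tile(tile_bbox, nodes)
print(tile_tags)
```

Because the match relies only on location, no surrounding web text is needed; the tags attached to the tile become the raw material for its caption.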

Architecture

The data construction pipeline: linking GEE images with OSM tags via geo-coordinates

Evaluation Highlights

Achieves +6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets compared to baseline CLIP models
Constructs a dataset of 2.6 million filtered image-text pairs covering 29,000 distinct semantic tags (two orders of magnitude richer than prior datasets)
Demonstrates zero-shot transfer capabilities for fine-grained object attribute classification (e.g., road surface materials) and cross-modal retrieval

Breakthrough Assessment

8/10

Significantly addresses the data scarcity bottleneck in remote sensing by proposing a scalable, automated pipeline that generates 2.6M pairs with rich semantics, unlocking effective zero-shot capabilities.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot transfer learning for remote sensing tasks using a Vision-Language Model pre-trained on noisy image-text pairs

Inputs: Remote sensing image I

Outputs: Predicted semantic label or retrieved text description T

Pipeline Flow

Image Encoder (extracts visual features)
Text Encoder (extracts semantic features from captions)
Contrastive Alignment (computes similarity)
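
As a rough illustration of how this flow yields a zero-shot prediction, the sketch below scores one image embedding against one text embedding per candidate label and picks the most similar label. The three-dimensional vectors are made-up placeholders for encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any released API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, text_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    scores = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Placeholder embeddings; a real system would obtain these from the two encoders.
text_embs = {
    "airport": [0.9, 0.1, 0.0],
    "forest": [0.0, 0.2, 0.9],
    "harbor": [0.3, 0.9, 0.1],
}
image_emb = [0.1, 0.3, 0.8]

label, scores = zero_shot_classify(image_emb, text_embs)
print(label)  # forest
```

No classifier head is trained for the new label set: changing the task only means changing the list of text prompts.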

System Modules

Image Encoder (Input Processing)

Encode remote sensing images into a high-dimensional vector space

Model or implementation: ViT-L/14 (CLIP variant)

Text Encoder (Input Processing)

Encode text captions (derived from OSM tags) into the same vector space

Model or implementation: Transformer (CLIP text encoder)

Modeling

Base Model: CLIP (ViT-L/14)

Training Method: Continual pre-training (contrastive learning)

Objective Functions:

Purpose: Align image and text representations.

Formally: Contrastive loss maximizing cosine similarity between correct image-text pairs while minimizing it for incorrect pairs.
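
The summary does not give the exact loss used in the paper. Since the model is continually pre-trained from CLIP, a reasonable assumption is the standard symmetric CLIP contrastive (InfoNCE) objective, written here for a batch of N pairs with L2-normalized image embeddings u_i, text embeddings v_i, and temperature τ (all symbols are introduced here for illustration):

```latex
\mathcal{L}
= -\frac{1}{2N} \sum_{i=1}^{N} \left[
    \log \frac{\exp\left(u_i^{\top} v_i / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(u_i^{\top} v_j / \tau\right)}
  + \log \frac{\exp\left(u_i^{\top} v_i / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(u_j^{\top} v_i / \tau\right)}
\right]
```

The first term treats each image as a query over all texts in the batch, the second treats each text as a query over all images; with normalized embeddings, u_i^T v_j is the cosine similarity referred to above.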

Training Data:

Images sourced from Google Earth Engine (collections: NAIP, USDA, USGS, SkySat, Sentinel-2)
Semantics sourced from OpenStreetMap (OSM) tags
Tag filtering: Logistic regression on CLIP embeddings to predict if a tag is visually groundable and at what GSD
Caption generation: Rule-based assembly of keys and values (e.g., 'key of value')
Noise filtering: Retain top 50% pairs based on CLIP cosine similarity
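
The last two steps can be sketched as follows. The template string mirrors the 'key of value' pattern quoted above; joining several tags with commas, replacing underscores with spaces, and the similarity scores themselves are assumptions made for illustration (real scores would come from a CLIP model).

```python
def make_caption(tags, template="{key} of {value}"):
    """Assemble a caption from (key, value) tag pairs using a fixed template."""
    parts = [
        template.format(key=key.replace("_", " "), value=value.replace("_", " "))
        for key, value in tags
    ]
    return ", ".join(parts)


def keep_top_fraction(pairs, fraction=0.5):
    """Keep the highest-scoring fraction of (caption, similarity) pairs."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    n_keep = int(len(ranked) * fraction)
    return ranked[:n_keep]


caption = make_caption([("highway", "residential"), ("surface", "asphalt")])
print(caption)  # highway of residential, surface of asphalt

# Made-up similarity scores standing in for CLIP image-text cosine similarity.
candidates = [
    ("surface of asphalt", 0.31),
    ("building of house", 0.22),
    ("landuse of farmland", 0.35),
    ("amenity of parking", 0.18),
]
kept = keep_top_fraction(candidates, fraction=0.5)
print(kept)
```

Ranking by similarity and cutting at the median keeps the filter free of a hand-tuned absolute threshold, at the cost of discarding half the candidates regardless of their absolute quality.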

Key Hyperparameters:

noise_filtering_threshold: Top 50%
GSD_range: 0.1 m to 30 m
tag_accuracy_sample: 96.1%

Compute: Not reported in the paper

Comparison to Prior Work

vs. RemoteCLIP: SkyScript uses wild, open OSM tags (29K classes) rather than captions derived from limited existing datasets (<150 classes)
vs. LAION-RS (subset): SkyScript uses geo-coordinates for reliable matching, whereas web-crawled RS data is sparse (only 0.03% of LAION-2B) and often lacks relevant text
vs. SeCo [not cited in paper]: SkyScript uses image-text pairs for VLM training, whereas SeCo uses temporal/location contrastive learning on images only

Limitations

High-resolution images (GSD < 1 m) are concentrated in the U.S. and Europe due to data availability, leading to geographic bias
OSM annotations are less complete in developing countries, underrepresenting those regions
Captions are generated via simple rule-based templates, which may lack natural language diversity compared to human-written text

Reproducibility

Code: https://github.com/wangzhecheng/SkyScript

The code is publicly available, and the dataset and pre-trained models are released. Specific training hyperparameters (learning rate, batch size) for the continual pre-training are not explicitly detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Zero-shot transfer across multiple downstream tasks

Benchmarks:

SkyScript-retrieval (Cross-modal retrieval) [New]
SkyScript-classification (Fine-grained classification) [New]
7 Benchmark Datasets (e.g., UCM, AID, RESISC45) (Scene Classification)

Metrics:

Zero-shot Classification Accuracy
Recall@K
Statistical methodology: Not explicitly reported in the paper
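
For readers unfamiliar with Recall@K, the sketch below shows one common way to compute it for retrieval with a single correct item per query. The function name, the candidate IDs, and the rankings are illustrative; the paper's exact evaluation protocol is not described in this summary.

```python
def recall_at_k(rankings, true_ids, k):
    """Fraction of queries whose correct item appears among the top-k results.

    rankings[i] is the list of candidate IDs for query i, best match first;
    true_ids[i] is the single correct ID for query i.
    """
    hits = sum(
        1 for ranked, truth in zip(rankings, true_ids) if truth in ranked[:k]
    )
    return hits / len(true_ids)


# Three image queries, each ranking the same three candidate captions.
rankings = [
    ["t3", "t1", "t2"],
    ["t2", "t1", "t3"],
    ["t1", "t3", "t2"],
]
true_ids = ["t1", "t2", "t2"]

r1 = recall_at_k(rankings, true_ids, 1)  # only the second query hits
r2 = recall_at_k(rankings, true_ids, 2)  # first and second queries hit
r3 = recall_at_k(rankings, true_ids, 3)  # every query hits
print(r1, r2, r3)
```

Recall@K is non-decreasing in K, so reporting several values of K shows how quickly the correct match surfaces in the ranking.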

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
7 Benchmark Datasets (Average)	Accuracy gain	Not reported in the paper	Not reported in the paper	+6.2%
SkyScript	Image-Text Pairs	166,000	2,600,000	+2,434,000
SkyScript	Semantic Tags	150	29,000	+28,850

Experiment Figures

Semantic diversity and geographic coverage of SkyScript

Main Takeaways

SkyScript successfully scales up remote sensing data to 2.6M pairs by leveraging geo-coordinates, overcoming the limitations of web crawling for this domain
The dataset covers a vastly larger semantic space (29K tags) than previous datasets (<150 classes), enabling fine-grained attribute recognition (e.g., road surfaces)
Continual pre-training on SkyScript yields significant zero-shot gains (+6.2%), demonstrating the quality of the automated image-text pairing pipeline

📚 Prerequisite Knowledge

Prerequisites

Remote Sensing (satellite imagery characteristics)
Vision-Language Models (CLIP architecture)
Geospatial Data (OpenStreetMap structure)

Key Terms

GSD: Ground Sampling Distance—the distance between pixel centers measured on the ground (e.g., 0.1m/pixel means each pixel represents 10cm), determining image resolution
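
A small worked example of the arithmetic behind GSD, using the two resolutions from the power-pole example earlier in this summary. The 512-pixel tile size and the 0.3 m pole width are illustrative assumptions, not values from the paper.

```python
def ground_extent_m(pixels, gsd_m):
    """Ground distance covered by a run of pixels at a given GSD (metres/pixel)."""
    return pixels * gsd_m


def object_size_px(object_size_m, gsd_m):
    """Approximate number of pixels an object of a given size spans."""
    return object_size_m / gsd_m


TILE_PX = 512        # assumed tile width in pixels
POLE_WIDTH_M = 0.3   # assumed width of a power pole in metres

for gsd in (0.1, 10.0):
    extent = ground_extent_m(TILE_PX, gsd)
    pole_px = object_size_px(POLE_WIDTH_M, gsd)
    print(f"GSD {gsd} m: tile covers {extent:.1f} m, pole spans {pole_px:.2f} px")
```

At 0.1 m/pixel the pole spans about three pixels and the tile covers roughly 51 m; at 10 m/pixel the same tile covers about 5 km and the pole occupies a small fraction of one pixel, which is why the same tag can be usable at one resolution and not at another.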

OSM: OpenStreetMap—an open, crowdsourced geographic database where map features are described by 'tags' (key-value pairs like 'building=residential')

GEE: Google Earth Engine—a cloud computing platform for processing satellite imagery and other earth observation data

CLIP: Contrastive Language-Image Pre-training—a model trained to align images and text in a shared embedding space, enabling zero-shot classification

VLM: Vision-Language Model—a model that processes and relates both image and text inputs

Visual Groundability: The property of a semantic concept being visually identifiable in an image at a specific resolution (e.g., a 'stream' is visible, a 'house number' is not)
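
To make the idea concrete, the sketch below filters tags by image resolution. It deliberately swaps the paper's learned classifier (logistic regression on CLIP embeddings) for a hand-written lookup table, and every threshold in `MAX_GSD_M` is a made-up value chosen only to show the mechanism.

```python
# Hypothetical table: coarsest GSD (metres/pixel) at which each tag is assumed
# to remain visible. None marks a tag treated as never visually groundable.
MAX_GSD_M = {
    "power=pole": 0.3,
    "surface=asphalt": 1.0,
    "waterway=stream": 5.0,
    "landuse=farmland": 30.0,
    "addr:housenumber=12": None,
}


def groundable_tags(tags, image_gsd_m, max_gsd=MAX_GSD_M):
    """Keep tags assumed to be visible at the image's resolution.

    Tags missing from the table are dropped, the conservative choice.
    """
    kept = []
    for tag in tags:
        limit = max_gsd.get(tag)
        if limit is not None and image_gsd_m <= limit:
            kept.append(tag)
    return kept


all_tags = list(MAX_GSD_M)
fine = groundable_tags(all_tags, image_gsd_m=0.1)
coarse = groundable_tags(all_tags, image_gsd_m=10.0)
print(fine)    # every tag except the house number
print(coarse)  # only the farmland tag
```

A learned model replaces this table with predictions for arbitrary tags, which matters when the tag vocabulary runs to tens of thousands of entries.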

Zero-shot transfer: The ability of a model to perform a task (like classifying a new scene type) without having been explicitly trained on examples of that specific task