Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

📝 Paper Summary

Earth Observation (EO) Vision-Language Models (VLMs)

State-of-the-art VLMs like GPT-4V excel at high-level scene understanding and captioning of Earth observation data but fail systematically at fine-grained spatial tasks like counting, localization, and change detection.

Core Problem

It is unclear how well Large Vision-Language Models (VLMs) trained on natural images transfer to Earth Observation (EO) tasks, which involve unique aerial/satellite viewpoints and specialized domain requirements.

Why it matters:

Analyzing satellite imagery typically requires deep learning expertise or manual annotation, creating barriers for non-experts like disaster relief analysts
VLMs could democratize EO data access via natural language interfaces, but their actual reliability on geospatial tasks is unknown due to a lack of comprehensive benchmarks
Current benchmarks focus on natural images (e.g., MMMU, SEED-Bench), leaving a gap in understanding VLM performance on remote sensing data

Concrete Example: If an analyst asks a VLM to count damaged buildings in a disaster zone using 'before' and 'after' satellite images, GPT-4V might describe the damage qualitatively but fail to output an accurate count or bounding boxes, rendering it useless for quantitative damage assessment.

Key Novelty

Comprehensive EO Benchmark for VLMs

Constructs a multi-task benchmark spanning scene understanding, localization/counting, and change detection using diverse datasets (landmark recognition, RSICD, xBD, etc.)
Evaluates both closed-source (GPT-4V) and open-source (LLaVA, InstructBLIP) models on specialized geospatial tasks rather than generic visual reasoning

Architecture

Overview of the benchmark tasks categorized into Scene Understanding, Localization & Counting, and Change Detection.

Evaluation Highlights

GPT-4V achieves 67% accuracy on a new aerial landmark recognition task, significantly outperforming open-source models
On object counting tasks, GPT-4V fails significantly: R² = 0.08 for aerial animal detection and R² = 0.20 for tree counting
For change detection on disaster imagery, GPT-4V scores R² = 0.10 for counting destroyed buildings, showing a systematic failure to reason about temporal changes

Breakthrough Assessment

7/10

Provides a critical reality check for the application of general-purpose VLMs to scientific domains. While not proposing a new model, the benchmarking methodology and findings are valuable for the EO community.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot evaluation of instruction-following VLMs on Earth Observation tasks

Inputs: Remote sensing image(s) (satellite or aerial) + Natural language instruction/question

Outputs: Text response (caption, class label, count, or bounding box coordinates)

Pipeline Flow

Input: Satellite/Aerial Image + Text Prompt
VLM Inference (Zero-shot)
Output Parsing (Text to Label/Count/Bbox)
Evaluation against Ground Truth
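
The four steps above can be sketched for the counting task as follows. This is a minimal illustration, not the paper's code: `query_vlm`, `parse_count`, `run_counting_eval`, the sample fields, and the prompt wording are all hypothetical names chosen for this example, and the stub model simply returns a fixed sentence.

```python
import re
from typing import Callable, List, Optional, Tuple


def parse_count(response: str) -> Optional[int]:
    """Return the first integer found in a free-text response, or None."""
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None


def run_counting_eval(
    samples: List[dict],
    query_vlm: Callable[[str, str], str],  # hypothetical: (image_path, prompt) -> text
    prompt: str,
) -> List[Tuple[Optional[int], int]]:
    """Query the model once per image and pair each parsed count with ground truth."""
    pairs = []
    for sample in samples:
        response = query_vlm(sample["image_path"], prompt)  # zero-shot inference
        predicted = parse_count(response)                   # text -> count
        pairs.append((predicted, sample["true_count"]))     # kept for later scoring
    return pairs


def fake_vlm(image_path: str, prompt: str) -> str:
    """Stub standing in for a real model call (assumption for illustration)."""
    return "I can see approximately 12 trees in this image."


samples = [{"image_path": "tile_001.png", "true_count": 15}]
pairs = run_counting_eval(samples, fake_vlm, "How many trees are in this image?")
print(pairs)
```

Taking the first integer is a deliberately naive parsing rule. Responses that refuse to answer or give a range would need explicit handling before scoring.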

System Modules

VLM Inference

Process visual and textual input to generate a natural language response

Model or implementation: GPT-4V, LLaVA-v1.5, InstructBLIP, etc.

Modeling

Base Model: GPT-4V(ision) (primary subject of investigation)

Training Method: Zero-shot prompting only

Compute: Not reported in the paper

Comparison to Prior Work

vs. Specialist models: VLMs offer open-ended natural language interaction but lack fine-grained spatial precision
vs. Open-source VLMs: GPT-4V shows superior world knowledge (landmark recognition) and reasoning but shares similar spatial limitations

Limitations

Potential data contamination: Unclear if GPT-4V was pre-trained on evaluation datasets like RSICD or xBD
Limited error analysis: Lacks systematic categorization of failure modes (e.g., distinguishing perceptual vs. reasoning errors)
Static benchmark: Does not account for the rapid evolution of VLM capabilities or for new output modalities such as segmentation
Inability to test non-optical/multi-spectral data: Current VLMs only support standard RGB images

Reproducibility

Benchmark data and code promised on Hugging Face. Specific prompt templates are provided in the paper (Figures 4, 9, 10). Models evaluated are standard public or API-based models.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across 3 categories: Scene Understanding, Localization & Counting, Change Detection

Benchmarks:

Aerial Landmark Recognition (Scene Understanding (Classification)) [New]
RSICD (Image Captioning)
BigEarthNet / fMoW-WILDS / PatternNet (Land Use/Land Cover Classification)
DIOR-RSVG (Object Localization (Referring Expression Comprehension))
NEON-Tree / COWC / xBD / Aerial Animal Detection (Object Counting)
xBD (Change Detection)

Metrics:

Accuracy (Classification)
RefCLIPScore (Captioning)
F1 Score (Classification)
Mean IoU (Localization)
R-squared (Counting)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Scene understanding results highlight GPT-4V's strong world knowledge compared to open-source models.
Aerial Landmark Recognition	Accuracy	0.40	0.67	+0.27
RSICD	RefCLIPScore	0.79	0.75	-0.04
Localization and counting results demonstrate significant limitations in spatial reasoning for VLMs.
DIOR-RSVG	Mean IoU	0.68	0.16	-0.52
NEON-Tree	R-squared	1.0	0.20	-0.80
Aerial Animal Detection	R-squared	1.0	0.08	-0.92
xBD (Building Counting)	R-squared	1.0	0.68	-0.32
Change detection results show GPT-4V cannot reliably reason about differences between two images for damage assessment.
xBD (Destroyed Buildings)	R-squared	1.0	0.10	-0.90
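
The Δ column is the "This Paper" value minus the "Baseline" value. A short sanity check, with the numbers copied from the rows above and hypothetical variable names, recomputes each difference and lists any row whose reported Δ does not match:

```python
# (benchmark, baseline, this_paper, reported_delta), copied from the table above
rows = [
    ("Aerial Landmark Recognition", 0.40, 0.67, 0.27),
    ("RSICD", 0.79, 0.75, -0.04),
    ("DIOR-RSVG", 0.68, 0.16, -0.52),
    ("NEON-Tree", 1.0, 0.20, -0.80),
    ("Aerial Animal Detection", 1.0, 0.08, -0.92),
    ("xBD (Building Counting)", 1.0, 0.68, -0.32),
    ("xBD (Destroyed Buildings)", 1.0, 0.10, -0.90),
]


def recompute_deltas(table_rows):
    """Return {benchmark: this_paper - baseline}, rounded to two decimals."""
    return {
        name: round(this_paper - baseline, 2)
        for name, baseline, this_paper, _ in table_rows
    }


deltas = recompute_deltas(rows)

# Rows whose reported delta disagrees with the recomputed one
mismatches = [
    name for name, _, _, reported in rows
    if abs(deltas[name] - reported) > 1e-9
]
print(mismatches)
```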

Experiment Figures

Map of the US showing GPT-4V's landmark recognition accuracy by state.

Qualitative comparison of image captions generated by different VLMs vs. human ground truth.

Main Takeaways

GPT-4V possesses strong high-level world knowledge, enabling it to recognize landmarks and describe scenes in detail that often exceeds that of human annotators.
Current VLMs are essentially 'blind' to fine-grained spatial tasks; they struggle significantly with counting small objects, localizing with bounding boxes, and detecting changes between images.
Performance is highly sensitive to object size and label ambiguity; VLMs perform better on high-resolution images with clear categories (PatternNet) than on lower-resolution or ambiguous tasks (fMoW-WILDS).
Qualitative inspection reveals that correct answers sometimes stem from incorrect or generic reasoning, and conversely, good reasoning can sometimes be misled by visual artifacts (e.g., off-nadir angles).

📚 Prerequisite Knowledge

Prerequisites

Familiarity with remote sensing/Earth observation data types (satellite imagery)
Understanding of Vision-Language Models (VLMs) and prompting
Basic knowledge of standard CV metrics (IoU, F1 score, R-squared)

Key Terms

EO: Earth Observation—collecting data about Earth's physical, chemical, and biological systems via remote sensing technologies like satellites

VLM: Vision-Language Model—AI models that can process and reason about both images and text

IoU: Intersection over Union, the area of overlap between a predicted and a ground-truth bounding box divided by the area of their union (1.0 is a perfect match, 0 means no overlap)
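
As a small illustration of the definition, the sketch below computes IoU for axis-aligned boxes and averages it over a set of predictions. The `(x_min, y_min, x_max, y_max)` box format and the function names are assumptions made for this example; the paper's exact coordinate convention is not stated here.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted_boxes, target_boxes):
    """Average IoU over paired predicted and ground-truth boxes."""
    scores = [box_iou(p, t) for p, t in zip(predicted_boxes, target_boxes)]
    return sum(scores) / len(scores)


# Two 2x2 boxes offset by one unit: intersection 1, union 7
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```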

R²: Coefficient of determination, a statistical measure of how well predictions approximate the true values (1.0 is a perfect fit)
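
A minimal sketch of the metric, assuming the standard definition of one minus the ratio of the residual sum of squares to the total sum of squares. The paper may compute it differently (for example, as a squared correlation), so treat this as illustrative only.

```python
def r_squared(true_counts, predicted_counts):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(true_counts) / len(true_counts)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(true_counts, predicted_counts))
    # Squared error of always predicting the mean of the true values
    ss_tot = sum((t - mean_true) ** 2 for t in true_counts)
    return 1.0 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r_squared([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

Under this definition, perfect predictions score 1.0 and a model that always outputs the mean true count scores 0, which gives a sense of how weak values such as 0.08 or 0.10 are.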

RefCLIPScore: A metric measuring image-caption alignment and caption-reference similarity using the CLIP model
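
The sketch below shows how the two ingredients combine, assuming the formulation from the original CLIPScore work: a rescaled image-caption cosine similarity (weight 2.5) and the best caption-reference cosine similarity, merged by a harmonic mean. The embeddings are taken as precomputed vectors; producing them with a CLIP model is outside this example, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of image-caption alignment and best reference similarity."""
    # Image-caption alignment, clipped at zero and rescaled by w
    clip_s = w * max(cosine(caption_emb, image_emb), 0.0)
    # Similarity to the closest human reference caption, clipped at zero
    ref_sim = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_sim == 0.0:
        return 0.0
    return 2 * clip_s * ref_sim / (clip_s + ref_sim)


# Toy 2-D embeddings, for illustration only
score = ref_clip_score([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(score)
```

Because of the harmonic mean, a caption must both match the image and resemble at least one reference to score well. A detailed caption that diverges from terse human references is penalized on the second term.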

NAIP: National Agriculture Imagery Program—a USDA program that acquires aerial imagery during the agricultural growing season

REC: Referring Expression Comprehension—the task of localizing a specific object in an image given a natural language description

OCR: Optical Character Recognition—electronic conversion of images of typed, handwritten or printed text into machine-encoded text