GeoChat: Grounded Large Vision-Language Model for Remote Sensing

📝 Paper Summary

Remote Sensing Vision-Language Models Multimodal Instruction Tuning

GeoChat adapts LLaVA-1.5 for remote sensing by fine-tuning on a newly created 318k multimodal instruction dataset, enabling unified object grounding, region-specific dialogue, and scene classification.

Core Problem

General-domain VLMs perform poorly on remote sensing imagery due to unique challenges like high resolution, diverse scales, and small objects, often hallucinating or failing to ground responses visually.

Why it matters:

Standard VLMs return inaccurate or fabricated information when queried about RS images.
Existing RS methods (like classification-based VQA) lack open-ended conversation and instruction-following capabilities.
Lack of domain-specific multimodal instruction data prevents models from aligning with user queries about satellite imagery.

Concrete Example: When asked 'How many tennis courts are visible?', a general VLM might miss small objects or hallucinate, whereas GeoChat correctly identifies and grounds '10 tennis courts' by leveraging high-resolution inputs and region-specific training.

Key Novelty

Unified Grounded Remote Sensing VLM

Extends the LLaVA architecture with task-specific tokens (e.g., [grounding], [identify]) to switch between grounding, captioning, and conversation modes.
Interpolates positional encodings to handle higher-resolution images (504x504), essential for detecting small objects in satellite imagery.
Generates a massive domain-specific instruction dataset (318k pairs) by repurposing existing detection, classification, and VQA datasets into conversation formats.

Architecture

Overview of the GeoChat architecture and inference flow. It shows the Image Encoder, MLP Adaptor, and LLM components.

Evaluation Highlights

Matches the RS-specialized RSGPT on rural/urban classification on the RSVQA-LRBEN dataset (94.00% vs 94.00%) and stays competitive on average accuracy (90.70% vs 92.29%).
Achieves robust zero-shot scene classification accuracy (84.43% on UCMerced), significantly outperforming general-domain VLMs like LLaVA-1.5 (68.00%).
Demonstrates superior region-level captioning capabilities compared to MiniGPT-v2, achieving a METEOR score of 83.9 vs 10.0.

Breakthrough Assessment

8/10

First VLM to unify conversation and visual grounding specifically for remote sensing. The creation of a large-scale instruction dataset fills a major gap, though the architecture is a direct adaptation of LLaVA.

⚙️ Technical Details

Problem Definition

Setting: Multitask multimodal conversation where the model receives an image x, a text query q, and optionally region boxes b or task tokens t.

Inputs: Remote sensing image (high res), natural language query, optional region coordinates (bounding boxes), task-specific tokens

Outputs: Natural language response interleaved with spatial coordinates (visual grounding) or classification labels
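The input side of this setting can be sketched as a small data structure plus a prompt builder. This is a minimal illustration, not the authors' code: the `RSQuery` class, the `build_prompt` function, and the `{<x1><y1><x2><y2>}` box syntax are hypothetical; only the task token names come from the summary.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Task tokens named in the summary; the prompt layout itself is an assumption.
TASK_TOKENS = {
    "grounding": "[grounding]",
    "identify": "[identify]",
    "refer": "[refer]",
}


@dataclass
class RSQuery:
    """One model input: image x, query q, optional task token t and region box b."""
    image_path: str
    question: str
    task: Optional[str] = None
    region: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2), hypothetical


def build_prompt(query: RSQuery) -> str:
    """Concatenate task token, question, and region box into one text prompt."""
    parts = []
    if query.task is not None:
        parts.append(TASK_TOKENS[query.task])
    parts.append(query.question)
    if query.region is not None:
        x1, y1, x2, y2 = query.region
        parts.append(f"{{<{x1}><{y1}><{x2}><{y2}>}}")
    return " ".join(parts)


print(build_prompt(RSQuery("scene.png", "Where is the airplane?", task="grounding")))
```

Keeping the task token and the region box optional mirrors the problem definition, where a plain conversational query carries neither.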

Pipeline Flow

Image Input -> Visual Encoder (CLIP-ViT) -> Feature Upscaling (Positional Interpolation)
Visual Features -> MLP Adaptor -> Language Embedding Space
Text Input (Query + Task Tokens + Region Box) -> Tokenizer -> LLM
LLM (Vicuna) -> Generates Response (Text + Box Coordinates)
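The last step of the flow yields text interleaved with box coordinates, which downstream code has to separate. A minimal sketch of that post-processing follows, assuming a hypothetical `{<x1><y1><x2><y2>}` integer box syntax; the summary does not specify the actual serialization, and `split_response` is an illustrative name.

```python
import re
from typing import List, Tuple

# Hypothetical box syntax: {<x1><y1><x2><y2>} with integer coordinates.
BOX_PATTERN = re.compile(r"\{<(\d+)><(\d+)><(\d+)><(\d+)>\}")


def split_response(response: str) -> Tuple[str, List[Tuple[int, ...]]]:
    """Separate a grounded response into plain text and a list of boxes."""
    boxes = [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(response)]
    text = BOX_PATTERN.sub("", response)
    text = " ".join(text.split())  # collapse whitespace left behind by removed boxes
    return text, boxes


text, boxes = split_response(
    "Two tennis courts {<10><20><30><40>} {<50><60><70><80>} near the road."
)
print(text)
print(boxes)
```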

System Modules

Visual Backbone (Input Processing)

Encodes visual image data into feature representations

Model or implementation: CLIP-ViT(L-14) (frozen)

MLP Adaptor (Input Processing)

Projects visual tokens into the language model's embedding space

Model or implementation: Two-layer MLP (frozen)
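To illustrate what a two-layer projection does, here is a toy sketch in plain Python. The GELU activation is an assumption carried over from the LLaVA-1.5 convention, the dimensions are shrunk for readability (the real widths would be those of the vision encoder and the LLM), and the weights are random rather than pretrained.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weight, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def make_layer(rng, d_in, d_out):
    """Random (weight, bias) pair; stands in for pretrained parameters."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def project_tokens(tokens, layer1, layer2):
    """Map each visual token into the language embedding space."""
    projected = []
    for token in tokens:
        hidden = [gelu(h) for h in linear(token, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16  # toy sizes, not the real model widths
layer1 = make_layer(rng, VISION_DIM, LLM_DIM)
layer2 = make_layer(rng, LLM_DIM, LLM_DIM)
visual_tokens = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]
projected = project_tokens(visual_tokens, layer1, layer2)
print(len(projected), len(projected[0]))  # 4 16
```

The token count is unchanged by the adaptor; only the per-token width changes, so the LLM can consume visual tokens alongside text embeddings.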

Large Language Model

Processes multimodal inputs and generates grounded text responses

Model or implementation: Vicuna-v1.5 (7B) with LoRA adapters

Novel Architectural Elements

Integration of task-specific tokens ([grounding], [identify], [refer]) to explicitly switch reasoning modes within a single unified model
High-resolution positional interpolation (504x504) specifically adopted for small object detection in RS imagery within a LLaVA-like architecture
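The resolution change can be made concrete with a little arithmetic. With 14-pixel patches, a 336x336 input gives a 24x24 grid of positions and a 504x504 input gives 36x36, so the learned positional grid has to be resampled. The sketch below uses bilinear resampling on a single channel; the interpolation mode is an assumption (the summary does not name one), and `grid_side` and `resize_grid` are illustrative names.

```python
def grid_side(image_size: int, patch_size: int = 14) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    return image_size // patch_size


def resize_grid(grid, new_side):
    """Bilinearly resample a square grid of scalars to new_side x new_side.

    Uses the align-corners convention, so corner values are preserved.
    A real positional embedding has one such grid per embedding channel.
    """
    old_side = len(grid)
    if old_side < 2 or new_side < 2:
        raise ValueError("both grids need at least 2 points per side")
    scale = (old_side - 1) / (new_side - 1)

    def locate(index):
        # Lower neighbour and fractional offset in the old grid.
        pos = index * scale
        lower = min(int(pos), old_side - 2)
        return lower, pos - lower

    out = []
    for i in range(new_side):
        y0, fy = locate(i)
        row = []
        for j in range(new_side):
            x0, fx = locate(j)
            top = (1 - fx) * grid[y0][x0] + fx * grid[y0][x0 + 1]
            bottom = (1 - fx) * grid[y0 + 1][x0] + fx * grid[y0 + 1][x0 + 1]
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out


old = grid_side(336)  # 24
new = grid_side(504)  # 36
ramp = [[float(r + c) for c in range(old)] for r in range(old)]  # stand-in channel
resized = resize_grid(ramp, new)
print(old * old, "->", new * new)  # 576 -> 1296
```

The visual token count grows from 576 to 1296, which is the cost paid for keeping small objects resolvable.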

Modeling

Base Model: Vicuna-v1.5 (7B) coupled with CLIP-ViT-L-14

Training Method: Supervised Fine-Tuning with LoRA

Adaptation: LoRA (rank=64) on LLM weights (Wq, Wv)

Trainable Parameters: LoRA parameters only (Visual Encoder and MLP Adaptor frozen)
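A rough count shows how small the trainable set is. The sketch assumes 32 transformer layers and a hidden size of 4096 for the 7B model (standard LLaMA-7B figures, not stated in this summary), with rank-64 adapters on the query and value projections only.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen d_out x d_in weight."""
    return rank * (d_in + d_out)


HIDDEN = 4096   # assumed hidden size of the 7B model
LAYERS = 32     # assumed number of transformer layers
RANK = 64       # from the summary
TARGETS = ("Wq", "Wv")

per_matrix = lora_param_count(HIDDEN, HIDDEN, RANK)
trainable = per_matrix * len(TARGETS) * LAYERS
frozen_targets = HIDDEN * HIDDEN * len(TARGETS) * LAYERS

print(per_matrix)                  # 524288
print(trainable)                   # 33554432
print(trainable / frozen_targets)  # 0.03125
```

Under these assumptions the adapters add about 33.6M trainable parameters, roughly 3% of the size of the projection matrices they modify.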

Training Data:

318k total instruction pairs generated from existing datasets
Includes 65k multi-round conversations, 56k VQA, 31.5k scene classification, 45k grounding descriptions, and 40k region captioning pairs (237.5k of the 318k total; the remaining subsets are not itemized here)
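To show what repurposing a detection dataset into instruction pairs can look like, here is a template-based sketch. It is a simplified stand-in: the question templates, the annotation dictionary layout, and the box syntax are all hypothetical, and a pipeline that uses an LLM to write conversations would produce far more varied text.

```python
from collections import Counter


def detection_to_instructions(annotations):
    """Turn per-image detection annotations into counting and grounding pairs."""
    counts = Counter(a["category"] for a in annotations)
    pairs = []
    for category, n in sorted(counts.items()):
        # Counting question derived from the number of boxes of this class.
        pairs.append({
            "question": f"How many objects of class '{category}' are visible?",
            "answer": str(n),
        })
        # Grounding question whose answer lists every box of this class.
        boxes = [a["box"] for a in annotations if a["category"] == category]
        box_text = " ".join("{" + "".join(f"<{v}>" for v in box) + "}" for box in boxes)
        pairs.append({
            "question": f"[grounding] Locate every object of class '{category}'.",
            "answer": box_text,
        })
    return pairs


example = [
    {"category": "tennis court", "box": (10, 20, 30, 40)},
    {"category": "tennis court", "box": (50, 60, 70, 80)},
    {"category": "ship", "box": (5, 5, 15, 25)},
]
for pair in detection_to_instructions(example):
    print(pair["question"], "->", pair["answer"])
```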

Key Hyperparameters:

learning_rate: cosine schedule with the AdamW optimizer (peak value not explicitly specified)
batch_size: 144 (global)
image_resolution: 504x504
lora_rank: 64
epochs: 1 epoch on full data, followed by 1600 steps on grounding data
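These settings can be turned into a rough schedule. The sketch below estimates the step count for one epoch from the stated data size and global batch size, then evaluates a plain cosine decay. The peak learning rate is a placeholder because the summary does not report one, and the absence of warmup is an assumption.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


NUM_PAIRS = 318_000     # "318k" from the summary, so the step count is approximate
GLOBAL_BATCH = 144      # from the summary
PEAK_LR = 1e-4          # placeholder value, not reported in the summary

steps_per_epoch = math.ceil(NUM_PAIRS / GLOBAL_BATCH)
print(steps_per_epoch)  # 2209

for step in (0, steps_per_epoch // 2, steps_per_epoch):
    print(step, cosine_lr(step, steps_per_epoch, PEAK_LR))
```

At a global batch of 144, one pass over the data is on the order of 2,200 optimizer steps, so the extra 1,600 grounding steps are a substantial second stage rather than a brief touch-up.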

Compute: Not reported in the paper

Comparison to Prior Work

vs. RSGPT: GeoChat is a unified model for all tasks (no per-task fine-tuning needed) and supports visual grounding and region inputs.
vs. MiniGPT-v2: GeoChat is fine-tuned on RS-specific data, significantly reducing hallucinations and improving small object detection.
vs. LLaVA-1.5: GeoChat handles higher resolution (504px vs 336px) and includes grounding/region capabilities not present in standard LLaVA.

Limitations

Performance on small object grounding is still low compared to larger objects.
The dataset relies on automated generation via Vicuna, which may introduce noise or biases.
Relies on a single image resolution (504x504), which may still be insufficient for very large satellite composites.

Reproducibility

Code: https://github.com/mbzuai-oryx/GeoChat

The dataset generation pipeline uses existing public datasets (DOTA, DIOR, FAIR1M, etc.) and Vicuna-v1.5 for text generation. Training uses standard LoRA procedures.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on Scene Classification and VQA; Supervised evaluation on Grounding/Referring tasks using a held-out test set.

Benchmarks:

UCMerced (Scene Classification)
AID (Scene Classification)
RSVQA-LRBEN (Visual Question Answering)
RSVQA-HRBEN (Visual Question Answering)
GeoChat Benchmark (Visual Grounding & Region Captioning) [New]

Metrics:

Accuracy (Top-1)
Accuracy@0.5 (IoU > 0.5)
ROUGE-L
METEOR
Statistical methodology: Not explicitly reported in the paper
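The Accuracy@0.5 metric listed above counts a prediction as correct when its intersection-over-union with the reference box exceeds 0.5. A minimal sketch for axis-aligned boxes follows; treating boxes as axis-aligned `(x1, y1, x2, y2)` tuples and pairing one prediction with one target are simplifying assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the paired target exceeds the threshold."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must be paired one-to-one")
    hits = sum(iou(p, t) > threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
refs = [(0, 0, 10, 10), (20, 20, 30, 30)]
print(accuracy_at_iou(preds, refs))  # 0.5
```

Because IoU is normalized by box area, a few pixels of offset that barely matter for a large building can push a small vehicle below the threshold.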

Key Results

Zero-shot scene classification demonstrates GeoChat's superior domain adaptation compared to general-purpose VLMs.

Benchmark	Metric	Baseline	This Paper	Δ
UCMerced	Accuracy	68.00	84.43	+16.43
AID	Accuracy	51.00	72.03	+21.03

VQA results show GeoChat competes with specialist models while remaining a generalist.

Benchmark	Metric	Baseline	This Paper	Δ
RSVQA-LRBEN	Average Accuracy	92.29	90.70	-1.59
RSVQA-HRBEN (Test set 2)	Average Accuracy	68.40	72.30	+3.90

Grounding and region captioning results highlight the model's spatial reasoning capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
GeoChat Benchmark	METEOR	10.0	83.9	+73.9
GeoChat Benchmark	Accuracy@0.5	9.1	16.0	+6.9
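The Δ column is the reported score minus the baseline, in percentage points (or METEOR points). A short sketch that recomputes it from the values in the table, using `Decimal` to avoid floating-point noise in the differences:

```python
from decimal import Decimal

# (benchmark, metric, baseline, this paper), copied from the results table.
ROWS = [
    ("UCMerced", "Accuracy", "68.00", "84.43"),
    ("AID", "Accuracy", "51.00", "72.03"),
    ("RSVQA-LRBEN", "Average Accuracy", "92.29", "90.70"),
    ("RSVQA-HRBEN (Test set 2)", "Average Accuracy", "68.40", "72.30"),
    ("GeoChat Benchmark", "METEOR", "10.0", "83.9"),
    ("GeoChat Benchmark", "Accuracy@0.5", "9.1", "16.0"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Difference between the reported score and the baseline."""
    return Decimal(ours) - Decimal(baseline)


for benchmark, metric, baseline, ours in ROWS:
    print(f"{benchmark}\t{metric}\t{delta(baseline, ours):+}")
```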

Experiment Figures

Qualitative results of GeoChat on grounding, referring object detection, and damage detection.

Main Takeaways

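Domain-specific instruction tuning is crucial for remote sensing: general VLMs such as LLaVA-1.5 and MiniGPT-v2 trail GeoChat by wide margins on RS tasks (e.g., 68.00% vs 84.43% accuracy on UCMerced, and METEOR 10.0 vs 83.9 on region captioning).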
GeoChat successfully unifies multiple RS tasks (grounding, captioning, VQA) into a single model without requiring task-specific fine-tuning for each.
The model struggles with very small objects in grounding tasks, a persistent challenge in high-altitude satellite imagery.
Using high-resolution inputs (504x504) via positional interpolation is effective for retaining details necessary for RS analysis.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (e.g., LLaVA, CLIP)
Instruction Tuning and LoRA
Remote Sensing tasks (VQA, Scene Classification, Object Detection)

Key Terms

VLM: Vision-Language Model—a model that can process and reason about both images and text

Visual Grounding: The ability of a model to link specific words or phrases in text to corresponding regions (bounding boxes) in an image

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices

RS: Remote Sensing, the scanning of the Earth by satellite or high-flying aircraft to obtain information about it

VQA: Visual Question Answering—the task of answering natural language questions about the visual content of an image

CLIP: Contrastive Language-Image Pre-training—a model trained to predict which caption goes with which image

RoI: Region of Interest—a specific portion of an image selected for further processing or analysis

Hallucination: When a model generates plausible-sounding but factually incorrect information not present in the source input