← Back to Paper List

A Recipe for Improving Remote Sensing VLM Zero Shot Generalization

Aviad Barzilai, Yotam Gigi, Amr Helmy, Vered Silverman, Yehonathan Refael, Bolous Jaber, Tomer Shekel, George Leifman, Genady Beryozkin
Google Research
arXiv (2025)
MM Pretraining Benchmark

Paper Summary

Remote Sensing (RS) Foundation Models, Vision-Language Models (VLMs)
This paper introduces two large-scale remote sensing image-caption datasets and a tailored training recipe for the MaMMUT VLM, achieving state-of-the-art zero-shot retrieval and enabling self-supervised localization.
Core Problem
Remote sensing (RS) lacks large-scale, high-quality image-text datasets, so foundation models generalize far worse to orbital viewpoints and low-resolution imagery than they do to ground-level photos.
Why it matters:
  • Current general-purpose VLMs struggle with the unique spatial relationships and top-down perspectives of satellite imagery.
  • Scarcity of paired text-image data in the RS domain limits the development of models capable of open-vocabulary detection and complex scene understanding.
Concrete Example: Existing models optimized for ground-level images often fail to understand orbital perspectives. The paper notes that despite being trained on broad captions, standard models struggle to localize specific features like 'airport' or 'stadium' without specialized RS training data.
Key Novelty
RS-Specific Data Synthesis & Smooth-Attention Localization
  • Generates the RS-Landmarks dataset by aligning satellite imagery with Google Maps locations and using Gemini 1.5 Pro to write detailed, grounded captions.
  • Creates RS-WebLI by training classifiers to filter the massive WebLI dataset for aerial/satellite content.
  • Introduces 'Smooth-Attention-Operation', a sliding-window attention pooling mechanism that generates robust segmentation masks from image-level supervision.
Evaluation Highlights
  • The proposed MT-RSWebLI-RSLandmarks model outperforms all public baselines (e.g., SigLip, CLIP-RS) on zero-shot retrieval benchmarks (RSICD, RSIVL, MLRSNet).
  • Achieves significantly higher Recall@1 on RSICD compared to general-purpose baselines like SigLip-B/16.
  • Zero-shot classification on unseen categories (RS-Landmarks-89-holdout) nearly matches the performance of a model trained on those specific categories.
Breakthrough Assessment
8/10
Strong contribution through high-quality dataset synthesis (18M+ images) and a practical recipe for adapting VLMs to remote sensing, with clear state-of-the-art results on standard benchmarks.
×