Evaluation Setup
Models are evaluated by zero-shot transfer and by fine-tuning on several remote sensing (RS) downstream tasks.
Benchmarks:
- Zero-shot Classification (ZSC) (Image Classification)
- Remote Sensing Cross-Modal Text–Image Retrieval (RSCTIR) (Image-Text Retrieval)
- Semantic Localization (SeLo) (Weakly Supervised Visual Grounding)
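
To make the zero-shot classification setting concrete, here is a minimal sketch of how a CLIP-style model typically assigns a class: each image embedding is compared with one text embedding per class prompt, and the most similar prompt wins. The function name `zero_shot_predict` and the toy embeddings are illustrative assumptions, not taken from the paper or from any library.

```python
import numpy as np

def zero_shot_predict(image_emb, class_text_emb):
    """Return, for each image, the index of the most similar class prompt.

    image_emb:      (num_images, dim) array of image embeddings
    class_text_emb: (num_classes, dim) array, one text embedding per class prompt
    """
    # L2-normalize so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sim = img @ txt.T  # shape: (num_images, num_classes)
    return sim.argmax(axis=1)

# Toy embeddings, made up for illustration (not real model outputs)
class_text_emb = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
image_emb = np.array([[0.9, 0.1, 0.0],
                      [0.2, 0.8, 0.1],
                      [2.0, 0.5, 0.0]])
preds = zero_shot_predict(image_emb, class_text_emb)
```

In practice the embeddings would come from the model's image and text encoders. No task-specific training is involved, which is what makes the setting zero-shot.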
Metrics:
- Top-1 Accuracy
- Recall@K (R@1, R@5, R@10)
- SeLo localization metric: not explicitly specified in the available text (possibly mean IoU)
- Statistical methodology: Not explicitly reported in the paper
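
As a rough illustration of the first two metrics, the sketch below computes Top-1 accuracy from predicted labels and Recall@K from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, which is a simplification (RS retrieval benchmarks often pair several captions with each image). The helper names and toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def top1_accuracy(pred_labels, true_labels):
    """Fraction of samples whose predicted label equals the true label."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float((pred == true).mean())

def recall_at_k(sim, gt_index, k):
    """Fraction of queries whose correct candidate is among the top-k results.

    sim[i, j]   : similarity between query i and candidate j
    gt_index[i] : index of the single correct candidate for query i (assumption)
    """
    sim = np.asarray(sim)
    gt = np.asarray(gt_index)
    # Indices of the k highest-scoring candidates for each query
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == gt[:, None]).any(axis=1)
    return float(hits.mean())

# Toy similarity matrix: 3 queries x 3 candidates (illustrative values only)
sim = [[0.9, 0.1, 0.3],
       [0.2, 0.4, 0.8],
       [0.5, 0.6, 0.1]]
gt = [0, 1, 2]

r_at_1 = recall_at_k(sim, gt, k=1)
r_at_2 = recall_at_k(sim, gt, k=2)
r_at_3 = recall_at_k(sim, gt, k=3)
acc = top1_accuracy([0, 1, 0], [0, 1, 1])
```

For text-to-image retrieval, the queries are captions and the candidates are images. For image-to-text retrieval, the roles are swapped.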
Key Results
GeoRSCLIP significantly outperforms baselines in Zero-shot Classification tasks. GeoRSCLIP shows consistent gains in cross-modal retrieval and localization.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ZSC Tasks (Aggregate) | Top-1 Accuracy (implied) | Not reported in the paper | Not reported in the paper | +3% to +20% |
| RSCTIR Tasks (Aggregate) | Recall/Retrieval Score (implied) | Not reported in the paper | Not reported in the paper | +3% to +6% |
| SeLo Tasks (Aggregate) | Localization Score (implied) | Not reported in the paper | Not reported in the paper | +4% to +5% |
Main Takeaways
- Scale matters: Increasing the RS dataset size to 5 million pairs (RS5M) enables effective domain transfer for VLMs.
- Synthetic captioning with quality control (rotation invariance) is a viable strategy for scaling up domain-specific data where text pairs are scarce.
- The proposed domain-specific vision-language model (DVLM), GeoRSCLIP, generalizes better to RS tasks than the original general vision-language model (GVLM), CLIP, without losing the benefits of the original pre-training.