
Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards

Aybora Koksal, A. Alatan
Middle East Technical University
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
MM RL Reasoning Benchmark

📝 Paper Summary

Remote Sensing · Vision-Language Models (VLMs) · Reinforcement Learning
The paper demonstrates that vision-language models can learn robust remote-sensing reasoning capabilities using as few as one training example by employing reinforcement learning with verifiable rule-based rewards instead of expensive caption supervision.
Core Problem
Remote sensing domain adaptation typically requires thousands to millions of expert-annotated image-caption pairs, which are expensive to collect and often lack the precision needed for fine-grained reasoning.
Why it matters:
  • Manual collection of paired satellite imagery and detailed captions is time-consuming and costly, limiting dataset diversity
  • Existing methods rely on LLM-generated 'pseudo-captions', which often lack the precision required for accurate fine-tuning
  • Standard supervised fine-tuning often fails to elicit reasoning capabilities in specialized domains without massive data scale
Concrete Example: Asked to 'Output the bounding box' of an object, the base model typically fails (0% accuracy on DIOR-RS). Standard solutions train on thousands of box-caption pairs. This method succeeds with a single training example by rewarding the model only when the intersection over union (IoU) between its predicted box and the ground-truth box is sufficiently high.
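The IoU check described in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `box_iou` and `iou_reward`, the `(x_min, y_min, x_max, y_max)` box format, and the 0.5 threshold are all assumptions made for the sketch, since the summary does not state them.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption for this sketch.
    """
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Degenerate (zero-area) boxes yield an IoU of 0 rather than a division error.
    return inter / union if union > 0 else 0.0


def iou_reward(pred: Sequence[float], gt: Sequence[float],
               threshold: float = 0.5) -> float:
    """Return 1.0 when the predicted box overlaps the ground truth enough.

    The 0.5 threshold is a hypothetical default, not a value from the paper.
    """
    return 1.0 if box_iou(pred, gt) >= threshold else 0.0
```

Because the reward is computed directly from coordinates, it can be verified without any caption or human judgment, which is what makes it usable as a training signal from a single annotated box.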
Key Novelty
Few-Shot RLVR for Vision-Language Models
  • Adapts '1-shot RLVR' from text-only LLMs to multimodal satellite imagery, training on as few as one example with policy-gradient optimization
  • Eliminates the need for caption supervision by using lightweight, rule-based binary rewards (correct/incorrect) or IoU-based rewards (bounding box overlap)
  • Demonstrates that base VLMs have latent reasoning capabilities that can be 'unlocked' via RL rather than learned from scratch via supervised fine-tuning
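To make the idea of policy-gradient training on one example with a binary rule-based reward concrete, here is a toy sketch in Python. It is deliberately not the paper's setup: the "policy" is a softmax over a handful of hypothetical candidate answers rather than a VLM, the update is plain REINFORCE with a mean-reward baseline, and every name and hyperparameter (`binary_reward`, `one_shot_policy_gradient`, `steps`, `samples`, `lr`) is an assumption chosen for illustration.

```python
import math
import random


def binary_reward(prediction: str, target: str) -> float:
    """Rule-based reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_shot_policy_gradient(candidates, target, steps=200, samples=8,
                             lr=0.5, seed=0):
    """REINFORCE on a single (question, answer) example.

    The policy is a categorical distribution over `candidates`, a stand-in
    for a model's possible outputs. Returns the final answer probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)  # start from a uniform policy
    for _ in range(steps):
        probs = softmax(logits)
        # Sample several answers for the same single training example.
        picks = rng.choices(range(len(candidates)), weights=probs, k=samples)
        rewards = [binary_reward(candidates[i], target) for i in picks]
        baseline = sum(rewards) / samples  # mean reward reduces variance
        grad = [0.0] * len(logits)
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            # d log softmax(i) / d logit_j = 1[j == i] - probs[j]
            for j in range(len(logits)):
                grad[j] += advantage * ((1.0 if j == i else 0.0) - probs[j])
        logits = [z + lr * g / samples for z, g in zip(logits, grad)]
    return softmax(logits)
```

Calling `one_shot_policy_gradient(["yes", "no", "urban", "rural"], "rural")` shifts probability mass toward the rewarded answer. When every sampled answer receives the same reward, the advantages are all zero and no update occurs, so learning depends on the initial policy producing a correct answer at least occasionally.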
Evaluation Highlights
  • 1-shot RLVR yields double-digit gains over the base model (e.g., +11.65% on RSVQA-LR, +24.38% on DIOR-RS grounding) using a single training example
  • Scaling to 128 examples matches or exceeds the performance of baselines trained on 2,000 fully annotated samples across classification and VQA tasks
  • The 2B-parameter model outperforms or rivals 7B-parameter state-of-the-art models such as GeoChat and ScoreRS, which were trained on millions of examples
Breakthrough Assessment
8/10
Significantly lowers the barrier for domain-specific VLM adaptation. Proving that 1-shot RL works for multimodal reasoning (not just text math) in a specialized domain is a strong, practical contribution.