Few-Shot Vision-Language Reasoning for Satellite Imagery via Verifiable Rewards

📝 Paper Summary

Remote Sensing, Vision-Language Models (VLMs), Reinforcement Learning

The paper demonstrates that vision-language models can learn robust remote-sensing reasoning capabilities using as few as one training example by employing reinforcement learning with verifiable rule-based rewards instead of expensive caption supervision.

Core Problem

Remote sensing domain adaptation typically requires thousands to millions of expert-annotated image-caption pairs, which are expensive to collect and often lack the precision needed for fine-grained reasoning.

Why it matters:

Manual collection of paired satellite imagery and detailed captions is time-consuming and costly, limiting dataset diversity
Existing methods rely on LLM-generated 'pseudo-captions' which often lack the precision required for accurate fine-tuning
Standard supervised fine-tuning often fails to elicit reasoning capabilities in specialized domains without massive data scale

Concrete Example: A base model asked to 'Output the bounding box' of an object typically fails (0% accuracy on DIOR-RS). Standard solutions require training on thousands of box-caption pairs. With a single training example, this method reaches 24.38% on the same benchmark by rewarding the model only when its predicted box overlaps the ground truth sufficiently, as measured by Intersection over Union (IoU).
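The overlap check in the example above can be illustrated with a minimal sketch. The box convention (x_min, y_min, x_max, y_max), the function names, and the 0.5 threshold are assumptions made for illustration; the summary does not state the paper's exact values.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(pred: Box, truth: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    # Corners of the intersection rectangle.
    ix_min = max(pred[0], truth[0])
    iy_min = max(pred[1], truth[1])
    ix_max = min(pred[2], truth[2])
    iy_max = min(pred[3], truth[3])

    # Width and height clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_truth = max(0.0, truth[2] - truth[0]) * max(0.0, truth[3] - truth[1])
    union = area_pred + area_truth - inter
    return inter / union if union > 0 else 0.0


def overlap_reward(pred: Box, truth: Box, threshold: float = 0.5) -> float:
    """Return 1.0 if IoU reaches the (hypothetical) threshold, else 0.0."""
    return 1.0 if iou(pred, truth) >= threshold else 0.0
```

For instance, two 2x2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7 and the reward under the assumed threshold is 0.0.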

Key Novelty

Few-Shot RLVR for Vision-Language Models

Adapts '1-shot RLVR' from text-only LLMs to multimodal satellite imagery, training on as few as one example using Policy Gradient optimization
Eliminates the need for caption supervision by using lightweight, rule-based binary rewards (correct/incorrect) or IoU-based rewards (bounding box overlap)
Demonstrates that base VLMs have latent reasoning capabilities that can be 'unlocked' via RL rather than learned from scratch via supervised fine-tuning

Evaluation Highlights

1-shot RLVR yields double-digit gains over the base model (e.g., +11.65 points on RSVQA-LR, +24.38 points on DIOR-RS grounding) using a single training example
Scaling to 128 examples matches or exceeds the performance of baselines trained on 2,000 fully annotated samples across classification and VQA tasks
The 2B parameter model outperforms or rivals 7B parameter state-of-the-art models (like GeoChat and ScoreRS) which were trained on millions of examples

Breakthrough Assessment

8/10

Significantly lowers the barrier for domain-specific VLM adaptation. Proving that 1-shot RL works for multimodal reasoning (not just text math) in a specialized domain is a strong, practical contribution.

⚙️ Technical Details

Problem Definition

Setting: Few-shot adaptation of a pre-trained Vision-Language Model to remote sensing tasks via Reinforcement Learning

Inputs: Satellite image I and text prompt P (instruction)

Outputs: Reasoning chain <reasoning>...</reasoning> followed by final answer <answer>...</answer> or bounding box
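A rule-based check of this output structure might look like the sketch below. The regular expression, the function names, and the case-insensitive exact-match comparison are assumptions for illustration, not the paper's released implementation.

```python
import re
from typing import Optional

# The whole response must be a reasoning block followed by an answer block.
_PATTERN = re.compile(
    r"\s*<reasoning>(.*?)</reasoning>\s*<answer>(.*?)</answer>\s*",
    re.DOTALL,
)
_TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")


def extract_answer(response: str) -> Optional[str]:
    """Return the answer text if the response follows the format, else None."""
    match = _PATTERN.fullmatch(response)
    if match is None:
        return None
    reasoning, answer = match.groups()
    # Reject repeated or nested tags that the pattern would otherwise absorb.
    if any(tag in reasoning + answer for tag in _TAGS):
        return None
    return answer.strip()


def format_reward(response: str) -> float:
    """Binary reward for following the tag structure."""
    return 1.0 if extract_answer(response) is not None else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary reward: case-insensitive exact match on the extracted answer."""
    answer = extract_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer.lower() == ground_truth.strip().lower() else 0.0
```

A response such as `<reasoning>Runways are visible.</reasoning><answer>Airport</answer>` passes both checks against the label `airport`, while a bare `Airport` earns neither reward.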

Pipeline Flow

Visual Encoder (ViT) processes image
Adapter compresses visual tokens
LLM generates reasoning trace and answer
Reward Verification (Binary/IoU) calculates score
GRPO updates policy

System Modules

Visual Encoder (Input Processing)

Extract visual features from satellite imagery

Model or implementation: Vision Transformer (ViT) with 675M params (from Qwen2-VL-2B)

Vision-Language Adapter (Input Processing)

Compress visual features and inject positional context

Model or implementation: Single-layer cross-attention mechanism

Language Model

Generate reasoning chain and final answer

Model or implementation: Qwen2-1.5B (initialized from Qwen2-VL-2B)

Novel Architectural Elements

Integration of quantized IoU-based rewards directly into the RLVR loop for multimodal visual grounding tasks
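One way to quantize an IoU score into a stepwise reward is sketched below. The threshold levels (0.5, 0.75, 0.9) and the equal-step scheme are hypothetical; the summary notes that the paper specifies its thresholds but does not reproduce them.

```python
from typing import Sequence


def quantized_iou_reward(
    iou_value: float,
    thresholds: Sequence[float] = (0.5, 0.75, 0.9),  # hypothetical levels
) -> float:
    """Map a continuous IoU to a stepwise reward in [0, 1].

    The reward is the fraction of thresholds that the IoU reaches, so it
    rises in equal steps as the predicted box gets tighter.
    """
    if not thresholds:
        raise ValueError("at least one threshold is required")
    passed = sum(1 for t in thresholds if iou_value >= t)
    return passed / len(thresholds)
```

Compared with a single pass/fail cutoff, a stepped reward of this kind still gives partial credit for a roughly correct box while keeping the signal discrete and easy to verify.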

Modeling

Base Model: Qwen2-VL-2B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward relative to group average.
Formally: Policy Gradient Loss favoring responses with high advantages.

Purpose: Prevent language degradation.
Formally: KL Divergence Loss penalizing deviation from the reference model.

Purpose: Enforce output structure.
Formally: Format compliance reward (binary signal for strictly following XML tags).

Purpose: Ensure task correctness.
Formally: Task-specific accuracy reward (Binary for QA/CLS, Quantized IoU for Grounding).
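The first two objectives can be sketched in a few lines. The code below assumes the commonly used GRPO formulation: rewards are standardized within the group of responses sampled for one prompt, and the KL term uses the non-negative estimator exp(d) - d - 1 with d = log p_ref - log p_policy. Whether the paper's code matches these details exactly is an assumption, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_penalty(policy_logp: float, ref_logp: float) -> float:
    """Per-token KL estimate between policy and reference (always >= 0)."""
    log_ratio = ref_logp - policy_logp
    return math.exp(log_ratio) - log_ratio - 1.0


# Example: a group of 4 responses, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no policy-gradient signal, which is why a mix of correct and incorrect samples within a group matters.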

Adaptation: Full fine-tuning of LLM and Adapter

Trainable Parameters: Not explicitly reported in the paper (implies LLM + Adapter parameters)

Training Data:

Sampled from VHM-Instruct dataset
Few-shot sets: 1, 2, 4, 8, 16, 32, 64, 128 examples total
Random sampling strategy
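Drawing the few-shot sets could be done as in the following sketch. The seed, the helper name, and the choice to draw each set independently (rather than nesting smaller sets inside larger ones) are assumptions; the summary only states that sampling is random.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Few-shot set sizes listed in the summary.
SHOT_COUNTS = (1, 2, 4, 8, 16, 32, 64, 128)


def sample_few_shot(pool: Sequence[T], k: int, seed: int = 0) -> List[T]:
    """Draw k distinct examples from the pool, reproducibly for a given seed."""
    if k > len(pool):
        raise ValueError(f"cannot draw {k} examples from a pool of {len(pool)}")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(pool, k)


# Example with a stand-in pool of integer IDs instead of real dataset records.
pool = list(range(1000))
few_shot_sets = {k: sample_few_shot(pool, k) for k in SHOT_COUNTS}
```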

Key Hyperparameters:

learning_rate: 1e-6
beta (KL penalty): 0.001
batch_size: 128
group_size: 4 responses per image
temperature: 0.9
gradient_accumulation_steps: 1-8
max_prompt_length: 8192
max_completion_length: 8192

Compute: NVIDIA H100 GPUs (16 to 128 depending on availability), training for 1000-2000 steps

Comparison to Prior Work

vs. ScoreRS: Achieves competitive results with 2B model and 128 examples vs 7B model and multi-stage training on millions of examples
vs. GeoChat: Uses RLVR with minimal data vs Supervised Fine-Tuning (SFT) on large datasets
vs. DeepSeek-R1-Zero: Adapts the pure RL approach to multimodal vision-language tasks with IoU rewards [not cited in paper as direct baseline, but methodologically related]

Limitations

1-shot settings can induce mild task-specific overfitting (performance drops on the specific dataset split the example came from while generalizing elsewhere)
Visual grounding is significantly harder than classification/VQA in few-shot settings; while improved, it still lags behind full-data training
Performance plateau observed after 64-128 shots, suggesting diminishing returns for pure RLVR without additional data diversity

Reproducibility

Code: https://github.com/aybora/FewShotReasoning

Code and dataset are available at the repository linked above. The paper clearly details the reward functions (quantized IoU thresholds) and prompting strategies (XML tags). Four responses per image were sampled because of memory constraints on the H100s.

📊 Experiments & Results

Evaluation Setup

Evaluation on standard remote sensing benchmarks using generated reasoning chains

Benchmarks:

RSVQA-LR (Visual Question Answering, Low Resolution)
RSVQA-HR (Visual Question Answering, High Resolution)
METER-ML (Scene Classification)
DIOR-RS (Visual Grounding / Object Detection)
LHRS-Bench (General Remote Sensing Knowledge/Reasoning)

Metrics:

Accuracy (for VQA and Classification)
IoU / Average Precision (for Visual Grounding)
Statistical methodology: Not explicitly reported in the paper

Key Results

1-shot RLVR demonstrates immediate significant gains over the base model, proving latent reasoning capabilities can be unlocked with minimal signal.

Benchmark	Metric	Baseline	This Paper	Δ
RSVQA-LR	Accuracy	65.65	77.30	+11.65
DIOR-RS (Visual Grounding)	Accuracy/IoU	0.00	24.38	+24.38

Scaling to 128 examples allows the method to match or exceed baselines trained on significantly larger datasets (2,000 to millions).

Benchmark	Metric	Baseline	This Paper	Δ
RSVQA-LR	Accuracy	81.25	80.96	-0.29
UCM (Classification)	Accuracy	87.14	88.10	+0.96
RSVQA-LR	Accuracy	80.68	80.96	+0.28

Experiment Figures

Qualitative examples of the model's reasoning traces on classification and VQA tasks

Main Takeaways

Base VLMs possess latent reasoning abilities for remote sensing that can be unlocked with just 1 example and verifiable rewards, without caption supervision.
Training on roughly 128 randomly sampled examples via RLVR is a cost-effective 'sweet spot', matching the performance of baselines trained on 2,000 annotated samples.
1-shot training exhibits 'task-specific overfitting' (performance drops on the specific dataset split used for the single example) but maintains or improves generalization on other tasks.
Visual Grounding benefits massively from RLVR (0% to ~30%) but remains the task where matching full-supervision performance is hardest, because of the precision required.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (architecture and fine-tuning)
Reinforcement Learning (Policy Gradient methods)
Remote Sensing tasks (VQA, Grounding, Classification)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using objective, rule-based checks (like 'is the answer correct?') to guide model training instead of human feedback or static labels

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs for the same input to reduce variance, without needing a separate value function network

IoU: Intersection over Union—a metric for object detection measuring the overlap between a predicted bounding box and the ground truth box (0 = no overlap, 1 = perfect match)

Chain of Thought (CoT): A prompting technique where the model is encouraged to generate intermediate reasoning steps before producing the final answer

VQA: Visual Question Answering—the task of answering natural language questions about an image

Visual Grounding: The task of locating an object in an image (usually via bounding box) based on a text description

KL Divergence: Kullback-Leibler Divergence—a statistical distance measure used here as a penalty to prevent the RL-trained model from drifting too far from the original base model's language distribution

ViT: Vision Transformer—an architecture that processes images as sequences of patches, used here as the visual encoder