UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning

📝 Paper Summary

Aerial Vision-Language Models, Visual Question Answering (VQA), Reinforcement Learning for VLMs

UAV-VL-R1 adapts lightweight vision-language models to aerial imagery by combining supervised fine-tuning with multi-stage Group Relative Policy Optimization (GRPO) to enforce structured, interpretable reasoning chains.

Core Problem

General-purpose vision-language models degrade on UAV imagery due to unique bird's-eye perspectives, high resolution, and complex spatial semantics, often producing unexplainable or hallucinated outputs.

Why it matters:

UAV applications like disaster monitoring require real-time, robust reasoning that general models cannot provide due to domain gaps
Standard Supervised Fine-Tuning (SFT) encourages pattern memorization rather than true spatial reasoning, failing in structured tasks like counting or location inference
Existing aerial datasets lack the reasoning annotations needed to train interpretable 'Chain-of-Thought' capabilities

Concrete Example: In an aerial image, a standard VLM might correctly identify a car but fail to count vehicles in a crowded intersection or explain their spatial relationships, whereas UAV-VL-R1 produces a structured trace (<think>...</think>) detailing the counting process before answering.

Key Novelty

Hybrid SFT + Multi-Stage GRPO Curriculum

Combines SFT for initial semantic alignment with a three-stage reinforcement learning curriculum (Attributes → Objects → Spatial relations) to progressively build reasoning complexity
Utilizes Group Relative Policy Optimization (GRPO) to estimate advantages from group-wise output comparisons, eliminating the need for a separate value function model
Enforces a dual-tag output format (<think> for reasoning, <answer> for result) via rule-based rewards to ensure interpretability
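
The group-wise advantage estimate described above can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization, in which each sampled response's reward is standardized against the mean and standard deviation of its own group; the function name `group_relative_advantages` and the `eps` constant are hypothetical, and the paper's exact normalization may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group.

    Assumed GRPO-style normalization: (r - group mean) / (group std + eps).
    No learned value model is involved; the group itself is the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four responses sampled for the same (image, question) pair,
# using the reward levels listed in this summary (0, 0.5, 1.5, or 2.0).
advantages = group_relative_advantages([2.0, 0.5, 0.5, 0.0])
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones, so the policy update only needs rewards for sampled outputs, not a critic.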

Architecture

Figure: the training pipeline, comprising SFT initialization followed by three-stage GRPO reinforcement learning.

Evaluation Highlights

Outperforms the 36x larger Qwen2-VL-72B-Instruct model on UAV tasks (72.13% vs 46.67% accuracy)
Achieves 48.17% higher zero-shot accuracy than the base Qwen2-VL-2B-Instruct model
Requires only 3.9 GB memory (FP16) or 2.5 GB (INT8) for inference, enabling edge deployment
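
The size and accuracy comparison with the 72B model can be checked with simple arithmetic. The sketch below uses only figures quoted in this summary; the variable names are hypothetical, and the parameter counts are the nominal sizes in the model names rather than exact counts.

```python
small_params_b = 2    # nominal size of Qwen2-VL-2B-Instruct, in billions
large_params_b = 72   # nominal size of Qwen2-VL-72B-Instruct, in billions
acc_small = 72.13     # UAV-VL-R1 accuracy (%), as quoted above
acc_large = 46.67     # Qwen2-VL-72B-Instruct accuracy (%), as quoted above

# Nominal size ratio between the two models.
size_ratio = large_params_b / small_params_b

# Absolute gap in percentage points versus relative gain in percent.
gap_points = round(acc_small - acc_large, 2)
relative_gain = round((acc_small / acc_large - 1) * 100, 2)

print(size_ratio, gap_points, relative_gain)
```

Keeping the absolute gap (percentage points) separate from the relative gain (percent) avoids ambiguity when comparing accuracy figures.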

Breakthrough Assessment

8/10

Demonstrates that a small (2B) model can radically outperform giant (72B) models in specialized domains via structured reinforcement learning, without needing human preference labels.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering on high-resolution aerial imagery with structured reasoning requirements

Inputs: Aerial image I and natural language question q

Outputs: Structured sequence containing reasoning trace r and final answer a

Pipeline Flow

Visual Encoder (Feature Extraction)
SFT Module (Semantic Alignment)
RL Module (Structured Reasoning Optimization)
Output Generation (Dual-Tag Format)

System Modules

Visual Encoder

Extract visual features from high-resolution UAV imagery

Model or implementation: Qwen2-VL-2B (Vision Tower)

LLM Backbone

Generate reasoning traces and answers based on visual features

Model or implementation: Qwen2-VL-2B-Instruct with LoRA adapters

Modeling

Base Model: Qwen2-VL-2B-Instruct

Training Method: Hybrid SFT + Multi-Stage Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize likelihood of the correct reasoning path r and answer a, given image I and question q, during SFT.

Formally: L_SFT = -log P(r, a | I, q)

Purpose: Optimize policy using relative rewards within a group, where A_i is the advantage of the i-th sampled response in the group.

Formally: J_GRPO = E [ (π/π_old) * A_i - β * D_KL(π || π_ref) ]

Purpose: Define reward based on structure and correctness.

Formally: R = r_format (0 or 0.5) + r_accuracy (0 or 1.5)
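
The rule-based reward can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the regular expressions, the function name `rule_based_reward`, and the use of case-insensitive exact match for correctness are all hypothetical; only the reward values (0.5 for format, 1.5 for accuracy) come from the summary.

```python
import re

# Format rule (assumed): the whole output is <think>...</think> then <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)
# The answer is read from the first <answer> tag, wherever it appears.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(output, gold_answer):
    """Return r_format (0 or 0.5) + r_accuracy (0 or 1.5)."""
    r_format = 0.5 if FORMAT_RE.match(output) else 0.0

    match = ANSWER_RE.search(output)
    predicted = match.group(1) if match else ""
    # Correctness check (assumed): case-insensitive exact match after trimming.
    is_correct = predicted.strip().lower() == gold_answer.strip().lower()
    r_accuracy = 1.5 if is_correct else 0.0

    return r_format + r_accuracy

reward = rule_based_reward(
    "<think>Two cars on the left, one on the right.</think><answer>3</answer>", "3"
)
```

Because the two terms are scored independently, a correctly formatted but wrong response still earns 0.5, which rewards the structure even before the answers become accurate.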

Adaptation: LoRA (rank=32, alpha=48)

Trainable Parameters: Visual encoder and LoRA adapters (backbone frozen)
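
To make the LoRA settings concrete, the sketch below computes the scaling factor and the number of added parameters for one adapted weight matrix. It assumes the standard LoRA formulation, where the low-rank update is scaled by alpha / rank; the 1536 x 1536 layer shape and both function names are hypothetical and chosen only for illustration.

```python
def lora_scaling(rank, alpha):
    """Scaling applied to the low-rank update in standard LoRA: alpha / rank."""
    return alpha / rank

def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

scaling = lora_scaling(rank=32, alpha=48)

# Hypothetical layer shape; not taken from the paper.
d_in, d_out = 1536, 1536
extra = lora_extra_params(d_in, d_out, rank=32)
fraction_of_full = extra / (d_in * d_out)

print(scaling, extra, fraction_of_full)
```

With rank 32 and alpha 48 the update is scaled by 1.5, and the adapter for a square layer of this hypothetical size adds only a small fraction of the parameters in the full matrix.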

Training Data:

HRVQA-VL dataset: 50,019 samples
Covering 8 tasks across 3 complexity stages (Attributes, Objects, Spatial)
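
The staged ordering of the training data can be sketched as a small curriculum iterator. This is a minimal illustration, assuming each sample carries a stage label; the `stage` key, the toy questions, and the function name `curriculum_stages` are hypothetical and do not reflect the actual HRVQA-VL schema.

```python
from collections import defaultdict

# Stage order taken from the summary: Attributes -> Objects -> Spatial.
STAGE_ORDER = ("attributes", "objects", "spatial")

def curriculum_stages(samples):
    """Yield (stage, samples) pairs in curriculum order.

    Each sample is a dict with a hypothetical "stage" key.
    """
    by_stage = defaultdict(list)
    for sample in samples:
        if sample["stage"] not in STAGE_ORDER:
            raise ValueError(f"unknown stage: {sample['stage']!r}")
        by_stage[sample["stage"]].append(sample)
    for stage in STAGE_ORDER:
        yield stage, by_stage[stage]

toy_samples = [
    {"stage": "spatial", "question": "Is the car left of the building?"},
    {"stage": "attributes", "question": "What color is the roof?"},
    {"stage": "objects", "question": "Is there a bus in the image?"},
]

for stage, stage_samples in curriculum_stages(toy_samples):
    # A real pipeline would run one GRPO training stage on stage_samples here.
    print(stage, len(stage_samples))
```

Grouping by stage before training lets each GRPO stage start from the policy produced by the previous, simpler stage.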

Key Hyperparameters:

lora_rank: 32
lora_alpha: 48
reward_max: 2.0
inference_memory_fp16: 3.9 GB

Compute: Inference: 3.9 GB (FP16), 2.5 GB (INT8)

Comparison to Prior Work

vs. Qwen2-VL-72B: Achieves higher accuracy on domain-specific tasks despite being 36x smaller, a gain attributed to RL-driven structured reasoning
vs. Standard SFT: Uses GRPO to explore reasoning paths rather than just mimicking patterns, improving generalization
vs. PPO-based methods: Uses group relative advantages instead of a value model, reducing computational overhead [not cited in paper but implied by method choice]

Limitations

SFT stage may reduce reasoning diversity in mathematical tasks (e.g., counting) before RL compensates
Performance depends heavily on the quality of rule-based rewards (format and accuracy)
Evaluation is limited to the HRVQA-VL dataset constructed by the authors

Reproducibility

The paper introduces the HRVQA-VL dataset (50,019 samples), but the text does not state whether code is available. Training relies on the Qwen2-VL-2B base model. LoRA hyperparameters are provided (rank 32, alpha 48).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on the HRVQA-VL dataset across three task complexity stages.

Benchmarks:

HRVQA-VL (Aerial Visual Question Answering) [New]

Metrics:

Accuracy (Multitask)
Zero-shot Accuracy

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis against general-purpose VLMs shows significant gains for the specialized lightweight model, even against much larger baselines.

Benchmark	Metric	Baseline	This Paper	Δ
HRVQA-VL	Accuracy (%)	26.93	72.13	+45.20

Experiment Figures

The three-stage task curriculum (Stage A, B, C) and the HRVQA-VL dataset structure.

Main Takeaways

RL-based training (GRPO) drastically improves performance over SFT alone, particularly for structured reasoning tasks in the aerial domain.
Lightweight specialized models (2B) can outperform generalist giants (72B) when trained with domain-specific reasoning curricula.
SFT is crucial for initial semantic alignment but can hinder numerical reasoning diversity; RL recovers and enhances this capability.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Reinforcement Learning from Human Feedback (RLHF)
Low-Rank Adaptation (LoRA)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input within a group, reducing variance without a value network

SFT: Supervised Fine-Tuning—training a model on labeled examples to establish initial capabilities before reinforcement learning

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains only small rank-decomposition matrices

PPO: Proximal Policy Optimization—a standard RL algorithm that typically requires a separate value model (critic) and can be unstable in complex reasoning tasks

Chain-of-Thought: A reasoning strategy where the model generates intermediate steps before the final answer to improve accuracy and interpretability

FP16: Half-precision floating-point format (16-bit) used to reduce memory usage during model inference

INT8: 8-bit integer quantization, further compressing the model for deployment on resource-constrained devices