Text Before Vision: Staged Knowledge Injection Matters for Agentic RLVR in Ultra-High-Resolution Remote Sensing Understanding

📝 Paper Summary

Remote Sensing, Agentic Reinforcement Learning, Visual Question Answering (VQA)

Cold-starting a multimodal model on Earth-science text-only QA instills reasoning structures that stabilize and amplify the subsequent agentic reinforcement learning for ultra-high-resolution remote sensing.

Core Problem

Multimodal models struggle with ultra-high-resolution remote sensing because they must localize tiny targets in massive pixel spaces, and standard reinforcement learning fails to navigate these vast spaces without structured domain priors.

Why it matters:

Visual evidence acquisition in 8K+ resolution images is a bottleneck; models often fail to zoom in on the correct tiny regions
Standard RL agents blindly explore evidence paths without domain rules, leading to unstable optimization and poor generalization
Existing post-training methods (SFT or RLVR alone) struggle to improve the 'reasoning boundary' (Pass@32) in these specialized scenarios

Concrete Example: When asked to identify a specific facility in an 8376x8378 pixel image, a standard model might randomly zoom in or fail to find the region. Without text-learned rules (e.g., specific coastal features associated with the facility), the RL agent cannot learn an effective zoom-in policy from scratch.

Key Novelty

Text-Before-Vision Staged Knowledge Injection

Cold-start the model with large-scale Earth-science text-only QA to instill domain concepts and reasoning structures (CoT) before visual training
Use 'Hard-Example Pre-warming': Re-use difficult UHR image-text samples from SFT during the subsequent Agentic RLVR stage to stabilize tool-use learning
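
One way to picture the hard-example pre-warming step is as a filter over the SFT image-text set followed by a merge into the RL pool. The sketch below is illustrative only: the summary does not say how "difficult" is measured, so the `pass_rate` field, the `max_pass_rate` threshold, and the source labels are all assumptions introduced here, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    sample_id: str
    source: str       # "uhr_sft" or "general_rl" (illustrative labels)
    pass_rate: float  # assumed: fraction of rollouts the current model answers correctly


def select_hard_examples(samples: List[Sample], max_pass_rate: float = 0.5) -> List[Sample]:
    """Keep UHR SFT samples whose assumed pass rate is at or below the threshold."""
    return [s for s in samples if s.source == "uhr_sft" and s.pass_rate <= max_pass_rate]


def build_rl_pool(sft_samples: List[Sample], rl_samples: List[Sample],
                  max_pass_rate: float = 0.5, seed: int = 0) -> List[Sample]:
    """Merge the hard SFT samples into the RL data and shuffle deterministically."""
    pool = list(rl_samples) + select_hard_examples(sft_samples, max_pass_rate)
    random.Random(seed).shuffle(pool)
    return pool
```

With this shape, easy SFT samples are dropped from the RL stage while the hard ones are seen again under the reward signal.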

Architecture

The automated pipeline for Earth-science text QA data generation and quality control.

Evaluation Highlights

Achieves 60.40% Pass@1 on XLRS-Bench, establishing a new state-of-the-art
Significantly outperforms larger general-purpose models (e.g., GPT-5.2, Gemini 3.0 Pro, Intern-S1) on UHR remote sensing tasks
Removing Chain-of-Thought (CoT) from the text cold-start data lowers Pass@1 by 5.91 points (60.40 to 54.49), indicating that the reasoning structure in the text data drives much of the gain

Breakthrough Assessment

8/10

Counter-intuitive finding that text-only data drives vision-heavy UHR performance. Sets new SOTA on a challenging benchmark and proposes a replicable data/training pipeline.

⚙️ Technical Details

Problem Definition

Setting: Visual Question Answering with Verifiable Rewards on Ultra-High-Resolution (UHR) Images

Inputs: UHR Image (avg. 8376x8378 pixels) and a natural language question

Outputs: Answer with reasoning trace, utilizing zoom-in actions

Pipeline Flow

Input Processing (Question + UHR Image)
Reasoning & Tool Use (Iterative Zoom-in)
Final Answer Generation
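
The three pipeline stages above can be read as a loop in which the agent either zooms or answers at each step. The following is a minimal sketch under stated assumptions: the `policy` callable, the action names `"zoom"` and `"answer"`, the `max_steps` budget, and the toy policy are hypothetical stand-ins, not the paper's interface.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Step = Tuple[str, object]        # ("zoom", Box) or ("answer", str)


def run_episode(question: str,
                image_size: Tuple[int, int],
                policy: Callable[[str, Box, List[Step]], Step],
                max_steps: int = 4) -> Tuple[Optional[str], List[Step]]:
    """Call the policy up to max_steps times; stop as soon as it answers."""
    view: Box = (0, 0, image_size[0], image_size[1])  # start from the full image
    trace: List[Step] = []
    for _ in range(max_steps):
        kind, payload = policy(question, view, trace)
        trace.append((kind, payload))
        if kind == "answer":
            return str(payload), trace
        view = payload  # a "zoom" action replaces the current view with the new box
    return None, trace  # budget exhausted without an answer


def toy_policy(question: str, view: Box, trace: List[Step]) -> Step:
    """Toy stand-in for the model: zoom once, then answer with a fixed string."""
    if not trace:
        return ("zoom", (4000, 4000, 4512, 4512))
    return ("answer", "harbor")
```

Calling `run_episode("Which facility is shown?", (8376, 8378), toy_policy)` yields one zoom step followed by the toy answer.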

System Modules

Base Agent (Reasoning & Tool Use)

Generate reasoning traces and decide whether to zoom in or answer

Model or implementation: QwenVL2.5-7B (fine-tuned)

Zoom-in Tool (Reasoning & Tool Use)

Crop and resize the high-resolution image to the specified region

Model or implementation: Deterministic Image Processing Function
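
Because the zoom-in tool is described as a deterministic crop-and-resize, its geometry can be sketched without any image library. The function names, the center-plus-size parameterization, and the clamping behavior below are assumptions for illustration; the actual pixel operations would be handled by an image library.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def zoom_box(image_size: Tuple[int, int],
             center: Tuple[int, int],
             crop_size: Tuple[int, int]) -> Box:
    """Return a crop box around center, shifted as needed to stay inside the image."""
    width, height = image_size
    crop_w = min(crop_size[0], width)   # never request more pixels than exist
    crop_h = min(crop_size[1], height)
    left = min(max(center[0] - crop_w // 2, 0), width - crop_w)
    top = min(max(center[1] - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def resize_scale(box: Box, target_side: int) -> float:
    """Scale factor that maps the longer side of the crop to target_side pixels."""
    longer = max(box[2] - box[0], box[3] - box[1])
    return target_side / longer
```

For an 8376x8378 image, a 512x512 request near the bottom-right corner is shifted inward rather than truncated, so the crop keeps its requested size.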

Modeling

Base Model: QwenVL2.5-7B

Training Method: Staged training (cold-start SFT followed by Agentic RLVR with GRPO)

Objective Functions:

Purpose: Supervised Fine-Tuning.

Formally: Standard cross-entropy loss on next-token prediction.

Purpose: Reinforcement Learning.

Formally: GRPO objective maximizing the verifiable reward (correctness of the final answer).
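
For reference, the two objectives can be written out in their commonly used forms. This is a sketch of the standard formulations, not a transcription from the paper: the clipping range $\epsilon$, the KL weight $\beta$, the reference policy $\pi_{\mathrm{ref}}$, and the group size $G$ are the usual GRPO symbols, and whether the authors keep the KL term or use a variant is not stated in this summary.

```latex
% SFT stage: token-level cross-entropy over reference responses y given input x
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}
    \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)

% RLVR stage: G rollouts o_1..o_G per query q, each with a scalar reward r_i
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}
                  {\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}

\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\!\left[
      \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
      \min\!\Big(\rho_{i,t}\,\hat{A}_i,\;
                 \operatorname{clip}\!\big(\rho_{i,t},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)
      \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
    \right]
```

Here $r_i$ is the verifiable reward for rollout $o_i$, so the advantage $\hat{A}_i$ only measures how a rollout compares with the others sampled for the same question.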

Training Data:

148,777 Earth-Science text QA pairs (Generated via automated pipeline from textbooks/papers)
SuperRS-VQA (12,228 UHR samples)
DeepEyes-47K (General domain RL data)

Key Hyperparameters:

sft_epochs: 1
rlvr_steps: 80

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepEyes: Adds a 'Text-Before-Vision' staged training recipe (Earth-science text cold start) that boosts UHR reasoning
vs. Standard SFT: Shows that RLVR is needed for UHR navigation, while SFT remains crucial for 'pre-warming' representations
vs. General MLLMs (GPT-5.2, Gemini 3.0): Outperforms them on XLRS-Bench despite its smaller size (7B), a result attributed to specialized domain injection

Limitations

Requires a specialized automated pipeline to generate high-quality Earth-science text QA
Reasoning boundary (Pass@32) gains saturate with more text data, suggesting a limit to text-only priors
RLVR optimization can be unstable without the proposed text/SFT warm-up

Reproducibility

Code: https://github.com/MiliLab/Text-Before-Vision

The code is available on GitHub. The Earth-science text QA data pipeline is described in detail (Textbooks/Papers -> Generator -> Knowledge Graph Verifier). The base model is QwenVL2.5.

📊 Experiments & Results

Evaluation Setup

Visual Question Answering on Ultra-High-Resolution images using a fixed inference budget

Benchmarks:

XLRS-Bench (Ultra-High-Resolution Remote Sensing VQA)

Metrics:

Pass@1 (Average Performance)
Pass@32 (Reasoning Boundary)
Statistical methodology: Not explicitly reported in the paper
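
Pass@k is usually estimated from n sampled answers per question, of which c are correct, with the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that standard estimator; whether the paper computes Pass@1 and Pass@32 exactly this way is an assumption, since the summary does not report the methodology.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the chance that at least one of k samples is correct.

    n: samples drawn per question, c: how many of them are correct, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k equal to n (for example 32 samples and Pass@32), the estimate is 1.0 as soon as a single sample is correct, which is why Pass@32 is read as a reasoning boundary rather than an average.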

Key Results

Benchmark	Metric	Comparison	Reference	Result	Δ
XLRS-Bench	Pass@1	Full method	Not reported in the paper	60.40	Not reported in the paper
XLRS-Bench	Pass@1	Full method vs. text cold start without CoT	60.40	54.49	-5.91
XLRS-Bench	Pass@32	Not specified	Not reported in the paper	Not reported in the paper	-0.50

Experiment Figures

Comparison of SFT, RLVR, and Agentic RLVR on Pass@32 (Reasoning Boundary).

Scaling effects of Earth-science text QA data on Pass@1 and Pass@32.

Main Takeaways

High-quality Earth-science text-only QA is a primary driver of visual reasoning gains in UHR scenarios, even without images.
Reasoning boundary (Pass@32) is driven by domain-prior coverage, while average performance (Pass@1) is driven by reasoning structure (CoT) and agentic tuning.
Agentic RLVR is unstable without sufficient domain supervision; 'pre-warming' with hard image-text pairs is essential.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
Supervised Fine-Tuning (SFT)
Visual Question Answering (VQA)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward is binary and deterministic based on the correctness of the final answer

UHR: Ultra-High-Resolution—Referring to images with extremely high pixel counts (e.g., 8K resolution) common in remote sensing

Agentic RLVR: An RLVR setup where the model can take active steps (like using tools) to acquire evidence before answering

Cold-start SFT: The initial phase of supervised training used to initialize the model before reinforcement learning begins

Pass@k: A metric measuring the probability that at least one correct answer is generated within k samples

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of samples to stabilize training

CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer
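
To make the RLVR and GRPO entries above concrete, the sketch below pairs a binary answer check with the group-wise reward normalization. It is a minimal illustration under assumptions: the string normalization rule, the function names, the use of the population standard deviation, and the `eps` stabilizer are choices made here, not details taken from the paper.

```python
import re
from statistics import mean, pstdev
from typing import List


def normalize(answer: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", answer.lower()).strip()


def verifiable_reward(predicted: str, reference: str) -> float:
    """Binary reward: 1.0 if the normalized answers match exactly, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and that question contributes no learning signal, which is one general reason a policy that rarely finds the right region benefits from a warm start.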