GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

📝 Paper Summary

Remote Sensing Vision-Language Models (RS-VLMs)
Hallucination Mitigation in Multimodal LLMs

GeoReason aligns the internal reasoning of remote sensing models with their final decisions using a logic-driven dataset and a consistency-aware reinforcement learning strategy that penalizes logical drift.

Core Problem

Current remote sensing models suffer from 'logical hallucinations' where correct answers are derived from flawed reasoning or positional shortcuts rather than spatial evidence, undermining reliability in decision-making.

Why it matters:

Decoupling between reasoning and answers makes models unreliable for strategic spatial tasks like functional zoning or capacity estimation.
Perception-centric models often use 'pseudo-reasoning'—guessing the right answer for the wrong reasons—which prevents genuine high-level cognitive interpretation.

Concrete Example: In a parking area utilization task, a baseline model might correctly select 'Sparsely used' but justify it by paradoxically claiming the area is 'near saturation' with 'numerous objects', revealing a complete disconnect between vision and logic.

Key Novelty

Consistency-Aware Reinforcement Learning with Logical Consistency Reward (LCR)

Creates a dataset (GeoReason-Bench) by synthesizing geometric primitives into high-fidelity reasoning trajectories, ensuring ground truth logic exists.
Uses a novel Logical Consistency Reward (LCR) during training that permutes option orders and checks if the model's reasoning trace leads to the same logical conclusion, penalizing reasoning that drifts despite identical evidence.
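The option-permutation check described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `answer_fn` is a hypothetical stand-in for the model (in the paper the answer is conditioned on the reasoning trace), the default values of `alpha` and `eta` are placeholders because the summary does not report them, and every option text other than 'Sparsely used' and 'Near saturation' is made up for the example.

```python
import random
from typing import Callable, Optional, Sequence

LETTERS = "ABCDEFGH"  # option labels; this sketch supports up to eight options

# Hypothetical stand-in for the model: maps an ordered option list to a letter.
AnswerFn = Callable[[Sequence[str]], str]


def chosen_text(options: Sequence[str], letter: str) -> str:
    """Resolve a letter such as 'B' to the option text it points at."""
    return options[LETTERS.index(letter)]


def lcr_reward(
    options: Sequence[str],
    answer_fn: AnswerFn,
    order: Optional[Sequence[int]] = None,
    alpha: float = 1.0,  # placeholder bonus, value not given in the summary
    eta: float = 1.0,    # placeholder penalty, value not given in the summary
) -> float:
    """Return +alpha if the selected option text survives a reordering, else -eta."""
    if order is None:
        # Sample a random permutation when the caller does not fix one.
        order = list(range(len(options)))
        random.shuffle(order)
    permuted = [options[i] for i in order]
    before = chosen_text(options, answer_fn(options))
    after = chosen_text(permuted, answer_fn(permuted))
    return alpha if before == after else -eta


if __name__ == "__main__":
    options = ["Near saturation", "Sparsely used", "Moderately used", "Empty"]
    swap_first_two = [1, 0, 2, 3]

    # A policy anchored in content always finds the same option text.
    content_anchored = lambda opts: LETTERS[list(opts).index("Sparsely used")]
    # A policy exploiting a positional shortcut always answers 'B'.
    position_biased = lambda opts: "B"

    print(lcr_reward(options, content_anchored, swap_first_two))  # 1.0
    print(lcr_reward(options, position_biased, swap_first_two))   # -1.0
```

A single random permutation can leave the chosen option in place, so a position-biased policy may occasionally look consistent; averaging over several permutations would reduce that risk, though the summary does not say how many permutations the authors use.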

Architecture

The overall architecture of GeoReason, illustrating the transition from perception to reasoning. It shows the data curation pipeline (left) and the two-stage training strategy (right).

Evaluation Highlights

+19.65 percentage points in Reasoning task accuracy over the base Qwen2.5-VL model on GeoReason-Bench (23.86% to 43.51%).
Achieves 51.27% Overall Accuracy, outperforming the compared baselines, which struggle with logical grounding.
Adding the Logical Consistency Reward (LCR) raises Reasoning Accuracy from 36.49% (standard RL) to 43.51% (+7.02 points), indicating that it mitigates logical hallucinations.

Breakthrough Assessment

8/10

Strong methodological contribution in aligning CoT with outcomes via option permutation in RL. Addresses a critical, specific failure mode (pseudo-reasoning) in RS-VLMs with verifiable gains.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal reasoning and visual question answering for remote sensing imagery.

Inputs: Remote sensing image I and structured query q.

Outputs: Reasoning trajectory T (Chain-of-Thought) and final answer A.

Pipeline Flow

Input Processing: Image I + Query q
Reasoning Generation: Generate Chain-of-Thought trace T
Answer Prediction: Predict final answer A

System Modules

Backbone VLM

Jointly processes image and text to generate reasoning and answer

Model or implementation: Qwen2.5-VL-7B (with LoRA rank 16)

Modeling

Base Model: Qwen2.5-VL-7B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward based on group relative advantage.

Formally: Policy gradient with importance sampling and clipping, using group-normalized advantage A_i.

Purpose: Penalize logical inconsistency.

Formally: LCR reward r_LCR checks if answer stays consistent under option permutation P(q). Grants bonus alpha if consistent, penalty eta if contradictory.
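
The two objectives above can be written out as follows. This is a hedged reconstruction: the first two expressions are the standard GRPO form, not copied from the paper; the symbols G (group size), o_i (sampled response), r_i (its reward), rho_i (importance ratio), and epsilon (clip range) are introduced here for illustration; any KL regularization term is omitted because the summary does not mention one. The LCR cases follow the wording above, and the paper may define additional cases.

```latex
% Group-normalized advantage over G sampled responses o_1, ..., o_G
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
           {\operatorname{std}(r_1, \dots, r_G)}

% Clipped importance-sampling objective (standard GRPO form, KL term omitted)
\mathcal{J}(\theta) = \mathbb{E}\left[
  \frac{1}{G} \sum_{i=1}^{G}
  \min\bigl( \rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \bigr)
\right],
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid I, q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid I, q)}

% Logical Consistency Reward under an option permutation P(q)
r_{\mathrm{LCR}} =
\begin{cases}
  +\alpha & \text{if the answer under } P(q) \text{ selects the same option as under } q, \\
  -\eta   & \text{if the two answers contradict each other.}
\end{cases}
```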

Adaptation: LoRA (rank 16)

Trainable Parameters: LoRA parameters (rank 16)

Training Data:

GeoReason-Bench: 4,000 samples
Subset D_SFT: 1k Perception-Logic samples
Subset D_RL: 3k Deductive-Reasoning samples (MCQs)

Key Hyperparameters:

learning_rate_sft: 1e-4
learning_rate_rl: 1e-6
epochs_sft: 1
steps_rl: 1200
lora_rank: 16
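
For reference, the reported settings can be collected into one configuration object. This is only a sketch: the class name and field names are hypothetical, the sample counts are the rounded "1k" and "3k" figures from the Training Data list, and only values stated in this summary are included.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GeoReasonTrainingConfig:
    """Hypothetical container for the hyperparameters listed in this summary."""

    base_model: str = "Qwen2.5-VL-7B"
    lora_rank: int = 16
    # Stage 1: supervised fine-tuning on D_SFT
    learning_rate_sft: float = 1e-4
    epochs_sft: int = 1
    num_sft_samples: int = 1_000  # "1k" in the summary, treated as approximate
    # Stage 2: reinforcement learning (GRPO with LCR) on D_RL
    learning_rate_rl: float = 1e-6
    steps_rl: int = 1200
    num_rl_samples: int = 3_000   # "3k" in the summary, treated as approximate


if __name__ == "__main__":
    cfg = GeoReasonTrainingConfig()
    for name, value in asdict(cfg).items():
        print(f"{name}: {value}")
```

The RL learning rate is two orders of magnitude smaller than the SFT learning rate, which is a common choice for keeping policy updates small after supervised initialization.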

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard GRPO: GeoReason adds LCR (Logical Consistency Reward) to specifically target reasoning-answer alignment, whereas standard GRPO only rewards accuracy and format.
vs. Perception-centric models: GeoReason forces transition to deductive reasoning rather than relying on visual pattern recognition shortcuts.

Limitations

Dataset size is relatively small (4,000 trajectories) compared to general VLM datasets.
Reliance on synthesized reasoning trajectories might introduce artifacts if the synthesis pipeline (GPT-4o) has errors.
No statistical significance tests reported for the accuracy improvements.

Reproducibility

No explicit code URL or repository is provided in the text. The dataset GeoReason-Bench is introduced but its release status is not specified. Base model Qwen2.5-VL is public.

📊 Experiments & Results

Evaluation Setup

Visual Question Answering on remote sensing imagery requiring multi-step reasoning.

Benchmarks:

GeoReason-Bench (Remote Sensing VQA / Reasoning) [New]

Metrics:

Overall Accuracy (OA)
Average Accuracy (AA)
Per-category Accuracy (Count, Color, Shape, Reason, Scene)
Statistical methodology: Not explicitly reported in the paper
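
The summary does not define the three metrics, so the sketch below assumes the usual convention: Overall Accuracy is the fraction of all questions answered correctly, and Average Accuracy is the unweighted mean of the per-category accuracies. The function name and the record layout are illustrative, and the toy records are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_metrics(
    records: Iterable[Tuple[str, bool]],
) -> Tuple[Dict[str, float], float, float]:
    """Compute per-category accuracy, Overall Accuracy, and Average Accuracy.

    records: (category, is_correct) pairs, one per evaluated question.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)

    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())      # weights every question equally
    average = sum(per_category.values()) / len(per_category)   # weights every category equally
    return per_category, overall, average


if __name__ == "__main__":
    toy = [("Count", True), ("Count", True), ("Count", False), ("Reason", False)]
    per_category, oa, aa = accuracy_metrics(toy)
    print(per_category, oa, aa)
```

Under this convention the two numbers diverge when categories are imbalanced: a small but hard category such as Reason pulls AA down more than OA.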

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
GeoReason-Bench	Reasoning Accuracy	23.86 (base Qwen2.5-VL)	43.51	+19.65
GeoReason-Bench	Overall Accuracy (OA)	Not reported in the paper	51.27	Not reported in the paper
GeoReason-Bench	Reasoning Accuracy	31.93	43.51	+11.58
GeoReason-Bench	Reasoning Accuracy	36.49 (standard RL)	43.51	+7.02
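
The Δ column is an absolute difference in accuracy, so it is measured in percentage points rather than as a relative percentage. The short check below recomputes it from the table's own numbers and shows the relative gain alongside for contrast; the helper name `gains` is illustrative.

```python
from typing import Tuple


def gains(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100.0 * (ours - baseline) / baseline, 1)
    return absolute, relative


if __name__ == "__main__":
    # (baseline, this paper) Reasoning Accuracy pairs taken from the table above
    rows = [(23.86, 43.51), (31.93, 43.51), (36.49, 43.51)]
    for baseline, ours in rows:
        absolute, relative = gains(baseline, ours)
        print(f"{baseline:.2f} -> {ours:.2f}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

For the base-model row, the 19.65-point gain corresponds to a relative improvement of roughly 82%, which is why the two readings should not be conflated.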

Experiment Figures

Qualitative comparison of reasoning on a parking area utilization task between standard GRPO (WR-CA) and GeoReason (CR-CA).

Main Takeaways

GeoReason achieves state-of-the-art performance on the proposed benchmark, particularly in reasoning tasks (+19.65 percentage points vs. the base model).
Standard RL (GRPO) improves general accuracy but struggles with reasoning accuracy, indicating it may still rely on shortcuts.
The Logical Consistency Reward (LCR) is critical for bridging the gap between reasoning trace and final answer, significantly boosting reasoning accuracy by penalizing logical drift.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO or similar policy optimization)
Vision-Language Models (VLMs)
Chain-of-Thought (CoT) prompting

Key Terms

RS-VLM: Remote Sensing Vision-Language Model—AI models designed to interpret satellite/aerial imagery using text and visual inputs.

CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer.

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that uses group-level relative rewards to estimate baselines, eliminating the need for a critic network.
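
As a minimal sketch of how a group of sampled responses replaces the critic: each response's reward is standardized against the other responses to the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    The group mean acts as the baseline that a learned critic would otherwise
    provide; eps avoids division by zero when all rewards are equal.
    """
    baseline = mean(rewards)
    spread = pstdev(rewards)
    return [(r - baseline) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one question, scored 1 (correct) or 0 (wrong).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Identical rewards carry no learning signal: every advantage is zero.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```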

LCR: Logical Consistency Reward—a proposed reward signal that penalizes the model if its answer changes when the options are permuted, ensuring the decision is anchored in the reasoning trace.

Logical Hallucination: A phenomenon where a model provides a correct final answer but supports it with incorrect or contradictory reasoning.

SFT: Supervised Fine-Tuning—training on labeled examples to initialize the model's behavior before reinforcement learning.

Geometric Primitives: Basic shapes and structural features (e.g., scale, orientation, density) extracted from images to form the basis of spatial reasoning.