RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

📝 Paper Summary

Remote Sensing Vision-Language Models
Multimodal Large Language Models (MLLMs)

RSUniVLM is a 1B-parameter remote sensing model that unifies image-level, region-level, and pixel-level tasks into a single text-generation framework using a granularity-oriented mixture of experts.

Core Problem

Existing remote sensing VLMs lack pixel-level understanding (segmentation) and struggle with multi-image inputs (change detection), limiting them to coarse image or region-level tasks.

Why it matters:

Fine-grained understanding is critical for practical applications like land-cover mapping and environmental monitoring where precise boundaries matter.
Current models require separate specialized architectures for segmentation vs. captioning, preventing unified reasoning across different granularities.
Change detection requires reasoning across multiple images, which most single-image VLMs cannot handle effectively.

Concrete Example: A user asks to 'Segment the road area in this image that has changed in the second image.' Existing models like GeoChat can only provide bounding boxes or text descriptions, so they cannot generate the precise pixel-level mask the task requires.

Key Novelty

Granularity-oriented Mixture of Experts (G-MoE)

Decouples the LLM's reasoning into three specific experts: Image-level (global semantics), Region-level (localization), and Pixel-level (segmentation/fine details).
Uses a training-free task router to dynamically assign inputs to the correct expert based on the task type, preventing interference between different visual granularities.
Unifies all outputs, including segmentation masks, into a text-only format using semantic descriptors, enabling end-to-end training without task-specific heads.

Architecture

Overview of RSUniVLM architecture and the Granularity-oriented Mixture of Experts (G-MoE).

Evaluation Highlights

+29.71 percentage-point accuracy improvement on VRSBench-Ref visual grounding compared to GeoChat (69.31% vs. 39.6%).
Achieves 86.86% accuracy on SIRI-WHU scene classification, outperforming LHRS-Bot-7B (83.94%) and GeoChat-7B (43.67%).
Competitive zero-shot change detection on WHU-CD (F1 71.38%) compared to supervised methods trained on 5% labeled data.

Breakthrough Assessment

8/10

First RS-specialized VLM to unify pixel-level segmentation with high-level reasoning and multi-image change detection in a single model, significantly outperforming larger baselines.

⚙️ Technical Details

Problem Definition

Setting: Unified multimodal generation where input images X_v and instruction X_instruct are mapped to textual response X_a, which may represent natural language, bounding boxes, or segmentation masks.

Inputs: Single or multiple remote sensing images and a text instruction.

Outputs: Text sequence representing answers, bounding boxes [x1, y1, x2, y2], or segmentation masks (encoded as semantic descriptors).
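Because bounding boxes arrive as plain text, a consumer of the model output has to parse them back into numbers. The following is a minimal sketch of such a parser, assuming boxes appear literally as `[x1, y1, x2, y2]` in the response. The helper name `parse_boxes` is hypothetical, and the coordinate scale (pixels or normalized values) is not specified here, so the sketch leaves the numbers untouched.

```python
import re

# One signed integer or decimal number, captured as a group.
NUM = r"(-?\d+(?:\.\d+)?)"
# Four comma-separated numbers inside square brackets, with optional spaces.
BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([NUM] * 4) + r"\s*\]")


def parse_boxes(text):
    """Return every well-formed (x1, y1, x2, y2) tuple found in text.

    Hypothetical helper for illustration; it is not part of the released code.
    """
    boxes = []
    for match in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(value) for value in match.groups())
        # Drop boxes whose corners are not ordered top-left to bottom-right.
        if x1 <= x2 and y1 <= y2:
            boxes.append((x1, y1, x2, y2))
    return boxes
```

Calling `parse_boxes` on a response that contains no bracketed group returns an empty list, which lets the caller distinguish a plain-language answer from a grounding answer.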

Pipeline Flow

Image Encoder (extracts visual features)
MLP Connector (aligns visual features to text space)
Task Router (selects expert based on task type)
G-MoE LLM (generates response using specific expert)
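The four stages above can be sketched as a simple function chain. This is an illustrative outline only: the task names, the `TASK_TO_GRANULARITY` table, and the `route` and `run_pipeline` functions are all assumptions, and the encoder, connector, and LLM are passed in as plain callables rather than real models. The point it shows is that a training-free router can be a fixed lookup with no learned gate.

```python
from enum import Enum


class Granularity(Enum):
    IMAGE = 0
    REGION = 1
    PIXEL = 2


# Hypothetical task names; this mapping is an assumption for illustration.
TASK_TO_GRANULARITY = {
    "scene_classification": Granularity.IMAGE,
    "vqa": Granularity.IMAGE,
    "visual_grounding": Granularity.REGION,
    "semantic_segmentation": Granularity.PIXEL,
}


def route(task_type):
    """Training-free routing: a fixed table lookup, no learned parameters."""
    try:
        return TASK_TO_GRANULARITY[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}")


def run_pipeline(images, instruction, task_type, encoder, connector, llm):
    """Chain the four stages; encoder, connector, and llm are caller-supplied callables."""
    visual_features = [encoder(image) for image in images]        # stage 1
    visual_tokens = [connector(feat) for feat in visual_features]  # stage 2
    expert = route(task_type)                                      # stage 3
    return llm(visual_tokens, instruction, expert)                 # stage 4
```

Passing a list of images keeps the single-image and multi-image (change detection) cases on the same code path.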

System Modules

Image Encoder

Extract visual features from RS images

Model or implementation: SigLIP-400m

Task Router

Route tokens to the specific expert based on task granularity (Image/Region/Pixel)

Model or implementation: Training-free gating mechanism

G-MoE Experts

Process tokens with specialized FFNs for different granularities

Model or implementation: Qwen2-0.5B (modified FFNs)

Novel Architectural Elements

Granularity-oriented Mixture of Experts (G-MoE) where experts are explicitly defined by visual granularity (Image, Region, Pixel) rather than learned automatically.
Unified text-only output representation handling bounding boxes and segmentation masks (via semantic descriptors) in one autoregressive stream.

Modeling

Base Model: Qwen2-0.5B (LLM) + SigLIP-400m (Vision Encoder)

Training Method: Two-stage training: (1) Full-parameter multi-task pre-training, (2) G-MoE fine-tuning

Objective Functions:

Purpose: Maximize likelihood of target answer tokens.

Formally: Autoregressive language modeling loss, i.e., the negative log-likelihood of the answer tokens under p(X_a | X_v, X_instruct).
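Written out token by token, the objective takes the standard autoregressive form below. The symbols T (answer length), x_{a,t} (the t-th answer token), and θ (model parameters) are notation introduced here for illustration; X_v, X_instruct, and X_a follow the problem definition.

```latex
\mathcal{L}(\theta)
  = -\log p_\theta\!\left(X_a \mid X_v, X_{\text{instruct}}\right)
  = -\sum_{t=1}^{T} \log p_\theta\!\left(x_{a,t} \mid X_v, X_{\text{instruct}}, x_{a,<t}\right),
\qquad X_a = (x_{a,1}, \dots, x_{a,T}).
```

Only the answer tokens contribute to the sum; the image and instruction tokens act as conditioning context.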

Adaptation: G-MoE (duplicating FFNs 3 times) in Stage 2
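A minimal sketch of the duplication step, assuming each expert starts as an independent copy of the stage-1 FFN and is then updated separately. The `FFN` class is a toy stand-in with a single weight list, and `GMoELayer` with its expert names is a hypothetical construct, not the released implementation.

```python
import copy


class FFN:
    """Toy stand-in for a transformer feed-forward block (one weight list)."""

    def __init__(self, weights):
        self.weights = list(weights)

    def __call__(self, x):
        # Toy computation: scale each input value by the matching weight.
        return [w * v for w, v in zip(self.weights, x)]


class GMoELayer:
    """Holds one FFN copy per granularity and applies the one selected by name."""

    EXPERT_NAMES = ("image", "region", "pixel")

    def __init__(self, pretrained_ffn):
        # Each expert is a deep copy, so later updates do not leak across experts.
        self.experts = {
            name: copy.deepcopy(pretrained_ffn) for name in self.EXPERT_NAMES
        }

    def __call__(self, x, expert_name):
        return self.experts[expert_name](x)
```

Starting from identical copies means the layer initially reproduces the stage-1 model for every task, and the experts diverge only as stage-2 training updates them.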

Trainable Parameters: Approximately 1B total parameters

Training Data:

1.2 million instruction-following samples
Sources: RSVQA, NWPU, fMoW, DIOR-RSVG, Landcover, LEVIR-MCI, plus general data (GeoChat, ShareGPT4V).

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 4
warmup_ratio: 0.03
epochs: 1 (per stage)

Compute: 4 NVIDIA A40 GPUs (40GB) for approximately 30 hours

Comparison to Prior Work

vs. GeoChat: Supports pixel-level segmentation and multi-image inputs; significantly higher grounding accuracy (+29.71 percentage points on VRSBench-Ref).
vs. LHRS-Bot: Smaller parameter count (1B vs 7B) with superior performance on multiple benchmarks; supports pixel-level tasks.
vs. Change-Agent: Unified architecture handling VQA/Grounding/Segmentation alongside change detection, rather than a specialized model.

Limitations

Weak multi-turn conversation capability compared to larger general VLMs.
Unable to perform generative image tasks like super-resolution or dehazing.
Limited to 1B parameter scale in current implementation.

Reproducibility

Code: https://github.com/LianXH/RSUniVLM

Code and model available at https://github.com/LianXH/RSUniVLM. Datasets are public compilations. Semantic descriptor implementation follows Text4Seg.

📊 Experiments & Results

Evaluation Setup

Zero-shot and fine-tuned evaluation across multiple RS tasks.

Benchmarks:

RSVQA-LR/HR (Visual Question Answering)
DIOR-RSVG (Visual Grounding)
VRSBench-Ref (Visual Grounding)
LEVIR-MCI (Change Captioning)
WHU-CD (Change Detection)
Vaihingen (Semantic Segmentation)

Metrics:

Accuracy
mIoU (mean Intersection over Union)
F1 Score
BLEU/METEOR/CIDEr (for captioning)
Statistical methodology: Not explicitly reported in the paper
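For reference, the two mask-based metrics listed above can be computed from flattened label sequences as sketched below. This is a generic formulation, not the benchmarks' official evaluation code: the function names are hypothetical, and details such as how absent classes are handled or whether counts are accumulated over the whole dataset may differ per benchmark.

```python
def confusion_counts(pred, target, positive=1):
    """Count true positives, false positives, and false negatives for one class."""
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    return tp, fp, fn


def f1_score(pred, target, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the class never appears."""
    tp, fp, fn = confusion_counts(pred, target, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_iou(pred, target, classes):
    """Average per-class IoU = TP / (TP + FP + FN), skipping classes absent from both."""
    ious = []
    for cls in classes:
        tp, fp, fn = confusion_counts(pred, target, positive=cls)
        union = tp + fp + fn
        if union:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For binary change detection, `f1_score` with the changed class as `positive` corresponds to the F1 figure, while `mean_iou` averages over every class passed in `classes`.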

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Visual Grounding results demonstrate significant superiority over existing RS-VLMs, particularly on VRSBench-Ref.
VRSBench-Ref	Accuracy @ 0.5	39.6	69.31	+29.71
DIOR-RSVG	Accuracy @ 0.5	48.04	72.47	+24.43
Change Captioning results show RSUniVLM rivals specialized task-specific models.
LEVIR-MCI	CIDEr	136.56	139.80	+3.24
Scene Classification results show strong performance on SIRI-WHU but mixed results on other datasets.
SIRI-WHU	Accuracy	62.66	68.13	+5.47
AID	Accuracy	91.26	81.18	-10.08
Ablation study confirms the effectiveness of G-MoE over LoRA and standard MoE.
Average VQA	Accuracy	82.75	91.57	+8.82
Average VG	Accuracy	64.56	70.90	+6.34
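The grounding rows report "Accuracy @ 0.5", which conventionally counts a predicted box as correct when its IoU with the ground-truth box reaches 0.5. A minimal sketch of that computation follows, assuming boxes are `(x1, y1, x2, y2)` tuples and that the threshold is inclusive; both function names are hypothetical and the benchmarks' own scripts may differ in such details.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Percentage of predictions whose IoU with the paired target meets the threshold."""
    if not targets:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if box_iou(p, t) >= threshold)
    return 100.0 * hits / len(targets)
```

The Δ column in the table is then simply the difference between the two accuracy percentages, expressed in percentage points.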

Experiment Figures

Radar chart comparing RSUniVLM with other models across tasks and visual examples of capabilities.

Main Takeaways

RSUniVLM sets a new state-of-the-art for visual grounding in remote sensing with a much smaller model size (1B vs 7B/13B).
The G-MoE architecture effectively decouples conflicting granularities (image vs pixel), yielding better performance than standard MoE or monolithic fine-tuning.
Unified text-based segmentation allows for competitive zero-shot segmentation performance without task-specific decoder heads.
Multi-image comprehension capability allows for effective change detection and captioning, tasks typically requiring separate specialized pipelines.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs) architecture (ViT + LLM)
Mixture of Experts (MoE)
Semantic Segmentation via text generation

Key Terms

G-MoE: Granularity-oriented Mixture of Experts—an architecture where experts are specialized for image-level, region-level, or pixel-level tasks rather than generic tokens.

Semantic descriptors: A method to represent segmentation masks as text tokens, allowing an LLM to generate masks auto-regressively.

Visual Grounding: The task of locating objects in an image based on a natural language description (outputting bounding boxes).

Change Detection: Identifying differences between two remote sensing images of the same area taken at different times.

RLE: Run-Length Encoding—a form of lossless data compression used here to represent segmentation masks efficiently.
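To make the idea concrete, the sketch below run-length encodes a grid of per-patch class names row by row and serializes it as text. It illustrates the general technique only: the `*` count syntax, the separators, and all function names are hypothetical and should not be read as the exact descriptor format used by the model or by Text4Seg.

```python
from itertools import groupby


def rle_encode_row(labels):
    """Collapse consecutive repeats into (label, run_length) pairs."""
    return [(label, len(list(run))) for label, run in groupby(labels)]


def rle_decode_row(runs):
    """Expand (label, run_length) pairs back into the original label row."""
    row = []
    for label, count in runs:
        row.extend([label] * count)
    return row


def grid_to_text(grid, run_sep=", ", row_sep=" | "):
    """Serialize a 2D grid of class names; the '*' syntax and separators are hypothetical."""
    rows = []
    for row in grid:
        runs = rle_encode_row(row)
        rows.append(run_sep.join(f"{label} *{count}" for label, count in runs))
    return row_sep.join(rows)
```

Because the encoding is lossless, decoding each row recovers the original label grid, and long uniform regions such as water or farmland compress to a single short run.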

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.
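As a small illustration of the LoRA definition, the sketch below forms the effective weight W + scale · (B A) from a frozen matrix W and two low-rank factors, using plain nested lists. The function names and the `scale` parameter are assumptions for illustration; real implementations operate on tensors and usually keep the factors separate rather than materializing the sum.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, scale=1.0):
    """Return W + scale * (B @ A).

    W is d_out x d_in (frozen), B is d_out x r, and A is r x d_in (trainable).
    """
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

With B initialized to zeros, the effective weight equals W, so fine-tuning starts from the pre-trained model's behavior.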