RS-MoE: A Vision–Language Model With Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering

📝 Paper Summary

Remote Sensing Image Captioning (RSIC)
Visual Question Answering (VQA)
Mixture of Experts (MoE) in Vision-Language Models

RS-MoE adapts the Mixture of Experts framework to remote sensing by using an Instruction Router to direct sub-tasks (theme, object, relationship) to specialized lightweight Large Language Models.

Core Problem

Standard Vision-Language Models (VLMs) fine-tuned for remote sensing struggle to capture the complex, diverse geographic objects and their relationships found in large-scale aerial imagery.

Why it matters:

Remote sensing images cover larger areas with more diverse objects than natural images, requiring specialized interpretation.
Existing methods often produce simple, repetitive captions that lack detailed relationship understanding.
Applying MoE to this domain is challenging due to sparsity-induced degradation when transferring from natural image pre-training.

Concrete Example: Traditional models might caption an image simply as 'A residential area.' RS-MoE aims to produce detailed captions like 'A dense residential area with green rooftops, where roads intersect near a park,' by routing object recognition and relationship inference to different experts.

Key Novelty

RS-MoE (Remote Sensing Mixture of Experts)

Replaces the standard feed-forward network in MoE with multiple lightweight Large Language Models (LLMs) acting as experts.
Introduces an Instruction Router that dynamically generates tailored prompts for each expert based on the input image and global instructions.
Decomposes the captioning task into three explicit sub-tasks: theme comprehension, object recognition, and relationship inference.

Architecture

The overall architecture of RS-MoE, showing the flow from Image Encoder to VLM Encoder, and finally to the MoE Block with the Instruction Router and multiple LLM experts.

Evaluation Highlights

RS-MoE-1B (the lightweight variant) achieves performance comparable to 13B-parameter VLMs on captioning tasks.
Achieves state-of-the-art results on five remote sensing image captioning datasets (RSICap, UCM-Captions, Sydney-Captions, RSICD, NWPU-Captions).
Demonstrates strong generalization on Remote Sensing Visual Question Answering (RSVQA) tasks without specific architectural changes.

Breakthrough Assessment

8/10

Presented as the first application of MoE specifically to multimodal remote sensing. Using LLMs as experts together with an Instruction Router is a novel architectural shift for this domain, and it yields both parameter efficiency and SOTA results.

⚙️ Technical Details

Problem Definition

Setting: Generate descriptive captions S = {s1, ..., sn} for a remote sensing image I.

Inputs: Remote sensing image I (W x H x 3)

Outputs: Textual caption S describing geographic objects O and their relationships R.

Pipeline Flow

Image Encoder (extracts visual features)
VLM Encoder (aligns visual features with instructions)
MoE Block (generates caption using multiple expert LLMs and Instruction Router)
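
The three stages above can be sketched as plain function composition. This is a minimal illustration of the data flow only, not the paper's implementation: every function name, the prompt template, and the stub experts are hypothetical, and the real components are a ViT, a Q-Former style encoder, and lightweight LLMs.

```python
from typing import Callable, Dict, List

# Sub-tasks named in the summary; everything else here is a hypothetical stand-in.
SUBTASKS = ("theme", "object", "relationship")
Expert = Callable[[str, List[float]], str]


def image_encoder(image: List[List[float]]) -> List[float]:
    """Stand-in for the frozen visual backbone: one mean value per image row."""
    return [sum(row) / len(row) for row in image]


def vlm_encoder(features: List[float], instruction: str) -> Dict[str, object]:
    """Stand-in for the alignment stage: pairs visual features with the instruction."""
    return {"features": features, "instruction": instruction}


def instruction_router(aligned: Dict[str, object]) -> Dict[str, str]:
    """Hypothetical router: emits one text prompt per sub-task (assumed template)."""
    base = aligned["instruction"]
    return {s: f"{base} Focus on the {s} aspect." for s in SUBTASKS}


def make_stub_expert(subtask: str) -> Expert:
    """Builds a placeholder expert; a real expert would be a lightweight LLM."""
    def expert(prompt: str, features: List[float]) -> str:
        return f"<{subtask} text from {len(features)} features>"
    return expert


def moe_block(aligned: Dict[str, object], experts: Dict[str, Expert]) -> str:
    """Routes a prompt to each expert, then joins the outputs into one caption."""
    prompts = instruction_router(aligned)
    parts = [experts[s](prompts[s], aligned["features"]) for s in SUBTASKS]
    return " ".join(parts)


def caption(image: List[List[float]], instruction: str,
            experts: Dict[str, Expert]) -> str:
    """Image Encoder -> VLM Encoder -> MoE Block."""
    return moe_block(vlm_encoder(image_encoder(image), instruction), experts)


experts = {s: make_stub_expert(s) for s in SUBTASKS}
result = caption([[0.1, 0.2], [0.3, 0.4]], "Describe the image.", experts)
print(result)
```

Joining the expert outputs with a space is the simplest possible aggregation and is an assumption of this sketch.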

System Modules

Image Encoder (Input Processing)

Extract visual features from the input remote sensing image

Model or implementation: ViT-G/14 (frozen)
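
The "/14" in ViT-G/14 denotes a 14 x 14 pixel patch size, so the number of visual tokens depends on the input resolution. The short calculation below assumes a 224 x 224 input, which is a common choice but is not stated in this summary; `num_patches` is a hypothetical helper.

```python
def num_patches(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT would cut from an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Assumed 224 x 224 input with the 14-pixel patches of ViT-G/14.
tokens = num_patches(224, 224, 14)
print(tokens)  # 16 * 16 = 256
```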

VLM Encoder (Input Processing)

Align visual features with task instructions

Model or implementation: Q-Former style architecture (Self-Attention + Cross-Attention)
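
The cross-attention step in a Q-Former style encoder lets a small set of learnable query vectors read from the image features. Below is a single-head NumPy sketch of that step alone, under assumed sizes (4 queries, 16 patches, width 8) and random weights; it omits the self-attention layers, multi-head splitting, and instruction tokens.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: queries attend over image features."""
    q = queries @ w_q            # (num_queries, dim)
    k = image_feats @ w_k        # (num_patches, dim)
    v = image_feats @ w_v        # (num_patches, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each query's distribution over patches
    return weights @ v, weights


rng = np.random.default_rng(0)
num_queries, num_patches, dim = 4, 16, 8   # illustrative sizes, not from the paper
queries = rng.normal(size=(num_queries, dim))
image_feats = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = cross_attention(queries, image_feats, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The output has one row per query regardless of how many patches the image produced, which is what makes this design a fixed-size bridge between the image encoder and the language experts.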

MoE Block

Generate caption using specialized experts guided by a router

Model or implementation: Mixture of Experts with Instruction Router and N lightweight LLMs

Novel Architectural Elements

Instruction Router: A module that generates dynamic, task-specific natural language prompts for each expert LLM, instead of only producing scalar gating weights.
LLM-based Experts: Using lightweight LLMs as experts within the MoE block instead of simple Feed-Forward Networks.
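
For contrast, the conventional mechanism that the Instruction Router is set against is a gating network that returns scalar weights over experts. The sketch below shows standard top-k softmax gating in pure Python; the logits are made-up numbers, and this is the generic baseline technique, not code from RS-MoE.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k most probable experts and renormalize their weights to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Illustrative logits for three experts; expert 1 is dropped when k=2.
gates = top_k_gate([2.0, 0.5, 1.0], k=2)
print(gates)
```

A gate of this kind outputs only (expert index, weight) pairs, whereas the Instruction Router described above outputs a text prompt per expert.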

Modeling

Base Model: RS-MoE-1B and RS-MoE-7B variants (using ViT-G/14 encoder)

Training Method: Two-stage training strategy with LoRA

Objective Functions:

Purpose: Guide experts to specialize.

Formally: Minimize the loss of each expert LLM's output against sub-task-specific ground truth.

Purpose: Aggregate expert outputs.

Formally: Combine the expert outputs to form the final caption S.
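
One plausible way to write these two objectives is shown below. It assumes a token-level cross-entropy loss per expert, which the summary does not state. All symbols other than S are notation introduced here: K = 3 experts, expert parameters \theta_k, router prompt p_k, visual features v, sub-task ground truth y^{(k)} of length T_k, expert output \hat{y}^{(k)}, and an unspecified aggregation function Agg.

```latex
\mathcal{L}_{\text{expert}} = \sum_{k=1}^{K} \mathcal{L}_k,
\qquad
\mathcal{L}_k = -\sum_{t=1}^{T_k}
  \log p_{\theta_k}\!\left( y^{(k)}_t \,\middle|\, y^{(k)}_{<t},\, p_k,\, v \right)

S = \mathrm{Agg}\!\left( \hat{y}^{(1)}, \dots, \hat{y}^{(K)} \right)
```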

Adaptation: LoRA (Low-Rank Adaptation) used to reduce trainable parameters.
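
To make the parameter saving concrete, here is a minimal NumPy sketch of a LoRA-adapted linear layer: the frozen weight W is left untouched and a low-rank update (alpha / r) * B A is added. The layer size, rank, and alpha are illustrative assumptions, not the settings used for RS-MoE.

```python
import numpy as np


def lora_forward(x, w_frozen, a, b, alpha):
    """y = x W^T + (alpha / r) * x A^T B^T, with W frozen and only A, B trainable."""
    r = a.shape[0]
    return x @ w_frozen.T + (alpha / r) * (x @ a.T) @ b.T


rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8.0     # illustrative sizes only

w_frozen = rng.normal(size=(d_out, d_in))   # pre-trained weight, never updated
a = rng.normal(scale=0.01, size=(r, d_in))  # A: small random initialization
b = np.zeros((d_out, r))                    # B: zeros, so the update starts as a no-op

x = rng.normal(size=(2, d_in))
y = lora_forward(x, w_frozen, a, b, alpha)

full_params = w_frozen.size                 # 64 * 64 = 4096
lora_params = a.size + b.size               # 4 * 64 + 64 * 4 = 512
print(full_params, lora_params)
```

With these sizes the trainable parameter count is one eighth of the full matrix, and the ratio shrinks further as the layer width grows relative to the rank.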

Training Data:

Fine-tuned on RSICap dataset (~3,000 images)
Evaluated on UCM-Captions, Sydney-Captions, and RSICD without additional fine-tuning

Compute: Not reported in the paper

Comparison to Prior Work

vs. RSGPT: RS-MoE uses a novel MoE architecture with specialized experts rather than just fine-tuning a single VLM.
vs. GeoChat: RS-MoE decomposes the task into sub-tasks (theme, object, relationship) handled by different experts.
vs. MoE-LLaVA: RS-MoE uses an Instruction Router to generate textual prompts for experts, whereas standard MoE uses gating networks for routing.

Limitations

The paper mentions potential sparsity-induced degradation when applying MoE to remote sensing, which requires a specific two-stage training strategy to mitigate.
Fine-tuning relies on a relatively small dataset (~3,000 images) compared to large-scale natural image datasets.
The complexity of managing multiple LLM experts could increase inference resource requirements compared to single small models, though the 1B variant is efficient.

Reproducibility

No code URL is provided. The RSICap dataset is publicly available from prior work (RSGPT). Specific hyperparameters such as learning rate and batch size are not detailed.

📊 Experiments & Results

Evaluation Setup

Image Captioning and Visual Question Answering on remote sensing datasets.

Benchmarks:

RSICap (Image Captioning, fine-tuning dataset)
UCM-Captions (Image Captioning, zero-shot evaluation)
Sydney-Captions (Image Captioning, zero-shot evaluation)
RSICD (Image Captioning, zero-shot evaluation)
RSVQA (Visual Question Answering)

Metrics:

BLEU
METEOR
ROUGE
CIDEr

Statistical methodology: Not explicitly reported in the paper
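
The captioning metrics listed above are all built on n-gram overlap between a generated caption and a reference. As a minimal illustration, the sketch below computes clipped n-gram precision, the core quantity inside BLEU. It is not full BLEU (no brevity penalty, no geometric mean over n, single reference only), and the example captions are adapted from the illustrative caption earlier in this summary.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "a dense residential area with green rooftops"
candidate = "a residential area with rooftops"

p1 = clipped_precision(candidate, reference, 1)  # all 5 unigrams appear in the reference
p2 = clipped_precision(candidate, reference, 2)  # 2 of 4 bigrams appear in the reference
print(p1, p2)
```

The example shows why higher-order n-grams matter: the short candidate matches every word of the reference yet only half of its bigrams, because it drops the modifiers "dense" and "green".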

Key Results

Performance on the RSICap dataset shows RS-MoE variants outperforming baselines.

Benchmark	Metric	Baseline	This Paper	Δ
RSICap	BLEU-4	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

RS-MoE-1B achieves performance comparable to 13B VLMs, demonstrating the efficiency of the MoE design.
The model generalizes well to traditional datasets (UCM, Sydney, RSICD) without direct fine-tuning on them.
The two-stage training strategy effectively mitigates performance degradation caused by sparsity in MoE models for remote sensing.
Decomposing tasks into theme, object, and relationship sub-tasks improves caption precision and context.

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (VLMs)
Mixture of Experts (MoE)
Transformer architecture (ViT, LLMs)
LoRA (Low-Rank Adaptation)

Key Terms

RSIC: Remote Sensing Image Captioning—generating text descriptions for aerial/satellite imagery.

MoE: Mixture of Experts—a machine learning technique where different parts of the model (experts) specialize in different tasks or data subsets.

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices.

VLM: Vision-Language Model—a model that processes both image and text inputs to perform tasks like captioning or VQA.

ViT: Vision Transformer—a transformer-based architecture for image processing that splits images into patches.

Instruction Router: A novel module in this paper that generates specific text prompts to guide each expert LLM based on visual features and task instructions.

RSVQA: Remote Sensing Visual Question Answering—answering natural language questions about remote sensing images.