
ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin
Tianmushan Laboratory, Foshan Graduate School of Innovation
arXiv (2026)
MM Agent Reasoning Factuality

📝 Paper Summary

Aerial Vision-Language Navigation (Aerial VLN) · Visual Prompting
ViSA replaces text-based scene graphs with visual prompting overlays (Set-of-Mark), enabling Vision-Language Models to verify spatial relationships directly on aerial images for zero-shot drone navigation.
Core Problem
Existing Aerial VLN methods rely on disjoint pipelines that convert images into discrete textual scene graphs. Because text cannot capture continuous 3D spatial layouts, these pipelines produce 'relationship hallucinations'.
Why it matters:
  • Aerial views introduce a domain shift that breaks standard open-vocabulary detectors trained on ground-level data
  • Converting continuous visual scenes into symbolic text graphs (e.g., 'A is left of B') loses geometric context, leading to linguistic ambiguity and navigation failures
  • VLMs hallucinate objects and relationships when processing complex aerial urban scenes without explicit visual grounding
Concrete Example: Given an instruction to find a 'house with white roof on the left of Broadway', a standard pipeline might detect the house and the road separately yet fail to verify the 'left of' relationship because of perspective shifts. ViSA instead overlays numeric markers on the image and verifies the spatial topology visually.
Key Novelty
Visual-Spatial Reasoning (ViSA) Framework
  • Replaces the standard detection-and-planning pipeline with a visual prompting approach using Set-of-Mark (SoM), where the model reasons about numbered regions explicitly overlaid on the image
  • Decomposes navigation into three tightly coupled phases: Perception (generating visual prompts), Verification (an explicit 3-stage logic check), and Execution (unprojecting pixels to 3D coordinates), so that the agent does not act on hallucinations
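The Execution phase is described as unprojecting pixels to 3D coordinates, gated by the Verification phase. A minimal sketch of that idea follows, assuming a standard pinhole camera model and a known depth for the selected pixel. The names `Intrinsics`, `unproject`, and `execute_if_verified`, as well as the intrinsic values, are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def unproject(u: float, v: float, depth: float, k: Intrinsics) -> Point3D:
    """Map pixel (u, v) at a known depth to a 3D point in the camera frame."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def execute_if_verified(
    verified: bool, pixel: Tuple[float, float], depth: float, k: Intrinsics
) -> Optional[Point3D]:
    """Return a 3D waypoint only when the verification phase accepted the target.

    Assumption: verification is reduced to a single boolean here; the paper's
    3-stage check is not reproduced.
    """
    if not verified:
        return None
    u, v = pixel
    return unproject(u, v, depth, k)


K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
waypoint = execute_if_verified(True, (420.0, 240.0), 50.0, K)
rejected = execute_if_verified(False, (420.0, 240.0), 50.0, K)
print(waypoint)  # (10.0, 0.0, 50.0)
print(rejected)  # None
```

The resulting point is expressed in the camera frame; converting it to a world-frame waypoint would additionally need the drone's pose, which this sketch leaves out.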
Evaluation Highlights
  • Achieves a 70.3% relative improvement in Success Rate (SR) over the fully trained state-of-the-art method on the CityNav Test-Unseen split
  • Surpasses the primary baseline GeoNav by 13.8% (relative) on Easy tasks and 71.2% (relative) on Hard tasks in the Val-Seen split
  • Narrows the gap between Oracle Success Rate (OSR) and actual Success Rate (SR) compared to GeoNav, indicating that it more reliably confirms and stops at the correct target
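The highlights above are stated as relative improvements and as an OSR-SR gap. As a reading aid, the sketch below shows how such figures are conventionally computed. The helper names and the input numbers are illustrative assumptions only; they are not results from the paper.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def osr_sr_gap(osr: float, sr: float) -> float:
    """Percentage-point gap between Oracle Success Rate and Success Rate.

    A smaller gap means the agent stops at targets it actually reaches
    more often, instead of passing by them.
    """
    return osr - sr


# Illustrative numbers only (not from the paper): SR rises from 10% to 15%.
example_gain = relative_improvement(new=15.0, baseline=10.0)
example_gap = osr_sr_gap(osr=30.0, sr=20.0)
print(example_gain)  # 50.0
print(example_gap)   # 10.0
```

Note that a relative improvement of 50% here corresponds to only 5 percentage points in absolute terms, so relative figures are best read alongside the baseline value.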
Breakthrough Assessment
8/10
Proposes a significant paradigm shift from text-centric scene graphs to visual-centric prompting for aerial navigation, yielding substantial zero-shot performance gains over trained baselines.