
See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

Chih Yao Hu, Yang-Sen Lin, Yuna Lee, Chih-Hai Su, Jie-Ying Lee, Shr-Ruei Tsai, Chin-Yang Lin, Kuan-Wen Chen, Tsung-Wei Ke, Yu-Lun Liu
National Yang Ming Chiao Tung University, National Taiwan University
arXiv (2025)
MM Agent Reasoning Benchmark

📝 Paper Summary

Aerial Vision-and-Language Navigation (AVLN), Zero-shot Robotics, VLM for Control
SPF repurposes vision-language models to control drones by prompting them to point at 2D waypoints on images, which are then geometrically converted into 3D flight commands.
Core Problem
Existing methods either have the VLM emit free-form text actions or restrict it to a limited skill library, so they fail to generalize to novel instructions or environments without extensive training.
Why it matters:
  • Textual outputs (e.g., 'move forward') lack the floating-point precision needed for safe aerial maneuvers
  • End-to-end policies trained on limited expert demonstrations fail to generalize to unseen real-world environments
  • Current VLM approaches restrict drones to pre-defined discrete skills, reducing control precision and trajectory smoothness
Concrete Example: Given the instruction 'Fly through the window,' a standard VLM might output the text 'Move forward 1 meter,' which specifies neither a precise direction nor a reliable distance. SPF instead asks the VLM to mark the pixel coordinates of the window center (e.g., [320, 240]), which are then geometrically converted into a 3D displacement vector.
Key Novelty
See, Point, Fly (SPF)
  • Reframes action prediction as a 2D spatial grounding task: the VLM annotates a target pixel (waypoint) and a depth label on the current image
  • Uses geometric unprojection to lift the 2D waypoint into a 3D displacement vector based on the drone's camera intrinsics
  • Employs an adaptive scalar that adjusts flight speed based on proximity to obstacles and targets, enabling smooth approach without explicit depth sensors
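The lifting step described in these bullets can be sketched in a few lines of Python. This is a minimal illustration under a standard pinhole camera model, not the authors' implementation: the function names, the horizontal field-of-view parameter, the number of depth labels, the step-length bounds, and the quadratic label-to-distance mapping are all assumptions chosen for the example.

```python
import math


def depth_label_to_step(label, num_labels=5, max_step_m=2.0, min_step_m=0.2):
    """Map a discrete depth label (1..num_labels) to a step length in metres.

    Hypothetical mapping: a quadratic curve gives short, cautious steps for
    small labels (target or obstacle is close) and long steps for large ones.
    """
    if not 1 <= label <= num_labels:
        raise ValueError("depth label out of range")
    frac = label / num_labels
    return max(min_step_m, max_step_m * frac ** 2)


def unproject_waypoint(u, v, depth_m, width, height, hfov_deg):
    """Lift pixel (u, v) to a camera-frame displacement in metres.

    Returns (right, down, forward). Assumes square pixels and a principal
    point at the image centre; the focal length is derived from the
    horizontal field of view.
    """
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    cx, cy = width / 2.0, height / 2.0
    right = (u - cx) / focal_px * depth_m
    down = (v - cy) / focal_px * depth_m
    forward = depth_m
    return (right, down, forward)


# Example: a waypoint at the centre of an assumed 640x480 frame.
step = depth_label_to_step(3)
displacement = unproject_waypoint(320, 240, step, 640, 480, hfov_deg=90.0)
```

Under these assumptions, a waypoint at the image centre such as [320, 240] produces a displacement with no lateral or vertical component, so the drone would fly straight ahead by the step length. Waypoints off-centre add proportional sideways or vertical motion. A real system would still need to convert this camera-frame vector into the drone's own command convention.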
Architecture
Figure 2: The iterative perception-action loop of the SPF framework.
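As a rough illustration of what an iterative perception-action loop looks like in code, the sketch below wires together three injected callables. Everything here is hypothetical: the function name, the `(u, v, depth_label)` waypoint tuple, the convention that the VLM returns `None` when the task is complete, and the step cap are assumptions for the example, not details taken from the paper.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical waypoint format: pixel column, pixel row, discrete depth label.
Waypoint = Tuple[int, int, int]


def run_navigation_loop(
    instruction: str,
    capture_frame: Callable[[], Any],
    query_vlm: Callable[[Any, str], Optional[Waypoint]],
    execute: Callable[[Waypoint], None],
    max_steps: int = 50,
) -> int:
    """Repeat see -> point -> fly until the VLM stops or the step cap is hit.

    Returns the number of waypoints that were executed.
    """
    steps = 0
    for _ in range(max_steps):
        frame = capture_frame()                   # see: grab the current image
        waypoint = query_vlm(frame, instruction)  # point: VLM picks a 2D waypoint
        if waypoint is None:                      # assumed "task complete" signal
            break
        execute(waypoint)                         # fly: convert to motion and send
        steps += 1
    return steps
```

Passing the camera, model, and controller in as callables keeps the loop independent of any particular drone SDK or VLM API, which makes it easy to exercise with stubs before flying real hardware.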
Evaluation Highlights
  • 93.9% success rate in the DRL Simulator, outperforming the PIVOT baseline (28.7%) by over 65 percentage points
  • 92.7% success rate in real-world evaluations with a DJI Tello drone, compared to significantly lower reliability from baselines
  • Generalizes across models: achieves a 100% success rate with Gemini 2.5 Pro, Gemini 2.0 Flash, and GPT-4.1 on navigation tasks
Breakthrough Assessment
9/10
Eliminates the need for training navigation policies entirely while outperforming trained baselines by large margins (more than 60 percentage points). A highly effective repurposing of VLM grounding capabilities.