See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation

📝 Paper Summary

Aerial Vision-and-Language Navigation (AVLN) | Zero-shot Robotics | VLM for Control

SPF repurposes vision-language models to control drones by prompting them to point at 2D waypoints on images, which are then geometrically converted into 3D flight commands.

Core Problem

Existing methods treat drone navigation as vague text generation or rely on limited skill libraries, failing to generalize to novel instructions or environments without extensive training.

Why it matters:

Textual outputs (e.g., 'move forward') lack the floating-point precision needed for safe aerial maneuvers
End-to-end policies trained on limited expert demonstrations fail to generalize to unseen real-world environments
Current VLM approaches restrict drones to pre-defined discrete skills, reducing control precision and trajectory smoothness

Concrete Example: When given the instruction 'Fly through the window,' a standard VLM might output the text 'Move forward 1 meter,' which is imprecise. SPF instead prompts the VLM to mark the pixel coordinates of the window center (e.g., [320, 240]), which are then geometrically converted into a 3D flight vector.

Key Novelty

See, Point, Fly (SPF)

Reframes action prediction as a 2D spatial grounding task: the VLM annotates a target pixel (waypoint) and a depth label on the current image
Uses geometric unprojection to lift the 2D waypoint into a 3D displacement vector based on the drone's camera intrinsics
Employs an adaptive scalar that adjusts flight speed based on proximity to obstacles and targets, enabling smooth approach without explicit depth sensors

Architecture

The iterative perception-action loop of the SPF framework.

Evaluation Highlights

93.9% success rate in DRL Simulator, outperforming the PIVOT baseline (28.7%) by over 65 percentage points
92.7% success rate in real-world evaluations with a DJI Tello drone, compared to significantly lower reliability from baselines
Generalizes across models: Achieves 100% success rate with Gemini 2.5 Pro, Gemini 2.0 Flash, and GPT-4.1 on navigation tasks

Breakthrough Assessment

9/10

Eliminates the need to train navigation policies entirely while outperforming baselines by more than 60 percentage points in success rate. A highly effective repurposing of VLM grounding capabilities.

⚙️ Technical Details

Problem Definition

Setting: Iterative target-reaching process in 3D space using monocular RGB input

Inputs: Current visual observation I_t and natural language instruction l

Outputs: 3D motion command m_t (yaw, pitch, throttle)

Pipeline Flow

VLM Planner (Image + Text → 2D Waypoint + Depth Label)
Adaptive Scaler (Depth Label → Step Size)
Action Mapper (2D Waypoint + Step Size → 3D Displacement)
Reactive Controller (3D Displacement → Drone Velocity)
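
The four stages above compose into a single perception-action step. The following is a minimal sketch of that composition only, not the authors' implementation: the function name `spf_step` and the callable signatures are hypothetical, and each stage is passed in as a plain function so that any planner, scaler, mapper, or controller can be plugged in.

```python
from typing import Callable, Tuple

Waypoint = Tuple[float, float]      # (u, v) pixel coordinates on the image
Vec3 = Tuple[float, float, float]   # 3D displacement or velocity

def spf_step(
    image: object,
    instruction: str,
    planner: Callable[[object, str], Tuple[Waypoint, int]],
    scaler: Callable[[int], float],
    mapper: Callable[[Waypoint, float], Vec3],
    controller: Callable[[Vec3], Vec3],
) -> Vec3:
    """Run one iteration of the loop (hypothetical interface)."""
    # 1. VLM Planner: image + text -> 2D waypoint + discrete depth label
    waypoint, depth_label = planner(image, instruction)
    # 2. Adaptive Scaler: depth label -> step size
    step_size = scaler(depth_label)
    # 3. Action Mapper: 2D waypoint + step size -> 3D displacement
    displacement = mapper(waypoint, step_size)
    # 4. Reactive Controller: 3D displacement -> drone velocity
    return controller(displacement)
```

In a real loop this step would be repeated on each new camera frame until the instruction is satisfied; the stopping condition is left out of the sketch.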

System Modules

VLM Planner

Identify the navigation target on the image based on instruction

Model or implementation: Various VLMs (e.g., Gemini 2.5 Pro, GPT-4o)
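
To make the planner's output concrete, the sketch below parses a structured VLM reply into a waypoint and depth label. The JSON schema (`point`, `depth`), the function name `parse_plan`, and the assumed label range of 1 to `num_levels` are illustrative assumptions; the summary does not specify the actual prompt or response format.

```python
import json
import re
from typing import Tuple

def parse_plan(reply: str, width: int, height: int,
               num_levels: int = 10) -> Tuple[Tuple[float, float], int]:
    """Extract ((u, v), depth_label) from a VLM reply (hypothetical schema)."""
    # Models often wrap JSON in extra prose, so grab the outermost {...} span.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in VLM reply")
    data = json.loads(match.group(0))
    u, v = (float(c) for c in data["point"])
    depth_label = int(data["depth"])
    # Reject waypoints outside the image and labels outside the assumed range.
    if not (0 <= u < width and 0 <= v < height):
        raise ValueError(f"waypoint ({u}, {v}) lies outside the image")
    if not 1 <= depth_label <= num_levels:
        raise ValueError(f"depth label {depth_label} out of range")
    return (u, v), depth_label
```

Validating the reply before acting on it is one simple guard against malformed or hallucinated outputs.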

Adaptive Scaler (Control Logic)

Convert abstract depth label into physical step size

Model or implementation: Non-linear scaling function
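
The summary does not give the exact form of the scaling function, so the sketch below assumes a simple power law with a lower clamp. The function name and all parameters (`num_levels`, `max_step_m`, `min_step_m`, `exponent`) are hypothetical. It illustrates the intended behavior: small steps when the target or an obstacle is labeled as close, larger steps when it is labeled as far.

```python
def adaptive_step(depth_label: int, num_levels: int = 10,
                  max_step_m: float = 2.0, min_step_m: float = 0.1,
                  exponent: float = 2.0) -> float:
    """Map a discrete depth label to a step size in meters (assumed power law)."""
    if not 1 <= depth_label <= num_levels:
        raise ValueError("depth_label must be in [1, num_levels]")
    # Normalize the label to (0, 1], then apply the non-linear curve.
    fraction = depth_label / num_levels
    step = max_step_m * fraction ** exponent
    # Clamp from below so the drone always makes some progress.
    return max(min_step_m, step)
```

With these defaults, a label of 5 out of 10 yields a 0.5 m step and a label of 10 yields the full 2.0 m step.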

Action Mapper (Control Logic)

Transform 2D image plan to 3D motion vector

Model or implementation: Pinhole Camera Unprojection
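
Under the standard pinhole model, a pixel (u, v) defines a ray through the camera center with direction ((u - cx)/fx, (v - cy)/fy, 1). The sketch below scales that ray to the chosen step size. Whether SPF applies the step along the ray or along the optical axis is not stated in this summary, so scaling along the ray is an assumption here, as are the function name and the camera frame convention (x right, y down, z forward).

```python
import math
from typing import Tuple

def unproject_waypoint(u: float, v: float, step_m: float,
                       fx: float, fy: float,
                       cx: float, cy: float) -> Tuple[float, float, float]:
    """Lift a pixel to a camera-frame 3D displacement of length step_m."""
    # Ray direction through the pixel in normalized image coordinates.
    x = (u - cx) / fx
    y = (v - cy) / fy
    z = 1.0
    # Rescale the ray so the displacement has the requested length.
    scale = step_m / math.sqrt(x * x + y * y + z * z)
    return (x * scale, y * scale, z * scale)
```

For example, with made-up intrinsics fx = fy = 500 and principal point (320, 240), the image center maps to a displacement straight ahead along z.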

Reactive Controller

Execute 3D displacement via velocity commands

Model or implementation: Closed-loop feedback controller
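
One plausible way to turn a displacement into yaw, pitch, and throttle commands is sketched below. This is an illustrative decomposition, not the paper's controller: the function name, the constant `speed_mps`, and the camera frame convention (x right, y down, z forward) are assumptions. The sketch handles a single step; the loop is closed by re-planning from a fresh image after each step.

```python
import math
from typing import Dict

def displacement_to_command(dx: float, dy: float, dz: float,
                            speed_mps: float = 0.5) -> Dict[str, float]:
    """Decompose a camera-frame displacement into simple drone commands."""
    # Heading change needed to face the target in the horizontal plane.
    yaw_deg = math.degrees(math.atan2(dx, dz))
    # Horizontal distance to cover after turning, and vertical climb (y points down).
    forward_m = math.hypot(dx, dz)
    up_m = -dy
    distance_m = math.hypot(forward_m, up_m)
    if distance_m == 0.0:
        return {"yaw_deg": 0.0, "pitch_mps": 0.0,
                "throttle_mps": 0.0, "duration_s": 0.0}
    # Fly at constant speed; split that speed between forward and vertical axes.
    duration_s = distance_m / speed_mps
    return {
        "yaw_deg": yaw_deg,
        "pitch_mps": forward_m / duration_s,
        "throttle_mps": up_m / duration_s,
        "duration_s": duration_s,
    }
```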

Novel Architectural Elements

Replacement of free-form text action output with prompted 2D spatial grounding output for navigation control
Integration of VLM-predicted discrete depth labels with non-linear scaling for implicit depth estimation without sensors

Modeling

Base Model: Model-agnostic framework evaluated with Gemini 2.5 Pro, Gemini 2.0 Flash, GPT-4.1, Claude 3.7 Sonnet, Llama 4 Maverick

Comparison to Prior Work

vs. TypeFly: SPF generates continuous 3D actions via visual grounding rather than selecting from a limited discrete skill set
vs. PIVOT: SPF directly predicts the optimal waypoint rather than selecting from a pre-computed set of candidates, enabling higher precision
vs. RT-Trajectory: SPF is training-free and uses geometric unprojection, whereas RT-Trajectory requires training a policy to follow the waypoints

Limitations

Reactive control is limited by VLM inference latency (approx. 1-3 seconds), affecting performance with fast dynamic obstacles
Relies on VLM's internal spatial understanding, which can suffer from hallucinations
Adaptive step size is a heuristic and may not perfectly match physical depth in all scenes
Performance depends on the prompt phrasing and the underlying VLM's capability

Reproducibility

Project page provided (https://spf-web.pages.dev). Code repository URL not explicitly listed in text. Evaluated using standard DRL Simulator and commercially available DJI Tello drones. Uses closed-source VLMs (Gemini, GPT-4) via API.

📊 Experiments & Results

Evaluation Setup

Zero-shot navigation in simulation (DRL Sim) and real-world (DJI Tello)

Benchmarks:

DRL Simulator (Simulated Aerial Navigation)
Real-world Indoor/Outdoor (Physical UAV Navigation) [New]

Metrics:

Success Rate (SR)
Completion Time
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Simulation results showing SPF significantly outperforming baselines across all tasks.
DRL Simulator	Success Rate (Average, %)	0.9	93.9	+93.0
DRL Simulator	Success Rate (Average, %)	28.7 (PIVOT)	93.9	+65.2
DRL Simulator (Obstacle Avoidance)	Success Rate (%)	16	92	+76
Real-world evaluation confirming simulation findings.
Real-world	Success Rate (%)	Not reported in the paper	92.7	Not reported in the paper
Ablation showing the impact of structured grounding vs. text generation.
Navigation Task	Success Rate (%)	7	100	+93
Navigation Task	Completion Time (seconds)	50.25	35.20	-15.05
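
The Δ column is the difference between this paper's value and the baseline, in percentage points for success rates and in seconds for completion time. The short check below recomputes it from the table values; the row labels are shortened for readability.

```python
# (label, baseline, this paper) taken from the table above
rows = [
    ("DRL Simulator, average SR, first baseline", 0.9, 93.9),
    ("DRL Simulator, average SR, second baseline", 28.7, 93.9),
    ("DRL Simulator, obstacle avoidance SR", 16.0, 92.0),
    ("Ablation, navigation task SR", 7.0, 100.0),
    ("Ablation, completion time (s)", 50.25, 35.20),
]

# Delta = this paper minus baseline, rounded to the table's precision.
deltas = [round(ours - base, 2) for _, base, ours in rows]

for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.2f}")
```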

Experiment Figures

Real-world completion times and success comparison across 5 representative tasks (obstacle avoidance, long horizon, etc.).

Main Takeaways

Visual grounding (pointing) is a far superior interface for VLM control than text generation, achieving near-perfect success rates where text-based methods almost always fail
Adaptive depth scaling allows monocular drones to navigate efficiently without depth sensors by inferring relative distance from the VLM
The framework is highly robust to model choice, achieving >87% success even with 'Lite' VLM variants
Closed-loop control enables tracking of dynamic targets (people/objects) despite the latency inherent in large model inference

📚 Prerequisite Knowledge

Prerequisites

Pinhole camera model (intrinsic parameters)
Vision-Language Model (VLM) prompting
Basic control theory (velocity/position control)

Key Terms

AVLN: Aerial Vision-and-Language Navigation—controlling drones using visual inputs and language instructions

Visual Grounding: The process of linking language concepts (e.g., 'the red door') to specific pixels or bounding boxes in an image

Unprojection: The geometric process of converting 2D image coordinates into 3D rays using the camera's intrinsic parameters

Zero-shot: The ability of a model to perform a task without having been explicitly trained or fine-tuned on examples of that specific task

DRL Simulator: A high-fidelity drone racing simulator used as a benchmark for physics-based aerial navigation

Affordance: The possibility of an action on an object or environment (e.g., a window affords 'flying through')

VLM: Vision-Language Model—a large AI model trained on images and text that can understand and generate content in both modalities