Evaluation Setup
Zero-shot navigation in simulation (DRL Simulator) and in the real world (DJI Tello)
Benchmarks:
- DRL Simulator (Simulated Aerial Navigation)
- Real-world Indoor/Outdoor (Physical UAV Navigation) [New]
Metrics:
- Success Rate (SR)
- Completion Time
- Statistical methodology: Not explicitly reported in the paper
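As a rough illustration of how these two metrics could be computed from per-trial logs, here is a minimal sketch. The `Trial` record and its fields are hypothetical and not taken from the paper. The sketch also assumes that completion time is averaged over successful trials only, which the paper summary does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Trial:
    """One navigation attempt (hypothetical record layout)."""
    success: bool
    duration_s: float  # wall-clock time until the trial ended, in seconds


def success_rate(trials: list[Trial]) -> float:
    """Percentage of trials that reached the goal."""
    if not trials:
        raise ValueError("no trials to evaluate")
    return 100.0 * sum(t.success for t in trials) / len(trials)


def mean_completion_time(trials: list[Trial]) -> Optional[float]:
    """Mean duration over successful trials; None if no trial succeeded (assumption)."""
    times = [t.duration_s for t in trials if t.success]
    return mean(times) if times else None


# Made-up trials, purely to exercise the two functions.
trials = [Trial(True, 30.0), Trial(True, 40.0), Trial(False, 60.0), Trial(True, 35.0)]
print(success_rate(trials))
print(mean_completion_time(trials))
```

Excluding failed trials from the time average is one common convention. Averaging over all trials, with a timeout value for failures, would be an equally plausible choice.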
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Simulation results showing SPF significantly outperforming baselines across all tasks. | | | | |
| DRL Simulator | Success Rate (average, %) | 0.9 | 93.9 | +93.0 |
| DRL Simulator | Success Rate (average, %) | 28.7 | 93.9 | +65.2 |
| DRL Simulator (Obstacle Avoidance) | Success Rate (%) | 16 | 92 | +76 |
| Real-world evaluation confirming simulation findings. | | | | |
| Real-world | Success Rate (%) | Not reported in the paper | 92.7 | Not reported in the paper |
| Ablation showing the impact of structured grounding vs. text generation. | | | | |
| Navigation Task | Success Rate (%) | 7 | 100 | +93 |
| Navigation Task | Completion Time (seconds) | 50.25 | 35.20 | -15.05 |
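The Δ column is the value reported for this paper minus the baseline value. The short check below recomputes it from the numbers in the table. The helper name `delta` and the row labels are only for illustration.

```python
def delta(baseline: float, ours: float) -> float:
    """Signed difference (ours - baseline), rounded to 2 decimals to avoid float noise."""
    return round(ours - baseline, 2)


# (label, baseline, this paper, reported delta), copied from the results table.
rows = [
    ("DRL Simulator, success rate (first baseline)", 0.9, 93.9, 93.0),
    ("DRL Simulator, success rate (second baseline)", 28.7, 93.9, 65.2),
    ("DRL Simulator (Obstacle Avoidance), success rate", 16, 92, 76),
    ("Navigation Task, success rate", 7, 100, 93),
    ("Navigation Task, completion time (s)", 50.25, 35.20, -15.05),
]

for label, baseline, ours, reported in rows:
    # Each recomputed difference should match the reported delta.
    assert delta(baseline, ours) == reported, label
```

For success rate a positive Δ is an improvement. For completion time a negative Δ is an improvement, because a lower time is better.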
Main Takeaways
- Visual grounding (pointing) is a far better interface for VLM control than text generation, achieving near-perfect success rates where text-based methods almost always fail
- Adaptive depth scaling allows monocular drones to navigate efficiently without depth sensors by inferring relative distance from the VLM
- The framework is highly robust to model choice, achieving >87% success even with 'Lite' VLM variants
- Closed-loop control enables tracking of dynamic targets (people/objects) despite the latency inherent in large model inference
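To make the pointing-plus-depth idea in these takeaways more concrete, the sketch below turns a VLM-selected pixel and a coarse relative-distance label into a 3D displacement, using a standard pinhole camera model. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the 1-to-5 depth label, the linear step-size mapping, and the camera intrinsics are all hypothetical.

```python
import math


def step_length(depth_label: int, max_label: int = 5,
                min_step_m: float = 0.3, max_step_m: float = 2.0) -> float:
    """Map a coarse depth label (1 = near, max_label = far) to a step length in metres.

    The linear mapping and its bounds are illustrative assumptions.
    """
    label = min(max(depth_label, 1), max_label)  # clamp out-of-range labels
    frac = (label - 1) / (max_label - 1)
    return min_step_m + frac * (max_step_m - min_step_m)


def pixel_to_displacement(u: float, v: float, depth_label: int,
                          fx: float, fy: float, cx: float, cy: float
                          ) -> tuple[float, float, float]:
    """Unproject pixel (u, v) to a unit ray and scale it by the step length.

    Camera frame convention: x right, y down, z forward.
    """
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    s = step_length(depth_label)
    return (s * x / norm, s * y / norm, s / norm)


# Hypothetical intrinsics for a 960x720 image; a target right of centre, far away.
move = pixel_to_displacement(u=700, v=360, depth_label=5,
                             fx=900.0, fy=900.0, cx=480.0, cy=360.0)
print(move)
```

In a closed-loop setting this computation would be repeated on each new frame, so an error in any single step can be corrected by the next one.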