VLN: Vision-Language Navigation—the task of an agent navigating an environment to reach a goal described in natural language
SoM: Set-of-Mark—a visual prompting technique where images are overlaid with numbered masks/markers, allowing models to reference specific regions by ID
VLM: Vision-Language Model—a multimodal AI model capable of understanding and reasoning over both image and text inputs
SPL: Success weighted by Path Length—a metric balancing the success rate with the efficiency of the path taken
OSR: Oracle Success Rate—the percentage of episodes where the agent passes near the target at any point, regardless of whether it stops there
Hallucination: In this context, a model output whose spatial descriptions or identified objects are inconsistent with what the image actually shows
Unprojection: The process of mapping a 2D pixel coordinate from an image back into a 3D world coordinate using camera intrinsics, the camera pose, and depth/altitude data
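
The quantitative entries above (SPL, OSR, Unprojection) can be made concrete with a short sketch. The Python below is illustrative only and rests on several assumptions not stated in the glossary: SPL uses the commonly cited form (mean over episodes of success times shortest-path length divided by the larger of the taken and shortest lengths), OSR measures "near" with straight-line distance and a caller-chosen radius (benchmarks often use geodesic distance instead), and unprojection assumes a pinhole camera with depth measured along the optical axis. All function and parameter names are hypothetical.

```python
from math import dist


def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_len, taken_len) tuples,
    with positive lengths (assumption of this sketch).
    """
    total = 0.0
    for success, shortest_len, taken_len in episodes:
        if success:
            # A successful episode scores 1.0 only if the path was optimal.
            total += shortest_len / max(taken_len, shortest_len)
    return total / len(episodes)


def oracle_success_rate(trajectories, goals, radius):
    """Fraction of episodes whose path passes within `radius` of the goal
    at any step, regardless of where the agent stopped.

    Uses Euclidean distance (assumption; geodesic distance is also common).
    """
    hits = 0
    for path, goal in zip(trajectories, goals):
        if any(dist(point, goal) <= radius for point in path):
            hits += 1
    return hits / len(goals)


def unproject(u, v, depth, fx, fy, cx, cy, rotation=None, translation=None):
    """Map pixel (u, v) with z-depth `depth` to a 3D point.

    fx, fy, cx, cy are pinhole intrinsics. If a camera-to-world
    rotation (3x3 nested list) and translation (length-3) are given,
    the point is returned in world coordinates; otherwise it stays
    in the camera frame.
    """
    cam = (
        (u - cx) * depth / fx,
        (v - cy) * depth / fy,
        depth,
    )
    if rotation is None or translation is None:
        return cam
    return tuple(
        sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
```

For example, an agent that succeeds on an optimal path, succeeds on a path twice the optimal length, and fails a third episode would score an SPL of (1 + 0.5 + 0) / 3 under this definition. For an aerial agent, the altitude above a flat ground plane could stand in for `depth` when the camera points straight down, which is one way to read "depth/altitude data" in the entry above.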