Visual Grounding (VG): The task of locating a specific object or region in a scene based on a natural language description
Meta-annotations: Comprehensive, hierarchical descriptions of scene elements (objects/regions) used as a source to generate diverse task-specific samples
VLM: Vision-Language Model, a model that takes visual inputs (images) and understands or generates text about them
BEV: Bird's Eye View—a top-down perspective of a 3D scene
ScanRefer: A standard benchmark dataset for 3D object localization using natural language
ScanQA: A benchmark dataset for question answering in 3D scenes
LEO: An embodied generalist agent capable of 3D vision-language tasks
AP: Average Precision, the area under the precision-recall curve, used to score object detection/grounding accuracy
Acc@0.25: Accuracy at an IoU threshold of 0.25; a prediction counts as correct if its Intersection over Union (IoU) with the ground truth is greater than 0.25
Instruction Tuning: Fine-tuning LLMs on datasets formatted as instruction-response pairs to improve their ability to follow instructions
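
To make the Acc@0.25 entry above concrete, here is a minimal sketch of how such a score could be computed. It assumes axis-aligned 3D boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax) and exactly one prediction per description. The box layout and the function names `iou_3d` and `acc_at_iou` are illustrative assumptions, not taken from any benchmark's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout (hypothetical): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])          # start of the overlap on this axis
        hi = min(a[axis + 3], b[axis + 3])  # end of the overlap on this axis
        inter *= max(0.0, hi - lo)          # no overlap on any axis gives 0
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box],
               threshold: float = 0.25) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    is strictly greater than `threshold`."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

For example, boxes (0, 0, 0, 2, 2, 2) and (1, 0, 0, 3, 2, 2) overlap in a volume of 4 out of a union of 12, giving an IoU of 1/3, which counts as a correct prediction at the 0.25 threshold.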