3DVG: 3D Visual Grounding—locating objects in 3D space based on natural language descriptions
VLM: Vision-Language Model—AI models capable of understanding and reasoning about both image and text inputs (e.g., GPT-4o, Gemini)
Zero-Shot: The ability of a model to perform a task without having been explicitly trained on examples of that specific task
Acc@0.25: Accuracy metric measuring if the Intersection over Union (IoU) between the predicted and ground truth bounding boxes is at least 0.25
Acc@0.5: Accuracy metric measuring if the Intersection over Union (IoU) is at least 0.5
Open-Vocabulary: The ability to recognize and process object categories that were not present in the training set
Point Cloud: A set of data points in space representing a 3D shape or object
Homogeneous Transformation Matrix: A 4x4 matrix used to describe the rotation and translation between two coordinate systems (e.g., world to camera)
IoU: Intersection over Union—a metric used to evaluate the overlap between two bounding boxes