VLM: Vision-Language Model—AI that processes both images and text to perform tasks like captioning or question answering
VQA: Visual Question Answering—The task of answering natural language questions about an image
metric depth estimation: Predicting, for each pixel in an image, the absolute distance (in meters) from the camera to the corresponding scene point, rather than just relative depth
point cloud: A set of data points in space representing a 3D shape or object
CoT: Chain-of-Thought—A prompting technique where the model generates intermediate reasoning steps to solve complex problems
open-vocabulary detection: Object detection that can identify and label objects using arbitrary text descriptions rather than a fixed list of categories
canonicalize: To transform data into a standard or normalized format; here, aligning 3D coordinates to a common coordinate system (e.g., aligning the floor to the horizontal plane)
ViT: Vision Transformer—A neural network architecture for image processing that splits images into patches, used here as the visual encoder
SI units: International System of Units (e.g., meters, centimeters) used for quantitative measurements