BEV: Bird's Eye View—a top-down 2D projection of a 3D scene, providing a map-like global context
STO-markers: Spatial-Temporal Object markers—visual ID tags (e.g., 'C1', 'C2') overlaid on objects in images to track identity across different views and time
VLM: Vision-Language Model—an AI model trained to understand and generate content based on both image and text inputs
Point Cloud: A set of data points in space representing a 3D shape or object
Mask3D: A specific 3D instance segmentation model used to identify and isolate objects within a 3D point cloud
SQA3D: A benchmark dataset for 3D Situated Question Answering
EM-1: Top-1 Exact Match score, the percentage of questions for which the model's top-ranked answer matches the ground-truth answer exactly
IoU: Intersection over Union, a metric for localization accuracy computed as the size of the overlap between a predicted and a ground-truth bounding box divided by the size of their union; it ranges from 0 (no overlap) to 1 (identical boxes)
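
To make the IoU definition above concrete, here is a minimal sketch for axis-aligned boxes. The function name `box_iou` and the coordinate layout (all minimum coordinates followed by all maximum coordinates) are assumptions chosen for illustration; they are not taken from any particular benchmark's evaluation code, and oriented boxes would need a different intersection routine.

```python
from math import prod
from typing import Sequence


def box_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes laid out as (mins..., maxs...).

    Hypothetical helper: accepts 2D boxes (x1, y1, x2, y2) or
    3D boxes (x1, y1, z1, x2, y2, z2).
    """
    if len(box_a) != len(box_b) or len(box_a) % 2 != 0:
        raise ValueError("boxes must have the same, even number of coordinates")

    d = len(box_a) // 2
    a_min, a_max = box_a[:d], box_a[d:]
    b_min, b_max = box_b[:d], box_b[d:]

    # Overlap extent along each axis, clamped to 0 when the boxes are disjoint.
    overlap = [
        max(0.0, min(a_max[i], b_max[i]) - max(a_min[i], b_min[i]))
        for i in range(d)
    ]
    intersection = prod(overlap)

    size_a = prod(a_max[i] - a_min[i] for i in range(d))
    size_b = prod(b_max[i] - b_min[i] for i in range(d))
    union = size_a + size_b - intersection

    # Degenerate (zero-size) boxes have no meaningful overlap ratio.
    return intersection / union if union > 0 else 0.0
```

For example, two 2x2 squares offset by one unit along each axis share an overlap of area 1 and a union of area 7, so `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.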