BEV: Bird's-Eye View—a top-down 2D perspective of a 3D scene, often used for layout understanding
SoLP: Set-of-Line Prompting—the paper's novel technique of overlaying a grid coordinate system on BEV images so that VLMs can propose precise camera coordinates
Visual Prompting: Modifying input images (e.g., adding lines, markers) to guide a model's attention or reasoning without changing its weights
Zero-shot: The ability to perform a task without having been explicitly trained on data for that specific task
CIDEr: Consensus-based Image Description Evaluation—a metric for image captioning quality that scores the TF-IDF-weighted n-gram similarity between a generated caption and a set of human reference captions
IoU: Intersection over Union—a metric for measuring the overlap between the predicted segmentation mask and the ground truth
SAM: Segment Anything Model—a foundation model for image segmentation that can cut out objects from images based on prompts
Back-projection: The mathematical process of mapping 2D image pixels back to 3D coordinates using depth information and camera parameters
ScanNet: A large-scale dataset of annotated 3D indoor scenes
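
The IoU and Back-projection entries above both reduce to short formulas, so a minimal sketch may help make them concrete. It is illustrative only and is not taken from the paper: it assumes binary masks stored as nested lists of 0/1 and a simple pinhole camera, and the names `mask_iou`, `back_project`, `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
def mask_iou(pred, truth):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1  # pixel is set in both masks
            if p or t:
                union += 1  # pixel is set in at least one mask
    # Assumption: two empty masks are scored 0.0 rather than raising an error.
    return inter / union if union else 0.0


def back_project(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with a depth value to a 3D point in the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels; depth is measured along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    truth = [[1, 0], [1, 0]]
    print(mask_iou(pred, truth))
    print(back_project(420, 240, 2.0, 500.0, 500.0, 320.0, 240.0))
```

The back-projected point is expressed in the camera's own frame; placing it in a shared scene frame would additionally require the camera pose, which this sketch leaves out.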