Zero-Shot: Performing a task without having been explicitly trained on examples of that specific task
VLM: Vision-Language Model—an AI model trained on images and text to understand visual inputs via natural language
Dynamic Stitching: A strategy that combines multiple images into grid-layout composite images to stay within VLM input limits while retaining visual detail
Visual-Retrieval Benchmark: A novel benchmark proposed in the paper to evaluate how different image stitching layouts affect a VLM's ability to retrieve specific information
ScanRefer: A dataset for 3D visual grounding on ScanNet scenes containing user queries and target object locations
Nr3D: A dataset from ReferIt3D containing natural language queries for distinguishing objects in 3D scenes
SAM: Segment Anything Model—a model that can generate segmentation masks for any object in an image given a prompt
Grounding DINO: An open-set object detector that can detect arbitrary objects specified by text prompts
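Chamfer Distance: A metric for the dissimilarity between two point clouds, computed from the distance between each point and its nearest neighbor in the other cloud; lower values indicate more similar clouds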
Acc@0.25: Accuracy metric giving the percentage of predicted 3D bounding boxes whose Intersection over Union (IoU) with the ground-truth box exceeds 0.25
SOTA: State-of-the-Art—the current best performance achieved by any method
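
The two metrics defined above (Chamfer Distance and Acc@0.25) can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax), and the Chamfer Distance is the symmetric variant that sums the two mean nearest-neighbor Euclidean distances (other works use squared distances or different normalization). The function names `box_iou_3d`, `acc_at_iou`, and `chamfer_distance` are hypothetical.

```python
import numpy as np


def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])   # lower corner of the overlap region
    hi = np.minimum(a[3:], b[3:])   # upper corner of the overlap region
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes are disjoint
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0


def acc_at_iou(pred_boxes, gt_boxes, threshold=0.25):
    """Fraction of predictions whose IoU with the paired ground truth exceeds threshold."""
    hits = [box_iou_3d(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits)) if hits else 0.0


def chamfer_distance(p, q):
    """Symmetric Chamfer Distance between point clouds p (N, 3) and q (M, 3).

    Assumed variant: mean nearest-neighbor Euclidean distance from p to q,
    plus the same from q to p.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Pairwise distance matrix of shape (N, M), built by broadcasting.
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

For example, `acc_at_iou(preds, gts)` returns a value in [0, 1] that is multiplied by 100 to report a percentage. The brute-force distance matrix in `chamfer_distance` uses O(N·M) memory, so a spatial index such as a k-d tree would be the usual choice for large clouds.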