3D Scene Graph (3DSG): A structured representation where nodes represent objects in 3D space and edges represent spatial or semantic relationships between them
Open-vocabulary: The ability to recognize and reason about objects or concepts not explicitly defined or seen during the model's training phase
LVLM: Large Vision-Language Model—a model capable of understanding images and generating text descriptions (e.g., LLaVA)
SAM: Segment Anything Model—a foundation model for generating segmentation masks for any object in an image
CLIP: Contrastive Language-Image Pre-Training—a model that learns to associate images with text descriptions in a shared embedding space
IoU: Intersection over Union, the area of overlap between two bounding boxes or masks divided by the area of their union; it ranges from 0 (no overlap) to 1 (identical)
Affordance: The actionable properties of an object (e.g., a chair 'affords' sitting)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise—a clustering algorithm used here to clean 3D point clouds
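
To make the IoU entry concrete, here is a minimal sketch. It assumes axis-aligned 2D boxes given as (x_min, y_min, x_max, y_max) and masks given as sets of (row, col) pixel coordinates. The names `box_iou` and `mask_iou` are illustrative and do not come from the source.

```python
from typing import Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout
Pixel = Tuple[int, int]                  # (row, col), assumed layout


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes: intersection area / union area."""
    # Width and height of the overlapping rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """IoU of two masks represented as sets of pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 4 + 4 - 1 = 7.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    # Two 2-pixel masks sharing one pixel: intersection 1, union 3.
    print(mask_iou({(0, 0), (0, 1)}, {(0, 1), (1, 1)}))
```

The same ratio extends to 3D boxes or point sets by replacing area with volume or point count.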