VLM: Vision-Language Model—a model trained to associate images with text descriptions
CLIP: Contrastive Language-Image Pre-training, a specific VLM trained with a contrastive objective so that matching images and texts have nearby embeddings in a shared space
Proxy: An automatically generated data instance (image crop, text label, and 3D point cloud crop) used for pretraining without manual labels
PointNet++: A deep neural network architecture designed to consume raw point clouds directly by aggregating features hierarchically
Frustum: A 3D region extruded from a 2D image bounding box into 3D space, used to isolate point clouds corresponding to 2D detections
DBSCAN: Density-Based Spatial Clustering of Applications with Noise—a clustering algorithm used here to clean point clouds within 3D frustums
Zero-shot transfer: Evaluating a model on categories it was not explicitly trained on, using only category names/descriptions
RGB-D: Image data containing both color (RGB) and Depth information
LiDAR: Light Detection and Ranging, a sensing method that measures the distance to a target with pulsed laser light, producing sparse 3D point clouds
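
To make the Frustum and DBSCAN entries concrete, here is a minimal sketch of how the two could fit together, using only the Python standard library. It is illustrative, not the implementation used in the source. The pinhole intrinsics (fx, fy, cx, cy), the box format (u_min, v_min, u_max, v_max), the parameters eps and min_pts, and the helper names frustum_crop, dbscan, and largest_cluster are all assumptions. Keeping only the largest cluster is likewise one plausible cleaning rule, not one stated in the glossary.

```python
from collections import Counter
from math import dist


def frustum_crop(points, box, fx, fy, cx, cy):
    """Keep camera-frame points whose pinhole projection lands inside a 2D box."""
    u_min, v_min, u_max, v_max = box
    kept = []
    for x, y, z in points:
        if z <= 0:  # point is behind the camera, cannot project into the image
            continue
        u = fx * x / z + cx
        v = fy * y / z + cy
        if u_min <= u <= u_max and v_min <= v <= v_max:
            kept.append((x, y, z))
    return kept


def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # core point: keep growing the cluster
                queue.extend(j_nbrs)
    return labels


def largest_cluster(points, labels):
    """Assumed cleaning rule: keep only the most populated non-noise cluster."""
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return []
    best = counts.most_common(1)[0][0]
    return [p for p, label in zip(points, labels) if label == best]


# Toy scene: four object points near z = 5, one background point far behind
# the object along the same viewing ray, and one point outside the 2D box.
scene = [
    (0.0, 0.0, 5.0), (0.1, 0.0, 5.0), (0.0, 0.1, 5.1), (0.1, 0.1, 5.0),
    (0.0, 0.0, 20.0),
    (5.0, 0.0, 5.0),
]
in_frustum = frustum_crop(scene, box=(300, 220, 340, 260), fx=500, fy=500, cx=320, cy=240)
labels = dbscan(in_frustum, eps=0.3, min_pts=3)
cleaned = largest_cluster(in_frustum, labels)
```

In this toy scene the background point projects into the box and so survives the frustum crop, but DBSCAN labels it as noise because it has no close neighbors. That is the situation the clustering step is meant to handle. A real pipeline would more likely use a spatial index or a library implementation than this quadratic-time search.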