MLLM: Multi-modal Large Language Model—an LLM adapted to accept inputs from other modalities like images or audio
Instruction Tuning: Fine-tuning a pre-trained LLM on a dataset of (instruction, output) pairs to improve its ability to follow user commands
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
CLIP: Contrastive Language-Image Pre-training—a model trained to align images and text in a shared embedding space
Point Cloud: A set of data points in a 3D coordinate system, commonly used to represent 3D shapes or scenes
PointNet++: A deep neural network architecture that directly processes point clouds by learning hierarchical features
Zero-shot: Evaluating a model on tasks or classes it has not explicitly seen during training
Binary Locating Metric: A metric proposed in this paper where the model's output location is considered correct if it falls within the ground-truth bounding box
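
The Binary Locating Metric entry above can be illustrated with a minimal sketch. It assumes the ground-truth box is axis-aligned and given as a pair of min/max corners, and that points on the box boundary count as correct; the glossary does not specify either detail, and the function names `point_in_box` and `binary_locating_accuracy` are hypothetical.

```python
from typing import Sequence, Tuple

Point = Sequence[float]
Box = Tuple[Point, Point]  # (min corner, max corner); assumed axis-aligned


def point_in_box(point: Point, box_min: Point, box_max: Point) -> bool:
    """Return True if the point lies inside the box on every axis.

    Boundary points are treated as inside (an assumption of this sketch).
    """
    if not (len(point) == len(box_min) == len(box_max)):
        raise ValueError("point and box corners must have the same dimension")
    return all(lo <= p <= hi for p, lo, hi in zip(point, box_min, box_max))


def binary_locating_accuracy(predictions: Sequence[Point],
                             boxes: Sequence[Box]) -> float:
    """Fraction of predicted locations that fall within their ground-truth box."""
    if len(predictions) != len(boxes):
        raise ValueError("need exactly one ground-truth box per prediction")
    if not predictions:
        raise ValueError("no predictions to score")
    hits = sum(point_in_box(p, lo, hi) for p, (lo, hi) in zip(predictions, boxes))
    return hits / len(predictions)


if __name__ == "__main__":
    unit_box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
    preds = [(0.5, 0.5, 0.5), (2.0, 2.0, 2.0)]
    # One prediction is inside the box and one is outside.
    print(binary_locating_accuracy(preds, [unit_box, unit_box]))
```

Each prediction scores either 1 or 0, so the aggregate is a plain hit rate. The same check works for 2D or 3D coordinates, since it only compares the point to the box corners axis by axis.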