Grounding: Connecting linguistic terms (e.g., 'red chair') to specific physical objects or coordinates in a 3D scene.
Hallucination: When a model generates text describing objects or attributes that do not actually exist in the input scene.
3D-LLM: A Large Language Model adapted to take 3D spatial data (like point clouds) as input alongside text.
Sim-to-Real Transfer: Training a model on simulated/synthetic data and successfully applying it to real-world data without retraining.
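ScanNet: A widely used dataset of real-world indoor scenes, captured as RGB-D scans with 3D reconstructions and semantic annotations, used for benchmarking.
IoU: Intersection over Union, a metric measuring the overlap between a predicted bounding box and the ground truth box, computed as the size of their intersection divided by the size of their union.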
Dense Grounding: Associating every relevant noun phrase in a sentence with a specific object in the scene, rather than just the main subject.
LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique for LLMs.
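ZeRO-2: Stage 2 of the Zero Redundancy Optimizer (ZeRO), a memory optimization for distributed training of large models that partitions optimizer states and gradients across data-parallel workers.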
FlashAttention: An algorithm that speeds up attention computation in Transformers while reducing memory usage.
Visual Grounding: The task of locating an object in an image or 3D scene based on a natural language description.
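
To make the IoU entry above concrete, here is a minimal sketch of the metric for 3D boxes. It assumes axis-aligned boxes written as (xmin, ymin, zmin, xmax, ymax, zmax); the function names box_volume and iou_3d are illustrative and not taken from any particular benchmark toolkit. Oriented boxes would need a different intersection routine.

```python
from typing import Sequence


def box_volume(box: Sequence[float]) -> float:
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    volume = 1.0
    for axis in range(3):
        # Clamp at zero so a degenerate (inverted) box has zero volume.
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Intersection over Union of two axis-aligned 3D boxes (assumed format)."""
    intersection = 1.0
    for axis in range(3):
        lo = max(pred[axis], gt[axis])          # start of the overlap on this axis
        hi = min(pred[axis + 3], gt[axis + 3])  # end of the overlap on this axis
        intersection *= max(0.0, hi - lo)       # zero if the boxes do not overlap
    union = box_volume(pred) + box_volume(gt) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    ground_truth = (1.0, 1.0, 1.0, 3.0, 3.0, 3.0)
    print(iou_3d(predicted, ground_truth))
```

For the two boxes in the example, the overlap is a unit cube and each box has volume 8, so the IoU is 1 / (8 + 8 - 1) = 1/15. Grounding benchmarks typically count a prediction as correct when its IoU with the ground truth exceeds a fixed threshold.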