IMU: Inertial Measurement Unit—a sensor package (accelerometers, gyroscopes, and sometimes magnetometers) that measures linear acceleration (specific force), angular rate, and sometimes orientation
Perceiver Resampler: A neural network module that converts variable-length input features into a fixed number of token embeddings
FSDP: Fully Sharded Data Parallel—a data-parallel training technique that shards model parameters, gradients, and optimizer states across multiple GPUs to reduce per-GPU memory
QLoRA: Quantized Low-Rank Adaptation—a fine-tuning method that keeps the base model frozen in quantized form (e.g., 4-bit) and trains only small low-rank adapter layers
CIDEr: Consensus-based Image Description Evaluation—an automated metric for evaluating the quality of image captions against human references
VQA: Visual Question Answering—a task where a model answers natural-language questions about an image
LLM: Large Language Model—a massive neural network trained on text to generate human-like language
SPICE: Semantic Propositional Image Caption Evaluation—a metric that evaluates caption quality by comparing scene graphs parsed from the generated and reference captions
ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation (Longest Common Subsequence)—a metric measuring text overlap between generated and reference summaries
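
As an illustration of the ROUGE-L entry above, the following is a minimal sketch of the longest-common-subsequence computation behind the metric. It assumes lowercased whitespace tokenization and an equal precision/recall weighting (beta = 1); the function names `lcs_length` and `rouge_l` are hypothetical, and published implementations may differ in tokenization, stemming, and the choice of beta.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence ending just before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-score between a candidate and one reference string.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


if __name__ == "__main__":
    print(rouge_l("the cat is on the mat", "the cat sat on the mat"))
```

Because the score depends on the longest in-order match rather than on contiguous n-grams, a candidate that preserves word order but inserts or drops a word still scores highly.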