LMM: Large Multi-Modal Model—an AI model capable of processing and generating content across multiple modalities like text, images, and video
Context Parallelism: A distributed training/inference technique where the token sequence is split across multiple GPUs to handle contexts too long to fit in a single GPU's memory
Logits-Masked Language Modeling Head: An optimization where the final classification layer (head) only computes predictions for relevant positions (e.g., the last token) rather than the entire sequence, saving significant memory
Prefill: The initial phase of LLM inference, where the model processes the input prompt (all history tokens) to build the key-value (KV) cache before decoding any new tokens
SFT: Supervised Fine-Tuning—training a model on labeled examples (instructions and outputs) to improve its ability to follow user commands
RoPE: Rotary Positional Embeddings—a method for encoding token positions in Transformers that generalizes better to sequence lengths not seen during training
MME: A comprehensive evaluation benchmark for multimodal large language models that scores both perception and cognition abilities using yes/no questions
Hallucination: When a model generates incorrect or nonsensical information not supported by the input (e.g., describing an object not present in the image)
VQA: Visual Question Answering—the task of answering natural language questions about an image
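
The memory saving described under Logits-Masked Language Modeling Head can be made concrete with a small sketch. The function names, toy shapes, and the choice to keep only the last position are illustrative assumptions, not the method of any particular model or library; the sketch only shows that projecting the selected positions gives the same logits as the full computation while storing far fewer values.

```python
# Minimal sketch: compute LM-head logits only for selected positions.
# All names and sizes here are hypothetical, chosen for illustration.

def matvec(weight, vec):
    """Multiply a [vocab][dim] weight matrix by a [dim] vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def full_logits(hidden, head_weight):
    """Baseline: one vocab-sized logit row for every sequence position."""
    return [matvec(head_weight, h) for h in hidden]

def masked_logits(hidden, head_weight, keep):
    """Logits only for the positions listed in `keep` (position -> row)."""
    return {pos: matvec(head_weight, hidden[pos]) for pos in keep}

# Toy sizes: 4 positions, hidden dimension 3, vocabulary of 5 tokens.
hidden = [
    [1.0, 0.0, 2.0],
    [0.5, 1.0, -1.0],
    [0.0, 2.0, 1.0],
    [1.5, -0.5, 0.0],
]
head_weight = [
    [0.2, -0.1, 0.4],
    [0.0, 0.3, 0.1],
    [-0.5, 0.2, 0.0],
    [0.1, 0.1, 0.1],
    [0.3, -0.4, 0.2],
]

last = len(hidden) - 1
baseline = full_logits(hidden, head_weight)
selected = masked_logits(hidden, head_weight, [last])

# Number of logit values that must be held in memory in each case.
full_count = len(baseline) * len(head_weight)      # seq_len * vocab
masked_count = len(selected) * len(head_weight)    # kept positions * vocab
print(full_count, masked_count)
```

With a sequence of length T and a vocabulary of size V, the full head holds T × V logits, while keeping only the last position holds V. For long contexts and large vocabularies this difference dominates the head's activation memory, which is why the optimization matters during prefill.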