LMM: Large Multimodal Model—a neural network that processes both text and images and generates responses conditioned on them (e.g., GPT-4V, LLaVA)
Instruction Tuning: Fine-tuning a pre-trained model on a dataset of (instruction, output) pairs to improve its ability to follow user commands
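To make the (instruction, output) format concrete, here is a minimal sketch of what such training records and a prompt template might look like; the examples and the Alpaca-style "### Instruction / ### Response" template are illustrative, not drawn from any particular released dataset.

```python
# A minimal sketch of instruction-tuning data (hypothetical examples): each
# record pairs a user instruction with the desired output.
instruction_pairs = [
    {
        "instruction": "Summarize the following sentence in five words or fewer.",
        "input": "The committee postponed the vote until next quarter.",
        "output": "Vote postponed until next quarter.",
    },
    {
        "instruction": "Translate to French: 'Good morning, everyone.'",
        "input": "",
        "output": "Bonjour à tous.",
    },
]

def format_example(example: dict) -> str:
    """Render one (instruction, output) pair into a single training string."""
    prompt = example["instruction"]
    if example["input"]:
        prompt += "\n\n" + example["input"]
    # The model is fine-tuned to continue the prompt with the reference output.
    return f"### Instruction:\n{prompt}\n\n### Response:\n{example['output']}"

for ex in instruction_pairs:
    print(format_example(ex))
    print("---")
```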
CLIP: Contrastive Language-Image Pre-training—a model that learns to map images and text into a shared embedding space, enabling similarity-based matching between them (e.g., via cosine similarity)
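As a rough illustration of that shared embedding space, the sketch below scores candidate captions against an image using the Hugging Face transformers CLIP classes; the checkpoint name is a commonly used public one, and the image path and captions are placeholders.

```python
# A minimal sketch of CLIP-style image-text matching; model name, image path,
# and captions are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
candidate_captions = ["a red ball on the grass", "a cat sleeping on a couch"]

inputs = processor(text=candidate_captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same embedding space; the logits are scaled
# cosine similarities between the image and each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(candidate_captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```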
Visual Grounding: The ability of a model to link textual concepts (e.g., 'red ball') to specific regions or objects in an image
LLaVA: Large Language and Vision Assistant—an open-source LMM architecture that connects a vision encoder (like CLIP) to an LLM (like Vicuna) through a learned projection layer
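The connector idea can be sketched schematically (this is not the released LLaVA code): CLIP patch features are mapped into the LLM's token-embedding space by a small learned projection and prepended to the text embeddings. The dimensions below are illustrative of a LLaVA-1.5-style setup.

```python
# A schematic sketch of the vision-to-language connector; all dimensions and
# the module structure are illustrative, not the official implementation.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 uses a small MLP; the original LLaVA used a single linear layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the CLIP encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

connector = VisionLanguageConnector()
fake_patches = torch.randn(1, 576, 1024)   # e.g., a 24x24 CLIP patch grid
visual_tokens = connector(fake_patches)    # ready to concatenate with text embeddings
print(visual_tokens.shape)                 # torch.Size([1, 576, 4096])
```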
k-means: A clustering algorithm that partitions data points into k groups by repeatedly assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points
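A minimal NumPy sketch of those two alternating steps is shown below; the initialization scheme and the toy data are illustrative.

```python
# A minimal k-means loop: assign each point to its nearest centroid, then
# recompute centroids as cluster means, and repeat until stable.
import numpy as np

def kmeans(points: np.ndarray, k: int, n_iters: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.vstack([np.random.randn(50, 2) + 5, np.random.randn(50, 2) - 5])
labels, centers = kmeans(data, k=2)
print(centers)
```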
ROUGE-L: A metric for evaluating text generation by measuring the longest common subsequence (LCS) between the generated text and a reference, typically reported as an LCS-based F-measure
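A simplified sketch of the computation: find the longest common subsequence of whitespace-split tokens and report an LCS-based F-measure. Official ROUGE implementations add tokenization and stemming details omitted here.

```python
# Simplified ROUGE-L: LCS over whitespace tokens, reported as an F1 score.
def lcs_length(a: list, b: list) -> int:
    # Classic dynamic-programming LCS over token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the cat sat on the mat", "the cat lay on the mat"))  # ~0.83
```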
VQA: Visual Question Answering—a task where a model must answer a natural language question about an image
In-context learning: Providing a model with a few examples of a task within the prompt to guide its generation without updating its weights
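A minimal sketch of few-shot prompt construction under this definition; the task, demonstrations, and template are hypothetical, and the resulting string would be sent to the model as-is, with no gradient updates.

```python
# Few-shot prompting sketch: the task examples live in the prompt itself, and
# the model's weights are never updated. Examples and template are hypothetical.
few_shot_examples = [
    ("great movie, would watch again", "positive"),
    ("the plot made no sense at all", "negative"),
    ("a bit slow, but the ending paid off", "positive"),
]

def build_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for review, label in few_shot_examples:
        lines.append(f"Review: {review}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

# The demonstrations guide the completion without any fine-tuning.
print(build_prompt("i fell asleep halfway through"))
```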