MLLM: Multimodal Large Language Model—an AI that can process both text and images to generate responses
Hallucination: When a model generates plausible-sounding but factually incorrect information (e.g., describing an object not present in the image)
Atomic Proposition: A simple, indivisible statement that can be clearly judged as either True or False (e.g., 'The cat is black')
VQA: Visual Question Answering—a task where a model answers questions about the content of an image
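Greedy Decoding: A generation strategy where the model always picks the single most likely next token at each step, so the same input always yields the same output
Stochastic Decoding: A generation strategy where the model samples the next token from its predicted probability distribution, introducing randomness so that repeated runs can produce different outputs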
AUROC: Area Under the Receiver Operating Characteristic curve, a performance metric for binary classification that measures how well scores rank positive examples above negative ones; 1.0 is perfect, 0.5 is random guessing
BLEU/CIDEr/METEOR: Standard metrics for evaluating text generation by measuring word and n-gram overlap between the generated text and human-written references
CLIPScore: A metric measuring how well an image and a caption match using the CLIP model's embedding space
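
Example (greedy vs. stochastic decoding): a minimal sketch of how the two strategies defined above differ when choosing one next token. The toy vocabulary, the logits, and the helper names (`softmax`, `greedy_pick`, `stochastic_pick`) are illustrative assumptions, not taken from any particular model or library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities (temperature is an assumed knob)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always return the index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def stochastic_pick(logits, temperature=1.0, rng=random):
    """Stochastic decoding: sample an index in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical next-token scores over a three-word vocabulary.
vocab = ["cat", "dog", "bird"]
logits = [2.0, 1.5, 0.1]

greedy_token = vocab[greedy_pick(logits)]  # deterministic choice

rng = random.Random(0)  # seeded only so the sketch is reproducible
sampled_tokens = [vocab[stochastic_pick(logits, rng=rng)] for _ in range(5)]
```

Greedy decoding returns the same token every time for the same logits, while the sampled list can mix tokens across draws; lowering the temperature pushes sampling toward the greedy choice.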
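
Example (AUROC): one way to read the AUROC entry above is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs; the function name `auroc` and the convention that label 1 marks the positive class are assumptions made for illustration.

```python
def auroc(labels, scores):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Assumes labels are 1 (positive) or 0 (negative) and that a higher
    score means "more likely positive". Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

# Perfect ranking: every positive outscores every negative.
perfect = auroc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])

# Uninformative scores: all ties, equivalent to random guessing.
chance = auroc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])

# Partially correct ranking: 3 of the 4 pairs are ordered correctly.
partial = auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; library implementations typically sort the scores once instead.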