LMM: Large Multi-Modal Model—a model capable of processing and generating content across multiple modalities (e.g., image and text)
LRV-Instruction: Large-scale Robust Visual Instruction—the authors' proposed dataset containing 400k visual instructions with balanced positive and negative samples
GAVIE: GPT4-Assisted Visual Instruction Evaluation—the authors' proposed evaluation method, which uses GPT-4 to score a model's responses for accuracy and relevancy without requiring human-annotated ground-truth answers
Negative Instructions: Instructions asking about objects, attributes, or relationships that are NOT present in the image, forcing the model to deny or correct the premise
POPE: Polling-based Object Probing Evaluation—a benchmark that evaluates object hallucination by asking binary yes/no questions of the form 'Is there a...'
Visual Genome: A dataset with detailed visual annotations (objects, attributes, relationships) used as the source for generating instructions
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
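
To make the LoRA entry concrete, here is a minimal sketch in plain Python of a single adapted linear layer. The class name `LoRALinear`, the helper functions, and the values of `rank` and `alpha` are hypothetical choices for illustration; they are not the authors' fine-tuning configuration. The sketch follows the standard LoRA formulation: the pre-trained weight W stays frozen, and the layer uses W + (alpha / rank) * B A, where B starts at zero so the adapted layer initially matches the original.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative LoRA layer: frozen weight plus a trainable low-rank update."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weight, shape d_out x d_in
        # A (rank x d_in) gets a small random init; B (d_out x rank) starts at zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen weight."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the parameters in A and B, the only ones that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [3.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 1.0, 1.0, 1.0]
print(layer.forward(x))       # identical to the frozen layer while B is zero
print(layer.num_trainable())  # 8 adapter parameters versus 16 in W
```

With rank 1 on a 4 x 4 weight, the adapter holds 8 trainable values instead of 16; for the much larger weight matrices in an LMM, a small rank makes the trainable fraction far smaller.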