MLLM: Multi-modal Large Language Model, an AI model that takes both text and images as input and generates text in response (e.g., GPT-4V, LLaVA)
Hallucination: The generation of content that appears plausible but is factually incorrect or unfaithful to the provided image content
Generative Task: A task where the model produces open-ended text, such as 'Describe this image'
Discriminative Task: A task where the model must classify or choose between options, here specifically answering 'Yes' or 'No' to verify visual details
CHAIR: Caption Hallucination Assessment with Image Relevance, a metric measuring the percentage of objects mentioned in a caption that are not actually present in the image (lower is better)
AMBER Score: A composite score introduced in this paper combining the CHAIR metric (generative) and F1 score (discriminative) to rank MLLM performance
Existence Hallucination: Fabricating objects that are not present in the image at all
Attribute Hallucination: Correctly identifying an object but assigning it the wrong properties (e.g., wrong color, wrong action, wrong number)
Relation Hallucination: Incorrectly describing the relationship (usually spatial) between two existing objects
Counterfactual Prompting: Asking the model about something that is absent from the image (e.g., 'Is there a cat?' when there is none) to test whether it wrongly answers 'Yes'
LLM-free: Evaluation methods that do not require a separate Large Language Model (like GPT-4) to judge the correctness of outputs, relying instead on rule-based matching against annotations
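
As a rough illustration of how the LLM-free metrics defined above fit together, the following Python sketch computes a CHAIR-style ratio by set matching against annotated objects, an F1 score over 'Yes'/'No' answers, and an equal-weight combination of the two. The function names, the toy data, the choice of positive class, and the combination formula 0.5 * ((1 - CHAIR) + F1) are assumptions made for illustration; the glossary only states that the AMBER Score combines CHAIR and F1.

```python
def chair(mentioned_objects, annotated_objects):
    """Fraction of mentioned objects that are absent from the annotations."""
    mentioned = set(mentioned_objects)
    if not mentioned:
        return 0.0  # nothing mentioned, so nothing hallucinated
    hallucinated = mentioned - set(annotated_objects)
    return len(hallucinated) / len(mentioned)


def f1_yes_no(predictions, labels, positive):
    """F1 over Yes/No answers; `positive` names the class treated as positive."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    if tp == 0:
        return 0.0  # avoids division by zero when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def composite_score(chair_value, f1_value):
    """Assumed equal-weight combination: rewards low CHAIR and high F1."""
    return 0.5 * ((1.0 - chair_value) + f1_value)


# Toy generative example: objects extracted from a caption vs. annotated objects.
caption_objects = ["dog", "cat", "frisbee"]   # hypothetical model output
image_objects = ["dog", "frisbee", "grass"]   # hypothetical annotation
chair_value = chair(caption_objects, image_objects)

# Toy discriminative example: model answers vs. ground-truth answers.
model_answers = ["yes", "no", "no", "yes"]
true_answers = ["yes", "no", "yes", "yes"]
f1_value = f1_yes_no(model_answers, true_answers, positive="yes")

print(chair_value, f1_value, composite_score(chair_value, f1_value))
```

Because every step is a set difference or a count over fixed annotations, no judge model is needed, which is the sense in which such an evaluation is LLM-free.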