VH: Visual Hallucination—when an MLLM's text response states details about an image that are factually incorrect
MLLM: Multi-modal Large Language Model—AI system that takes both images and text as input and generates text responses (e.g., GPT-4V)
CLIP: Contrastive Language-Image Pre-training—a model that learns to map images and text to a shared embedding space
DINO v2: A self-supervised vision transformer model known for learning robust visual features without text supervision
VHTest: The proposed tool/framework for generating diverse visual hallucination instances
OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text
OEQ: Open-Ended Question—a question requiring a free-form text answer
YNQ: Yes/No Question—a question constrained to a binary yes or no answer