ViT: Vision Transformer, a neural network that splits an image into fixed-size patches, embeds each patch as a token, and processes the resulting token sequence with transformer blocks.
CLIP: Contrastive Language-Image Pre-training, a model trained with a contrastive objective to align image and text representations in a shared embedding space; its image encoder is commonly used as the vision encoder in MLLMs.
MLLM: Multimodal Large Language Model, an AI system that takes both images and text as input and typically generates text in response (e.g., GPT-4V, LLaVA).
Penultimate layer: The second-to-last layer of a neural network; ViT features are often extracted from it because the final layer's output is more specialized to the pre-training objective.
Linear probing: A technique to analyze representations by training a simple linear classifier on top of frozen features.
POPE: Polling-based Object Probing Evaluation, a benchmark that measures object hallucination (reporting objects that are not in the image) in MLLMs using yes/no questions about object presence.
MME: A comprehensive evaluation benchmark for MLLMs covering perception and cognition tasks.
OCR: Optical Character Recognition—the task of recognizing and reading text embedded within images.
Visual grounding: The ability of a model to localize the specific object or image region that a piece of text refers to, typically as a bounding box.
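
As an illustration of the linear probing entry above, here is a minimal sketch in Python with NumPy. It assumes the frozen features have already been extracted; the synthetic Gaussian clusters stand in for real encoder outputs, and the names `train_linear_probe` and `probe_accuracy` are hypothetical helpers, not part of any library. Only the linear classifier's weights are trained, and the features themselves are never updated.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, steps=300):
    """Fit a softmax linear classifier on fixed (frozen) features."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs = probs / probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(features, labels, W, b):
    """Fraction of samples the linear probe classifies correctly."""
    preds = (features @ W + b).argmax(axis=1)
    return float((preds == labels).mean())

# Synthetic stand-in for frozen encoder features (assumption: 3 classes, 16 dims).
rng = np.random.default_rng(0)
num_classes, dim = 3, 16
centers = rng.normal(size=(num_classes, dim)) * 3.0
labels = rng.integers(0, num_classes, size=600)
features = centers[labels] + rng.normal(size=(600, dim))

# Hold out a test split and standardize using training statistics only.
train_x, test_x = features[:400], features[400:]
train_y, test_y = labels[:400], labels[400:]
mean, std = train_x.mean(axis=0), train_x.std(axis=0)
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std

W, b = train_linear_probe(train_x, train_y, num_classes)
acc = probe_accuracy(test_x, test_y, W, b)
print(f"probe accuracy: {acc:.3f}")
```

High held-out accuracy from such a probe suggests the information is linearly decodable from the frozen features; low accuracy suggests it is absent or not linearly accessible.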