CLIP: Contrastive Language-Image Pre-training—a model trained to match images and texts by maximizing the similarity of matched image-text pairs and minimizing the similarity of mismatched pairs
Attention Pooling: A mechanism to aggregate a sequence of embeddings into a single vector, often using a specific query vector to select relevant information
VLM: Vision-Language Model—a model that processes and relates visual and textual information
MLLM: Multimodal Large Language Model—large models capable of processing both text and images, often used here to generate synthetic captions
Sigmoid Loss: A binary classification loss applied independently to every image-text pair, which allows multiple positive matches per image, unlike a softmax loss, which assumes exactly one positive per image
mIoU: Mean Intersection over Union—a standard metric for semantic segmentation measuring the overlap between predicted and ground truth regions
R@1: Recall at 1—the percentage of times the correct item is retrieved as the top result
Zero-shot: Testing a model on a task or category it was not explicitly trained on
CC3M/CC12M: Conceptual Captions datasets containing 3 million and 12 million image-text pairs respectively
YFCC15M: A subset of the Yahoo Flickr Creative Commons dataset (YFCC100M) containing 15 million image-text pairs
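
The Sigmoid Loss, R@1, and mIoU entries above can be made concrete with a minimal pure-Python sketch. It is an illustration under stated assumptions, not the implementation used by any particular model: the function names are hypothetical, similarities are assumed to be raw logits in a nested list, and for R@1 the correct text for image i is assumed to sit at column i.

```python
import math


def sigmoid_pairwise_loss(sim, labels):
    """Mean binary cross-entropy over every image-text pair.

    sim[i][j] is the similarity logit of image i and text j;
    labels[i][j] is 1 for a matching pair and 0 otherwise.
    A row may contain several 1s (multiple positives per image).
    """
    total, count = 0.0, 0
    for sim_row, label_row in zip(sim, labels):
        for s, y in zip(sim_row, label_row):
            z = s if y == 1 else -s
            # -log(sigmoid(z)) = log(1 + exp(-z)), in a numerically stable form
            total += max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def recall_at_1(sim):
    """Fraction of images whose top-scoring text is the correct one.

    Assumption: the correct text for image i is at column i.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes.

    pred and target are flat lists of per-pixel class ids.
    Classes absent from both pred and target are skipped.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

Because each pair is scored independently in `sigmoid_pairwise_loss`, a label matrix such as `[[1, 1], [0, 1]]` (two valid captions for the first image) is a legal input, which a single-positive softmax formulation would not accept directly.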