VLM: Vision-Language Model—a model trained to associate images and text, often enabling zero-shot classification
Zero-shot classification: Classifying images into categories the model wasn't explicitly trained on, usually by comparing image features to text embeddings of class names
CLIP: Contrastive Language-Image Pre-training—a popular VLM architecture that learns by matching image-caption pairs
MNIST: A classic dataset of handwritten digits (0-9), typically considered a 'solved' problem in computer vision
NegCLIP: A VLM variant trained with hard negative examples (incorrect captions that are grammatically similar to correct ones) to improve understanding of relations and word order
ViT: Vision Transformer—an architecture that applies the Transformer mechanism directly to sequences of image patches
Top-k accuracy: A metric that counts a prediction as correct if the true label is among the k classes to which the model assigns the highest scores (top-1 is ordinary accuracy)
Distilled benchmark: A carefully selected subset of tasks that correlates highly with the full suite's performance, enabling faster evaluation
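
As a rough illustration of how the zero-shot classification and top-k accuracy entries fit together, here is a minimal sketch in plain Python. It assumes image and class-name embeddings have already been produced by some VLM; the toy vectors and the helper names (`cosine`, `zero_shot_scores`, `top_k_accuracy`) are hypothetical and chosen only for this example, not taken from any particular library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_emb, class_text_embs):
    """Score one image embedding against the text embedding of each class name."""
    return [cosine(image_emb, t) for t in class_text_embs]

def top_k_accuracy(all_scores, true_labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, label in zip(all_scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(true_labels)

# Toy, hand-made embeddings (assumption: a real VLM would produce these).
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4]]
true_labels = [0, 1, 2]

all_scores = [zero_shot_scores(e, class_text_embs) for e in image_embs]
top1 = top_k_accuracy(all_scores, true_labels, k=1)
top2 = top_k_accuracy(all_scores, true_labels, k=2)
print(f"top-1: {top1:.3f}, top-2: {top2:.3f}")
```

In this toy setup the third image scores highest for class 0 but its true label, class 2, comes second, so it counts as a miss at k=1 and a hit at k=2. No training on the three classes is involved: adding a category only requires adding one more text embedding.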