CLIP: Contrastive Language-Image Pre-training—a model trained to match images with their corresponding text captions
LAION: Large-scale Artificial Intelligence Open Network—a massive open dataset of image-text pairs used for training multimodal models
Zero-shot accuracy: The ability of a model to classify images into categories it has not explicitly seen during training, using only class names/descriptions
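The zero-shot classification idea can be sketched in a few lines: embed the image and the text of each class name, then pick the class whose text embedding is most similar to the image embedding. This is a minimal sketch with toy 3-D vectors standing in for real CLIP embeddings; the function names are illustrative, not from any library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, class_embs):
    """Return the class name whose text embedding best matches the image.

    class_embs maps a class name (e.g. "a photo of a cat") to its text embedding.
    No training on these classes is needed—only their names/descriptions.
    """
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))

# Toy embeddings; in practice these come from CLIP's image and text encoders.
classes = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.1]}
print(zero_shot_classify([0.9, 0.2, 0.0], classes))  # -> cat
```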
DINOv2: A self-supervised vision model used here to generate high-quality image embeddings for clustering
SemDeDup: Semantic Deduplication—a method that removes near-duplicate samples whose image embeddings are highly similar within a cluster, keeping only one representative of each semantic duplicate group
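The core of the deduplication step can be sketched as a greedy filter within one cluster: keep a sample only if no already-kept sample has embedding similarity above a threshold. This is a simplified illustration of the idea, not the reference implementation; the threshold value and function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semdedup(cluster, threshold=0.95):
    """Greedily deduplicate one cluster of (index, embedding) pairs.

    A sample survives only if it is not a near-duplicate (cosine similarity
    >= threshold) of any sample already kept. Returns surviving indices.
    """
    kept = []
    for idx, emb in cluster:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((idx, emb))
    return [idx for idx, _ in kept]

# Samples 0 and 1 are near-duplicates; only one survives.
cluster = [(0, [1.0, 0.0]), (1, [0.999, 0.02]), (2, [0.0, 1.0])]
print(semdedup(cluster))  # -> [0, 2]
```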
SSP-Pruning: Self-Supervised-Prototypes Pruning—a prior method that prunes data by removing 'prototypical' (easy) samples close to cluster centroids
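The pruning criterion above—drop the "easy" samples closest to their cluster centroid—can be sketched as follows. This is a minimal illustration under the assumption of a single cluster and a fixed prune fraction; the function name and parameters are hypothetical.

```python
import math

def ssp_prune(embeddings, centroid, prune_frac=0.5):
    """Drop the prune_frac of samples nearest the centroid (the 'prototypical'
    easy samples) and return the indices of the harder samples that remain."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: math.dist(embeddings[i], centroid))
    n_drop = int(len(embeddings) * prune_frac)
    return sorted(ranked[n_drop:])

# Samples 0 and 1 sit near the centroid (easy) and get pruned;
# the farther, harder samples 2 and 3 are kept.
embs = [[0.1, 0.0], [0.0, 0.1], [2.0, 2.0], [3.0, -1.0]]
print(ssp_prune(embs, centroid=[0.0, 0.0]))  # -> [2, 3]
```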
DataComp: A benchmark competition focusing on dataset curation for multimodal model training
ITM: Image-Text Matching—a score indicating how well an image matches its caption
VTAB: Visual Task Adaptation Benchmark—a suite of diverse vision tasks used to evaluate transfer learning