WebLI-100B: A private dataset constructed by the authors containing 100 billion image-text pairs from the web, used to test scaling limits
SigLIP: Sigmoid Loss for Language Image Pre-training—a stable and efficient loss function for training CLIP-style models, used as the primary training objective here
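The sigmoid loss can be sketched in a few lines: unlike CLIP's softmax (InfoNCE) objective, every image-text pair in the batch is treated as an independent binary classification, with matching pairs as positives and all other pairings as negatives. This is a minimal NumPy sketch, not the authors' implementation; the initial values for the learnable temperature `t` and bias `b` follow the SigLIP paper.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    Each pair (i, j) is an independent binary classification:
    the diagonal (i == j) is positive, everything else negative.
    t and b are the learnable temperature and bias (init values shown).
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b                # (N, N) similarity logits
    labels = 2 * np.eye(len(img)) - 1          # +1 on diagonal, -1 off it
    # -log sigmoid(label * logit), written stably, averaged over the batch.
    return np.mean(np.log1p(np.exp(-labels * logits)))
```

Because each pair contributes independently, the loss needs no batch-wide normalizer, which is part of why it trains stably at large batch sizes.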
Dollar Street: A dataset of household items photographed in homes around the world, used to evaluate how well models recognize objects across different cultures and income levels
GeoDE: Geographically Diverse Evaluation—a benchmark for evaluating object recognition systems across geographically diverse regions
Crossmodal-3600: A multilingual image captioning evaluation dataset covering 36 geographically diverse languages
Representation Bias (RB): A metric measuring the disparity in how often a model associates different demographic groups (e.g., gender) with positive or negative concepts
Association Bias (AB): A metric measuring stereotypical associations (e.g., gender vs. occupation) in model predictions
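The disparity idea behind these metrics can be illustrated with a toy computation. This is only a hedged sketch of the general recipe (compare how often a model assigns a concept to each demographic group), not the paper's exact formula; the function name and inputs are hypothetical.

```python
import numpy as np

def disparity_score(probs_group_a, probs_group_b):
    """Illustrative disparity between two demographic groups.

    probs_group_a / probs_group_b: per-example probabilities that the
    model assigns some concept (e.g. a positive attribute) to images
    of each group.  The score is the absolute gap in mean rates; a
    perfectly balanced model scores 0.  (Hypothetical formulation,
    not the paper's exact RB/AB definition.)
    """
    return abs(np.mean(probs_group_a) - np.mean(probs_group_b))
```

Real bias metrics add details (which concepts, how predictions are elicited, aggregation across templates), but all reduce to comparing group-level rates like this.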
mT5: Multilingual T5—a transformer-based language model pre-trained on a massive multilingual corpus; its multilingual tokenizer is used here for the text encoder

PII: Personally Identifiable Information—sensitive data removed during dataset construction
Zero-shot retrieval: Evaluating a model's ability to find relevant images for text (or vice versa) without seeing any labeled examples from that specific task during training
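The retrieval evaluation itself is simple once embeddings exist: rank all candidate images by cosine similarity to each text query and check whether the true match comes out on top. A minimal sketch of text-to-image Recall@1, assuming pair i in the image set matches pair i in the text set (the standard convention for retrieval benchmarks):

```python
import numpy as np

def recall_at_1(img_emb, txt_emb):
    """Text-to-image Recall@1 with no task-specific training.

    For each text embedding, rank every image by cosine similarity
    and count it correct if the top-ranked image is its true pair.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = txt @ img.T                      # (N_text, N_image) similarities
    top1 = sims.argmax(axis=1)              # best image index per text query
    return np.mean(top1 == np.arange(len(txt)))
```

Recall@5 or Recall@10 follow the same pattern with `argsort` instead of `argmax`; image-to-text retrieval just transposes the roles of the two embedding sets.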