
Scaling Pre-training to One Hundred Billion Data for Vision Language Models

X Wang, I Alabdulmohsin, D Salz, Z Li, K Rong, X Zhai
Google DeepMind
arXiv, February 2025
Pretraining MM Benchmark

📝 Paper Summary

Vision-Language Pre-training (VLP) · Scaling Laws · Dataset Curation · Fairness and Inclusivity in AI
Scaling noisy web data to 100 billion examples yields negligible gains on traditional Western-centric benchmarks but significantly improves cultural diversity, long-tail concept recognition, and performance on low-resource languages.
Core Problem
Existing vision-language datasets have plateaued around 10 billion examples, and it is unknown whether pushing this scale by an order of magnitude (to 100B) yields further benefits or diminishing returns.
Why it matters:
  • Standard benchmarks (like ImageNet) may be saturated, masking the potential benefits of scaling for other critical dimensions like cultural inclusivity.
  • Quality filtering techniques (like CLIP filtering) used to maximize benchmark performance might inadvertently harm diversity by removing long-tail cultural concepts.
  • Understanding the trade-offs of massive data scaling is crucial for resource allocation in training next-generation multimodal systems.
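The alignment-based filtering mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual pipeline: the function name `clip_filter`, the threshold value, and the toy embeddings are all assumptions. The core idea is simply to keep only image-text pairs whose embedding cosine similarity exceeds a cutoff, which is exactly the mechanism that can discard long-tail cultural content scored as "misaligned".

```python
import numpy as np

def clip_filter(img_emb: np.ndarray, txt_emb: np.ndarray,
                threshold: float = 0.3) -> np.ndarray:
    """Illustrative CLIP-style filtering: keep indices of image-text
    pairs whose cosine similarity meets the threshold.

    img_emb, txt_emb: (n, d) arrays of image and caption embeddings.
    The threshold is a hypothetical value, not one from the paper.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)       # cosine similarity per pair
    return np.where(scores >= threshold)[0]  # indices of retained pairs

# Toy example: pair 0 is perfectly aligned, pair 1 is orthogonal.
img = np.array([[1.0, 0.0], [1.0, 0.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
kept = clip_filter(img, txt)  # only pair 0 survives the filter
```

Note that the filter is blind to *why* a pair scores low: a caption in a low-resource language or an unfamiliar cultural concept can score poorly for the same reason as genuine noise, which is the trade-off the paper highlights.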
Concrete Example: A ViT-L model trained on 10 billion examples achieves only 35.9% accuracy on the culturally diverse Dollar Street 10-shot classification task. When trained on 100 billion examples, the same model jumps to 41.7% accuracy, recognizing household items from underrepresented regions that smaller datasets miss.
Key Novelty
WebLI-100B: The first empirical study of 100-billion scale VLP
  • Constructs a massive dataset of 100 billion image-text pairs (WebLI-100B) to test the limits of data scaling laws in vision-language pre-training.
  • Demonstrates a divergence in scaling behaviors: traditional tasks saturate, while 'inclusive' tasks (cultural diversity, low-resource languages) continue to improve significantly.
  • Reveals that standard quality filters (e.g., keeping only 'aligned' data via CLIP) actively harm cultural representation, suggesting a trade-off between benchmark metric optimization and global inclusivity.
Evaluation Highlights
  • +5.8% absolute improvement on Dollar Street 10-shot classification for ViT-L when scaling from 10B to 100B examples.
  • No statistically significant improvement on ImageNet zero-shot accuracy (p = 0.9) when scaling from 10B to 100B examples.
  • Consistent gains for low-resource languages on Crossmodal-3600 retrieval; the gap between low- and high-resource language gains widens as model size increases.
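The retrieval metric behind the Crossmodal-3600 result above can be illustrated with a minimal recall@1 computation. This is a generic sketch of image-to-text retrieval scoring, not the paper's evaluation code; the function name and the toy similarity matrix are assumptions. Ground truth is taken to lie on the diagonal (caption i belongs to image i).

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """Image-to-text recall@1 from a similarity matrix.

    sim[i, j] scores image i against caption j; the correct caption
    for image i is assumed to be caption i (diagonal ground truth).
    Returns the fraction of images whose top-ranked caption is correct.
    """
    preds = sim.argmax(axis=1)                       # best caption per image
    hits = preds == np.arange(sim.shape[0])          # diagonal matches
    return float(hits.mean())

# Toy 2x2 case: both images rank their own caption first.
score = recall_at_1(np.array([[0.9, 0.1],
                              [0.2, 0.8]]))
```

In a per-language evaluation like Crossmodal-3600, this metric would be computed separately for each language's image-caption pool, which is what makes the low- vs high-resource comparison possible.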
Breakthrough Assessment
8/10
While it doesn't propose a new architecture, the empirical validation of 100B-scale training is a significant milestone. It fundamentally shifts the motivation for scaling from 'better ImageNet accuracy' to 'better cultural inclusivity'.