
Scaling Pre-training to One Hundred Billion Data for Vision Language Models

X Wang, I Alabdulmohsin, D Salz, Z Li, K Rong, X Zhai
Google DeepMind
arXiv, February 2025
Pretraining MM Benchmark

📝 Paper Summary

Vision-Language Pre-training (VLP) · Scaling Laws · Dataset Curation · Fairness and Inclusivity in AI
Scaling noisy web data to 100 billion examples yields negligible gains on traditional Western-centric benchmarks but significantly improves cultural diversity, long-tail concept recognition, and performance on low-resource languages.
Core Problem
Existing vision-language datasets have plateaued around 10 billion examples, and it is unknown whether pushing this scale by an order of magnitude (to 100B) yields further benefits or diminishing returns.
Why it matters:
  • Standard benchmarks (like ImageNet) may be saturated, masking the potential benefits of scaling for other critical dimensions like cultural inclusivity.
  • Quality filtering techniques (like CLIP filtering) used to maximize benchmark performance might inadvertently harm diversity by removing long-tail cultural concepts.
  • Understanding the trade-offs of massive data scaling is crucial for resource allocation in training next-generation multimodal systems.
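The alignment-based filtering mentioned above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual pipeline: the function name `clip_filter`, the threshold value, and the toy embeddings are all assumptions. The core idea is simply to keep only image-text pairs whose embedding cosine similarity exceeds a cutoff, which is exactly the mechanism that can discard long-tail cultural content scored as "misaligned".

```python
import numpy as np

def clip_filter(img_emb: np.ndarray, txt_emb: np.ndarray,
                threshold: float = 0.3) -> np.ndarray:
    """Illustrative CLIP-style filtering: keep indices of image-text
    pairs whose cosine similarity meets the threshold.

    img_emb, txt_emb: (n, d) arrays of image and caption embeddings.
    The threshold is a hypothetical value, not one from the paper.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    scores = np.sum(img * txt, axis=1)       # cosine similarity per pair
    return np.where(scores >= threshold)[0]  # indices of retained pairs

# Toy example: pair 0 is perfectly aligned, pair 1 is orthogonal.
img = np.array([[1.0, 0.0], [1.0, 0.0]])
txt = np.array([[1.0, 0.0], [0.0, 1.0]])
kept = clip_filter(img, txt)  # only pair 0 survives the filter
```

Note that the filter is blind to *why* a pair scores low: a caption in a low-resource language or an unfamiliar cultural concept can score poorly for the same reason as genuine noise, which is the trade-off the paper highlights.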
Concrete Example: A ViT-L model trained on 10 billion examples achieves only 35.9% accuracy on the culturally diverse Dollar Street 10-shot classification task. When trained on 100 billion examples, the same model jumps to 41.7% accuracy, recognizing household items from underrepresented regions that smaller datasets miss.
Key Novelty
WebLI-100B: The first empirical study of 100-billion scale VLP
  • Constructs a massive dataset of 100 billion image-text pairs (WebLI-100B) to test the limits of data scaling laws in vision-language pre-training.
  • Demonstrates a divergence in scaling behaviors: traditional tasks saturate, while 'inclusive' tasks (cultural diversity, low-resource languages) continue to improve significantly.
  • Reveals that standard quality filters (e.g., keeping only 'aligned' data via CLIP) actively harm cultural representation, suggesting a trade-off between benchmark metric optimization and global inclusivity.
Evaluation Highlights
  • +5.8% absolute improvement on Dollar Street 10-shot classification for ViT-L when scaling from 10B to 100B examples.
  • No statistically significant improvement on ImageNet zero-shot accuracy (p = 0.9) when scaling from 10B to 100B examples.
  • Consistent gains for low-resource languages on Crossmodal-3600 retrieval; the gap between low- and high-resource language gains widens as model size increases.
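The retrieval metric behind the Crossmodal-3600 result above can be illustrated with a minimal recall@1 computation. This is a generic sketch of image-to-text retrieval scoring, not the paper's evaluation code; the function name and the toy similarity matrix are assumptions. Ground truth is taken to lie on the diagonal (caption i belongs to image i).

```python
import numpy as np

def recall_at_1(sim: np.ndarray) -> float:
    """Image-to-text recall@1 from a similarity matrix.

    sim[i, j] scores image i against caption j; the correct caption
    for image i is assumed to be caption i (diagonal ground truth).
    Returns the fraction of images whose top-ranked caption is correct.
    """
    preds = sim.argmax(axis=1)                       # best caption per image
    hits = preds == np.arange(sim.shape[0])          # diagonal matches
    return float(hits.mean())

# Toy 2x2 case: both images rank their own caption first.
score = recall_at_1(np.array([[0.9, 0.1],
                              [0.2, 0.8]]))
```

In a per-language evaluation like Crossmodal-3600, this metric would be computed separately for each language's image-caption pool, which is what makes the low- vs high-resource comparison possible.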
Breakthrough Assessment
8/10
While it doesn't propose a new architecture, the empirical validation of 100B-scale training is a significant milestone. It fundamentally shifts the motivation for scaling from 'better ImageNet accuracy' to 'better cultural inclusivity'.