
The effectiveness of MAE pre-pretraining for billion-scale pretraining

Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra
Meta AI
arXiv (2023)
Pretraining, MM, Benchmark

📝 Paper Summary

Visual Representation Learning, Self-Supervised Learning, Foundation Models
A two-stage training strategy that initializes Vision Transformers with self-supervised Masked Autoencoding (MAE) on billion-scale data before weakly supervised pretraining, significantly improving convergence and downstream performance.
Core Problem
State-of-the-art vision models typically rely on either self-supervised learning or weakly supervised pretraining (WSP) on noisy data, but WSP from random initialization can be inefficient and fail to leverage the full structure of visual data.
Why it matters:
  • Training foundation models on billions of images is computationally expensive, so faster convergence directly reduces cost.
  • Weakly supervised pretraining from random initialization leaves headroom on low-shot and zero-shot transfer tasks.
  • Prior work assumed MAE (Masked Autoencoder) scales primarily with model size, potentially underestimating its utility on web-scale datasets.
Concrete Example: A standard ViT-Large model trained directly on Instagram-3B hashtags (WSP) achieves lower accuracy on fine-grained tasks like iNaturalist than a model that first learns to reconstruct masked Instagram images (MAE) and then trains on the hashtags.
Key Novelty
MAE-based Pre-pretraining (MAWS)
  • Introduces an initial 'pre-pretraining' stage using Masked Autoencoders (MAE) on the same billion-scale dataset used for later supervision, requiring no labels.
  • Demonstrates that MAE scales not just with model size (as previously thought) but also with dataset size (up to billions of images).
  • Combines the benefits of self-supervised structure learning (MAE) with the semantic alignment of weakly supervised pretraining (WSP).
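As a rough illustration of the two stages described above, the following Python sketch shows (a) MAE-style random patch masking and (b) the handoff of the encoder from the MAE stage to the WSP stage. The 75% mask ratio and the 196-patch grid (a 224×224 image with 16×16 patches) are assumptions taken from the original MAE setup and are not stated in this summary. All function names and the toy stage callables are hypothetical placeholders, not the authors' code.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) lists.

    In MAE-style training the encoder sees only the visible patches and
    a decoder is asked to reconstruct the masked ones. No labels are used.
    """
    num_masked = int(round(num_patches * mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def pre_pretrain_then_pretrain(encoder, mae_stage, wsp_stage):
    """Run stage 1 (MAE), then stage 2 (WSP) initialized from stage 1's output."""
    encoder = mae_stage(encoder)  # stage 1: label-free reconstruction
    return wsp_stage(encoder)     # stage 2: weak supervision (e.g. hashtags)


# Toy stand-ins (hypothetical): a real stage would update model weights.
# Here each stage only records that it ran, to make the ordering visible.
def toy_mae_stage(encoder):
    return {"history": encoder["history"] + ["mae"]}


def toy_wsp_stage(encoder):
    return {"history": encoder["history"] + ["wsp"]}


rng = random.Random(0)
# Assumed values: 196 patches, 75% masked (original MAE defaults).
visible, masked = random_patch_mask(196, 0.75, rng)
final_encoder = pre_pretrain_then_pretrain(
    {"history": []}, toy_mae_stage, toy_wsp_stage
)
print(len(visible), len(masked))
print(final_encoder["history"])
```

The point of the sketch is the ordering: the WSP stage never starts from random weights, it always receives whatever the MAE stage produced on the same unlabeled images.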
Evaluation Highlights
  • Achieves 84.0% Top-1 accuracy on iNaturalist-18 with ViT-6.5B, significantly outperforming previous MAE baselines.
  • Sets a new state-of-the-art on 1-shot ImageNet-1k classification with 63.6% accuracy.
  • A ViT-2B model initialized with MAE pre-pretraining outperforms a much larger ViT-6.5B model trained with weak supervision alone.
Breakthrough Assessment
8/10
Provides strong empirical evidence that self-supervised initialization matters even in billion-scale data regimes, challenging the assumption that massive weak supervision alone is sufficient.