The effectiveness of MAE pre-pretraining for billion-scale pretraining

📝 Paper Summary

Visual Representation Learning, Self-Supervised Learning, Foundation Models

A two-stage training strategy that initializes Vision Transformers with self-supervised Masked Autoencoding (MAE) on billion-scale data before weakly supervised pretraining, significantly improving convergence and downstream performance.

Core Problem

State-of-the-art vision models typically rely on either self-supervised learning or weakly supervised pretraining (WSP) on noisy data. WSP from random initialization, however, can be inefficient and may fail to exploit the full structure of visual data.

Why it matters:

Training foundation models on billions of images is computationally expensive, so improved convergence speeds reduce costs.
Standard weakly supervised pretraining often struggles with low-shot or zero-shot transfer tasks compared to self-supervised methods.
Prior work assumed MAE (Masked Autoencoder) scales primarily with model size, potentially underestimating its utility on web-scale datasets.

Concrete Example: A standard ViT-Large model trained directly on Instagram-3B hashtags (WSP) achieves lower accuracy on fine-grained tasks like iNaturalist compared to a model that first learns to reconstruct masked Instagram images (MAE) and *then* trains on the hashtags.

Key Novelty

MAE-based Pre-pretraining (MAWS)

Introduces an initial 'pre-pretraining' stage using Masked Autoencoders (MAE) on the same billion-scale dataset used for later supervision, requiring no labels.
Demonstrates that MAE scales not just with model size (as previously thought) but also with dataset size (up to billions of images).
Combines the benefits of self-supervised structure learning (MAE) with the semantic alignment of weakly supervised pretraining (WSP).

Evaluation Highlights

Achieves 84.0% Top-1 accuracy on iNaturalist-18 with ViT-6.5B, outperforming previous MAE baselines significantly.
Sets a new state-of-the-art on 1-shot ImageNet-1k classification with 63.6% accuracy.
A ViT-2B model initialized with MAE pre-pretraining outperforms a much larger ViT-6.5B model trained with weak supervision alone.

Breakthrough Assessment

8/10

Provides strong empirical evidence that self-supervised initialization matters even at billion-scale data regimes, challenging the assumption that massive weak supervision is sufficient.

⚙️ Technical Details

Problem Definition

Setting: Pretraining large-scale visual encoders (Vision Transformers) on web-scale image data with noisy labels.

Inputs: A set of images I and associated noisy text/hashtag labels Y (e.g., Instagram-3B).

Outputs: A pretrained visual encoder f(x) transferable to downstream tasks.

Pipeline Flow

Stage 1: MAE Pre-pretraining (Unsupervised on IG-3B)
Stage 2: WSP Pretraining (Supervised on IG-3B using hashtags)
Stage 3: Downstream Finetuning (Transfer Learning)

System Modules

MAE Encoder

Learn structural visual representations by reconstructing masked images.

Model or implementation: Vision Transformer (ViT-B to ViT-6.5B)

WSP Encoder

Align visual features with semantic classes using noisy hashtag labels.

Model or implementation: Vision Transformer (Initialized from MAE Encoder weights)

Novel Architectural Elements

Sequential pretraining pipeline: MAE (structure) → WSP (semantics) on the *same* billion-scale dataset.
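
A minimal sketch of the weight handoff between the two stages, assuming the MAE decoder is discarded after Stage 1 (as in the original MAE recipe) and a new classification head is added for the hashtag classes. The key prefixes (`encoder.`, `decoder.`, `head.`) and the helper `init_wsp_from_mae` are hypothetical and are not taken from the released code.

```python
def init_wsp_from_mae(mae_state, num_classes, embed_dim):
    """Build WSP initial weights from an MAE checkpoint (illustrative only).

    Assumption: parameters live in a flat dict whose keys start with
    "encoder." or "decoder."; these names are hypothetical.
    """
    # Keep only the encoder; the reconstruction decoder is not reused.
    wsp_state = {k: v for k, v in mae_state.items() if k.startswith("encoder.")}
    # Add a new linear head over the hashtag classes (zeros for simplicity).
    wsp_state["head.weight"] = [[0.0] * embed_dim for _ in range(num_classes)]
    wsp_state["head.bias"] = [0.0] * num_classes
    return wsp_state


# Toy checkpoint with one encoder tensor and one decoder tensor.
mae_checkpoint = {
    "encoder.blocks.0.weight": [0.1, 0.2],
    "decoder.blocks.0.weight": [0.3, 0.4],
}
wsp_init = init_wsp_from_mae(mae_checkpoint, num_classes=3, embed_dim=2)
print(sorted(wsp_init))
```

The point of the sketch is that Stage 2 starts from the encoder learned in Stage 1 rather than from random weights; only the head is new.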

Modeling

Base Model: Vision Transformer (ViT) variants: ViT-B (86M), ViT-L (307M), ViT-H (632M), ViT-2B (1.9B), ViT-6.5B (6.5B).

Training Method: MAE Pre-pretraining followed by Weakly Supervised Pretraining (WSP)

Objective Functions:

Purpose: Learn visual structure.

Formally: MSE reconstruction loss on masked patches (MAE stage).

Purpose: Learn semantic categories.

Formally: Multi-label Cross-Entropy loss on hashtag targets (WSP stage).
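
The two objectives can be sketched in plain Python as follows. This summary names the losses but does not spell them out, so the details are assumptions: the MSE is averaged over masked patches only (as in the original MAE), and the multi-label cross-entropy is taken to be a softmax cross-entropy against a uniform 1/K target over an image's K hashtags. The function names are hypothetical.

```python
import math


def masked_mse(pred, target, mask):
    """MSE averaged over masked patches only.

    pred, target: lists of patches, each patch a list of pixel values.
    mask: list of 0/1 flags; 1 means the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0


def soft_target_cross_entropy(logits, label_ids):
    """Cross-entropy against a uniform distribution over the given labels.

    Assumption: each of the K hashtags of an image gets target mass 1/K.
    """
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return -sum(log_probs[i] for i in label_ids) / len(label_ids)


# Toy values: two patches (first one masked), four classes with two hashtags.
recon_loss = masked_mse([[1.0, 1.0], [5.0, 5.0]], [[0.0, 0.0], [0.0, 0.0]], [1, 0])
wsp_loss = soft_target_cross_entropy([0.0, 0.0, 0.0, 0.0], [0, 1])
print(recon_loss, wsp_loss)
```

In the toy call the unmasked second patch does not contribute to `recon_loss`, which is the behavior the MAE stage relies on.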

Adaptation: Full model finetuning for downstream tasks.

Training Data:

Instagram-3B (IG-3B): 3 billion unique images, resampled to 5 billion total, 28k classes mapped from hashtags.

Key Hyperparameters:

mae_masking_ratio: 0.75
mae_epochs: 1 epoch on IG-3B
wsp_epochs: 1 epoch on IG-3B
resolution: 224x224 (pretraining)
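
To make the masking ratio concrete, the sketch below draws a random patch mask for one 224x224 image. The patch size of 16 is an assumption for illustration (it is not listed above), and `random_patch_mask` is a hypothetical helper.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets at the given ratio."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    keep = sorted(order[:num_keep])    # patches the encoder sees
    masked = sorted(order[num_keep:])  # patches to be reconstructed
    return keep, masked


image_size, patch_size = 224, 16  # patch size is an assumed value
num_patches = (image_size // patch_size) ** 2
keep, masked = random_patch_mask(num_patches, mask_ratio=0.75, seed=0)
print(num_patches, len(keep), len(masked))
```

Under these assumptions the encoder processes only a quarter of the patches per image, which is a large part of why the MAE stage is cheap relative to full-image training.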

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. SWAG: Adds the MAE pre-pretraining stage; shows improved convergence and low-shot performance.
vs. MAE (Original): Scales MAE to billion-scale data (IG-3B) instead of just ImageNet-1k.
vs. CoCa: Uses a simple sequential pipeline (MAE then WSP) rather than a joint contrastive/generative loss.

Limitations

Relies on a proprietary dataset (Instagram-3B) which is not public.
Requires two stages of training, which increases total training wall-clock time compared to single-stage WSP (though convergence is faster in terms of epochs).
Discrepancy noted between abstract claims (91.7% iNat18) and table values (84.0% iNat18), potentially due to metric definitions (e.g., Top-5 vs Top-1).

Reproducibility

Code: https://github.com/facebookresearch/maws

The Instagram-3B (IG-3B) dataset is proprietary and not released, so full reproduction is limited to Meta researchers. However, experiments on ImageNet-21k are provided as a reproducible proxy.

📊 Experiments & Results

Evaluation Setup

Pretrain on IG-3B, then transfer to various downstream tasks via finetuning, linear probing, or zero-shot transfer (via LiT).

Benchmarks:

ImageNet-1k (IN1k) (Image Classification)
iNaturalist-18 (iNat18) (Fine-grained Classification)
LVIS (Long-tailed Object Detection)
Kinetics-400 (K400) (Video Action Recognition)

Metrics:

Top-1 Accuracy
AP (Average Precision) for box/mask
1-shot / 5-shot Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Classification results on ImageNet-1k and iNaturalist-18 showing the scaling behavior of MAE->WSP (MAWS).
ImageNet-1k	Top-1 Accuracy	88.6	89.3	+0.7
iNaturalist-18	Top-1 Accuracy	81.1	82.3	+1.2
ImageNet-1k	Top-1 Accuracy	85.8	87.0	+1.2
Low-shot classification results demonstrating the label efficiency of the pre-pretrained representations.
ImageNet-1k (1-shot)	Top-1 Accuracy	59.4	57.1	-2.3
ImageNet-1k (1-shot)	Top-1 Accuracy	Not reported in the paper	63.6	Not reported in the paper
LVIS	AP_box	47.1	50.8	+3.7
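
The Δ column is the "This Paper" value minus the "Baseline" value, in points. A short check of that arithmetic on the rows that report both numbers (values copied from the table; the row labels are only for display):

```python
# (benchmark, baseline, this paper) for rows with both values reported.
rows = [
    ("ImageNet-1k", 88.6, 89.3),
    ("iNaturalist-18", 81.1, 82.3),
    ("ImageNet-1k", 85.8, 87.0),
    ("ImageNet-1k (1-shot)", 59.4, 57.1),
    ("LVIS AP_box", 47.1, 50.8),
]

# Round to one decimal to avoid floating-point noise in the differences.
deltas = [round(ours - baseline, 1) for _, baseline, ours in rows]
for (name, _, _), delta in zip(rows, deltas):
    print(f"{name}: {delta:+.1f}")
```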

Experiment Figures

Transfer performance of ViT-L trained with MAE, WSP, and MAE->WSP across 10 tasks.

Scaling behavior of MAE->WSP vs WSP for models ranging from 0.1B to 6.5B parameters.

Main Takeaways

MAE pre-pretraining scales with dataset size (Instagram-3B vs ImageNet-1k) and model size (up to 6.5B parameters), contrary to prior beliefs that MAE only scales with model size.
Pre-pretraining consistently improves convergence speed; a model pre-pretrained for 0.1 epochs matches the performance of a randomly initialized model trained for significantly longer.
The method is particularly effective for transfer tasks like object detection (LVIS) and low-shot classification, where pure weakly supervised pretraining often underperforms.
A 2B parameter model using MAE->WSP outperforms a 6.5B parameter model using only WSP, highlighting parameter efficiency gains.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT) architecture
Masked Autoencoders (MAE)
Weakly Supervised Learning vs. Self-Supervised Learning
Linear probing vs. Finetuning evaluation

Key Terms

MAE: Masked Autoencoder—a self-supervised method that masks a high percentage (e.g., 75%) of the image and trains the model to reconstruct the missing pixels.

WSP: Weakly Supervised Pretraining—training models using noisy, naturally occurring labels like hashtags or captions found on the internet.

Pre-pretraining: An initial unsupervised training phase used to initialize model weights before the main pretraining (WSP) phase.

IG-3B: Instagram-3B—a proprietary dataset containing approximately 3 billion images with hashtag annotations.

ViT: Vision Transformer—a neural network architecture for computer vision based on the Transformer architecture used in NLP, processing images as sequences of patches.

LiT: Locked-image Tuning—a method to align a frozen image encoder with a text encoder for zero-shot classification.

Linear Probe: Evaluating a pretrained model by freezing its weights and training a simple linear classifier on top.

1-shot classification: Evaluating the model's ability to classify images given only one labeled example per class.
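
As an illustration of the k-shot setting, the sketch below selects k labeled examples per class from a labeled pool. It is a generic sampler written for this summary, not the paper's evaluation protocol; `sample_k_shot` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(labels, k=1, seed=0):
    """Return sorted indices with exactly k examples per class.

    Raises ValueError if any class has fewer than k examples.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], k))
    return sorted(chosen)


labels = ["cat", "dog", "cat", "bird", "dog", "bird"]
subset = sample_k_shot(labels, k=1)
print([labels[i] for i in subset])
```

With k=1 the classifier is trained on a single labeled image per class, so the result depends almost entirely on the quality of the pretrained features.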