MAE: Masked Autoencoder—a self-supervised method that masks a high percentage (e.g., 75%) of the image and trains the model to reconstruct the missing pixels.
WSP: Weakly Supervised Pretraining—training models using noisy, naturally occurring labels like hashtags or captions found on the internet.
Pre-pretraining: An initial self-supervised training phase used to initialize model weights before the main weakly supervised pretraining (WSP) phase.
IG-3B: Instagram-3B—a proprietary dataset containing approximately 3 billion images with hashtag annotations.
ViT: Vision Transformer—a neural network architecture for computer vision based on the Transformer architecture used in NLP, processing images as sequences of patches.
LiT: Locked-image Tuning—a method to align a frozen image encoder with a text encoder for zero-shot classification.
Linear Probe: Evaluating a pretrained model by freezing its weights and training a simple linear classifier on top.
1-shot classification: Evaluating the model's ability to classify images given only one labeled example per class.
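
As a rough illustration of the MAE and ViT entries above, the following Python sketch splits a toy image into non-overlapping patches and hides 75% of them at random. The helper names (patchify, random_mask), the 8x8 toy image, the 2x2 patch size, and the fixed seed are assumptions made for this example, not details of any particular implementation. A real MAE would pass only the visible patches to a ViT encoder and train a decoder to reconstruct the pixels of the masked ones.

```python
import random


def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, i.e. as the patch sequence a ViT would see.
    """
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must divide evenly"
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][left:left + patch] for r in range(top, top + patch)]
            tiles.append(tile)
    return tiles


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Pick which patch indices stay visible and which are masked.

    mask_ratio=0.75 mirrors the example ratio in the glossary; seed is a
    hypothetical parameter added only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Toy 8x8 "image" whose pixel values are just their flat index.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = patchify(image, patch=2)            # 16 patches of 2x2 pixels
visible, masked = random_mask(len(patches))   # 4 visible, 12 masked

# The encoder input would be the visible patches only; the masked
# patches serve as reconstruction targets.
encoder_input = [patches[i] for i in visible]
targets = [patches[i] for i in masked]
```

With a 75% mask ratio the encoder processes only a quarter of the patch sequence, which is the property that makes this kind of masking cheap to run at scale.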