SSL: Self-Supervised Learning—learning representations from unlabeled data by solving pretext tasks like matching different views of the same image.
View: An augmented version of an original image, created by applying transformations like cropping, resizing, and color distortion.
Discriminative Learning: Approaches that learn representations either by distinguishing positive pairs (views of the same image) from negative pairs (views of different images), or by attracting positive pairs alone.
SimSiam: A non-contrastive SSL method that maximizes similarity between two views of an image using a Siamese network with a stop-gradient operation, without negative pairs.
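A minimal NumPy sketch of the symmetrized SimSiam objective, assuming predictor outputs `p1`, `p2` and projector outputs `z1`, `z2` for the two views (function names are illustrative, not from the paper's code). In an autodiff framework the stop-gradient would be `z.detach()`; here it is only a comment:

```python
import numpy as np

def neg_cosine(p, z):
    # z is treated as a constant target (stop-gradient): no gradient
    # would flow through it in an autodiff framework.
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return -np.sum(p * z, axis=-1).mean()

def simsiam_loss(p1, p2, z1, z2):
    # Symmetrized loss: each view's prediction targets the other view's projection.
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

When both views map to identical representations the loss reaches its minimum of -1.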
DINO: Self-distillation with no labels—an SSL method using Vision Transformers where a student network predicts the output of a teacher network whose weights are an exponential moving average (momentum encoder) of the student's.
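The core of the student-teacher objective can be sketched in NumPy as a cross-entropy between softened output distributions, with the teacher sharpened by a lower temperature and treated as a fixed target (temperatures and the omission of DINO's centering term are simplifying assumptions here, not the full method):

```python
import numpy as np

def softmax(x, temp):
    # Temperature-scaled softmax over the last axis.
    x = x / temp
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    # Teacher distribution is sharpened (lower temperature) and acts as a
    # constant target; the student is trained to match it via cross-entropy.
    t = softmax(teacher_logits, t_teacher)
    log_s = np.log(softmax(student_logits, t_student))
    return -(t * log_s).sum(axis=-1).mean()

def ema_update(teacher_params, student_params, m=0.996):
    # Momentum teacher: teacher <- m * teacher + (1 - m) * student.
    return [m * t + (1 - m) * s for t, s in zip(teacher_params, student_params)]
```

With identical uniform logits the loss equals the entropy of the teacher distribution, log(K) for K output dimensions.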
iBOT: Image BERT pre-training with Online Tokenizer—an SSL method combining masked image modeling with self-distillation.
SimCLR: A simple framework for contrastive learning of visual representations that maximizes agreement between differently augmented views of the same data example.
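A minimal NumPy sketch of SimCLR's NT-Xent (normalized temperature-scaled cross-entropy) loss for a batch of paired views, assuming the other 2N-2 examples in the batch serve as negatives (this is a didactic reimplementation, not the paper's code):

```python
import numpy as np

def nt_xent(z1, z2, temp=0.5):
    # z1, z2: embeddings of two views of the same N images, shape (N, d).
    z1 = z1 / np.linalg.norm(z1, axis=-1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=-1, keepdims=True)
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)        # (2N, d)
    sim = z @ z.T / temp                        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    # The positive for row i is the other view of the same image: i + N (mod 2N).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return -(sim[np.arange(2 * n), pos] - logsumexp).mean()
```

The loss is a softmax cross-entropy that pulls the two views of each image together while pushing them away from every other example in the batch.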
ViT: Vision Transformer—a model architecture based on the Transformer, applied to sequences of image patches instead of text tokens.
RRC: Random Resized Crop—a standard data augmentation technique in computer vision that crops a random region of an image and resizes it to a fixed output size.
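A simplified NumPy sketch of the idea behind random resized crop, assuming only area scaling and nearest-neighbour resizing (library implementations such as torchvision's additionally jitter the aspect ratio and use proper interpolation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(img, out_size, scale=(0.2, 1.0)):
    # img: (H, W, C) array. Sample a square crop covering a random
    # fraction of the image area in `scale`, then resize it to
    # (out_size, out_size) by nearest-neighbour index sampling.
    h, w = img.shape[:2]
    area = rng.uniform(*scale) * h * w
    side = min(int(np.sqrt(area)), h, w)
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    crop = img[top:top + side, left:left + side]
    idx = (np.arange(out_size) * side / out_size).astype(int)
    return crop[idx][:, idx]
```

In SSL pipelines, applying this twice to the same image (together with color distortion) yields the two views fed to the Siamese branches.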