VPT: Visual Prompt Tuning—injecting learnable tokens into the input sequence of a frozen ViT to adapt it to downstream tasks.
MAE: Masked Autoencoder—a self-supervised pre-training method that learns by reconstructing masked patches of an image.
SPT: Self-Prompt Tuning—the proposed method of initializing prompts using prototypes or samples from the target dataset's features.
prototypes: Representative feature vectors obtained by clustering the patch embeddings of the target dataset.
NMI: Normalized Mutual Information—a metric used here to measure the statistical dependence between prompt tokens and image patch tokens.
ViT: Vision Transformer—a model architecture that processes images as sequences of patch embeddings using self-attention mechanisms.
Full fine-tuning: Updating all parameters of a pre-trained model during adaptation to a new task.
patch tokens: The intermediate feature representations of image patches within the Transformer layers.
prompt tokens: Learnable vectors inserted into the Transformer input sequence to steer the model's behavior without changing its weights.
inertia: The sum of squared distances of samples to their closest cluster center, minimized during K-means clustering.
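
The "prototypes" and "inertia" entries can be illustrated with a minimal sketch of K-means (Lloyd's algorithm) in pure Python. It is not the authors' implementation. The function names (`squared_distance`, `kmeans_prototypes`), the farthest-point initialization, and the toy 2-D points are assumptions made for illustration; in SPT the inputs would be patch embeddings extracted from the target dataset.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_prototypes(features, k, iters=20):
    """Cluster `features` into k groups; return (prototypes, inertia).

    Hypothetical helper: uses a deterministic farthest-point
    initialization (an assumption, chosen to keep the sketch reproducible).
    """
    # Initialization: start from the first sample, then repeatedly add the
    # sample that lies farthest from all centers chosen so far.
    centers = [list(features[0])]
    while len(centers) < k:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centers),
        )
        centers.append(list(farthest))

    for _ in range(iters):
        # Assignment step: attach each sample to its closest center.
        clusters = [[] for _ in range(k)]
        for f in features:
            j = min(range(k), key=lambda c: squared_distance(f, centers[c]))
            clusters[j].append(f)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centers[j] = [
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                ]

    # Inertia: sum of squared distances of samples to their closest center.
    inertia = sum(
        min(squared_distance(f, c) for c in centers) for f in features
    )
    return centers, inertia


# Toy stand-in for patch embeddings: two well-separated groups in 2-D.
toy_features = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
prototypes, inertia = kmeans_prototypes(toy_features, k=2)
```

In this sketch each returned prototype is the mean of one cluster, so it has the same dimensionality as the input features and could serve as an initial value for one prompt token.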