ViT: Vision Transformer, a model architecture based on self-attention rather than convolutions; it applies the Transformer, originally designed for NLP, to images
MHSA: Multi-Head Self-Attention—a mechanism allowing the model to jointly attend to information from different representation subspaces at different positions
SHSA: Single-Head Self-Attention—the proposed module that uses one attention head on a subset of channels to reduce redundancy
Patchify Stem: The initial layers of a ViT that split the input image into patches and convert them into a sequence of patch embeddings (tokens)
Depthwise Convolution: A convolution that applies a single filter per input channel, reducing computational cost compared to standard convolution
MetaFormer: A generalized architecture abstracting the specific token mixer (attention, pooling, etc.) from the overall Transformer block structure
AP: Average Precision—a common metric for object detection and segmentation accuracy
ONNX: Open Neural Network Exchange, an open format for representing machine learning models, often used to export models to optimized inference runtimes
Throughput: The number of images a model can process per second
Inductive bias: Assumptions built into a learning algorithm (like spatial locality in CNNs) that help it learn effectively with less data
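
As a rough illustration of the cost reduction mentioned in the Depthwise Convolution entry, the sketch below counts weights and multiply-accumulate operations for a standard convolution and for a depthwise one. The helper name conv_cost and the layer sizes (64 channels, 3x3 kernel, 56x56 output map) are assumptions chosen for illustration, not values taken from the source; bias terms are ignored.

```python
def conv_cost(c_in, c_out, k, h_out, w_out, groups=1):
    """Return (weights, multiply-accumulates) for a k x k convolution without bias.

    groups=1 is a standard convolution; groups=c_in (with c_out == c_in)
    is a depthwise convolution, where each input channel has its own filter.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    weights = (c_in // groups) * c_out * k * k
    # Every weight is applied once per output position.
    macs = weights * h_out * w_out
    return weights, macs


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_cost(64, 64, 3, 56, 56)
depthwise = conv_cost(64, 64, 3, 56, 56, groups=64)

print("standard  (weights, MACs):", standard)
print("depthwise (weights, MACs):", depthwise)
print("cost ratio:", standard[1] // depthwise[1])
```

Under these assumptions the depthwise layer needs a factor of c_in fewer weights and operations than the standard one, at the price of not mixing information across channels; that mixing is usually restored by a separate 1x1 (pointwise) convolution.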