PTQ: Post-Training Quantization—converting a model to low-precision after training is complete, using only a small calibration set
ViT: Vision Transformer—a neural network architecture for computer vision based on self-attention mechanisms
LayerNorm: Layer Normalization—a technique that normalizes each token's activations across the feature dimension, often producing sparse outliers in Transformers
GELU: Gaussian Error Linear Unit—an activation function used in ViTs that produces an asymmetric distribution of values
GEMM: General Matrix Multiplication—the standard operation for dense matrix math in neural networks
SpMM: Sparse Matrix Multiplication—matrix multiplication in which one operand is mostly zeros, so the zero entries can be skipped for efficiency
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another (asymmetric, so not a true distance), used here to minimize the difference between original and quantized attention scores
Per-Patch Quantization: Calculating quantization parameters (scale/zero-point) independently for each image patch vector rather than the whole tensor
MHA: Multi-Head Attention—the core component of Transformers that computes dependencies between different parts of the input
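
To make the Per-Patch Quantization entry concrete, here is a minimal pure-Python sketch that compares one shared scale/zero-point for the whole tensor against parameters computed per patch. It assumes asymmetric uniform 8-bit quantization with min/max calibration; the helper names (`quant_params`, `fake_quantize`, `max_abs_error`) and the toy patch values are illustrative assumptions, not taken from the source.

```python
def quant_params(values, bits=8):
    """Return (scale, zero_point) for asymmetric uniform quantization.

    Assumption: simple min/max calibration; real PTQ methods may use
    percentiles or a search over candidate ranges instead.
    """
    lo = min(min(values), 0.0)  # widen the range so 0.0 is representable
    hi = max(max(values), 0.0)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-lo / scale)
    return scale, zero_point


def fake_quantize(values, scale, zero_point, bits=8):
    """Quantize to integers in [0, 2**bits - 1], then map back to floats."""
    qmax = 2 ** bits - 1
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, 0), qmax)
        out.append((q - zero_point) * scale)
    return out


def max_abs_error(reference, approx):
    """Largest absolute element-wise difference between two vectors."""
    return max(abs(r - a) for r, a in zip(reference, approx))


# Two hypothetical patch vectors with very different dynamic ranges.
patches = [
    [0.01, -0.02, 0.03, 0.00],
    [12.0, -8.0, 30.0, 5.0],
]

# Per-tensor: one (scale, zero_point) shared by every patch.
flat = [v for patch in patches for v in patch]
tensor_scale, tensor_zp = quant_params(flat)
per_tensor = [fake_quantize(p, tensor_scale, tensor_zp) for p in patches]

# Per-patch: each patch vector gets its own (scale, zero_point).
per_patch = [fake_quantize(p, *quant_params(p)) for p in patches]

# Compare reconstruction error on the small-range patch.
err_tensor = max_abs_error(patches[0], per_tensor[0])
err_patch = max_abs_error(patches[0], per_patch[0])
print(f"per-tensor error: {err_tensor:.6f}, per-patch error: {err_patch:.6f}")
```

In this toy setup the large-range patch dictates the shared scale, so the small-range patch collapses onto a single quantization level under per-tensor parameters, while per-patch parameters keep its values distinct.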