PTQ: Post-Training Quantization—compressing a model after training using only a small calibration dataset, without full retraining
ViT: Vision Transformer—a neural network architecture for computer vision based on the Transformer mechanism, using image patches as tokens
LoRA: Low-Rank Adaptation—a technique to fine-tune models by adding small, low-rank matrices to existing weights rather than updating all parameters
Softmax: A mathematical function that converts a vector of real numbers into a vector of probabilities (non-negative values that sum to 1), used in Transformers to turn raw attention scores into attention weights
Log2 Quantizer: A quantization method that maps values onto a base-2 logarithmic scale, so the representable levels are powers of two; often used for data with long-tail distributions like Softmax outputs
NAS: Neural Architecture Search—automating the design of neural network architectures (here used to find the optimal rank for compensation matrices)
MHSA: Multi-Head Self-Attention—the core component of Transformers that allows the model to attend to different parts of the input sequence simultaneously
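
To make the Softmax and Log2 Quantizer entries concrete, here is a minimal sketch in plain Python. It assumes one common formulation of a log2 quantizer, in which a probability p is stored as the integer code k = round(-log2(p)), clipped to the available bit width, and recovered as 2^-k. The function names and the 4-bit default are illustrative assumptions, not taken from the source.

```python
import math

def softmax(scores):
    """Convert raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log2_quantize(p, bits=4):
    """Map a probability in [0, 1] to an integer code k so that p is roughly 2**-k.

    Assumed formulation: k = round(-log2(p)), clipped to [0, 2**bits - 1].
    """
    max_code = 2 ** bits - 1
    if p <= 0.0:
        return max_code  # zero has no finite log; use the smallest level
    code = round(-math.log2(p))
    return min(max(code, 0), max_code)

def log2_dequantize(code):
    """Recover the power-of-two approximation of the original value."""
    return 2.0 ** -code

probs = softmax([4.0, 2.0, 1.0, 0.0])        # long-tailed: one large weight, several small ones
codes = [log2_quantize(p) for p in probs]     # [0, 3, 5, 6]
approx = [log2_dequantize(k) for k in codes]  # [1.0, 0.125, 0.03125, 0.015625]
```

Because the levels are powers of two, the small probabilities in the tail keep distinct codes, whereas a uniform 4-bit grid with step 1/15 would round all three of the smaller weights here to 0 or to the same lowest nonzero level.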