PTQ: Post-Training Quantization—compressing a model after training using only a small calibration dataset, without full retraining.
ViT: Vision Transformer—a model architecture based on self-attention mechanisms applied to sequences of image patches.
MHSA: Multi-Head Self-Attention—the core component of Transformers that computes relationships between different parts of the input.
LayerNorm: Layer Normalization, a technique that normalizes activations across the feature dimension. Its outputs are typically hard to quantize because of high inter-channel variance.
Log2 Quantizer: A quantization scheme that maps values to powers of 2, often used for long-tail distributions like Softmax outputs.
SULQ: Shift-Uniform-Log2 Quantizer, the proposed quantizer that adds a shift to its inputs before the log2 transform so that the quantization levels cover the input domain more fully.
SOS: Smooth Optimization Strategy, the proposed three-stage optimization pipeline designed to avoid poor local minima when tuning quantization parameters.
Reparameterization: Mathematically transforming model parameters (e.g., merging scales into weights) to change the model's structure or quantization scheme without altering its output.
Block-wise reconstruction: Optimizing quantization parameters by minimizing the error between the output of a quantized block and the original full-precision block.
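
To make the Log2 Quantizer and SULQ entries above more concrete, here is a minimal Python sketch. The plain log2 quantizer follows the usual definition (round -log2(x), then clip to the available codes). The shifted variant is only one possible reading of "shift, then log2, then uniform quantization" and is not taken from the paper: the function names, the shift parameter `eta`, and the way the uniform scale is derived from the range [0, 1] are all assumptions made for illustration.

```python
import math


def log2_quantize(x, bits=4):
    """Plain log2 quantizer: map x in (0, 1] to an integer code q with x ~ 2**-q."""
    qmax = 2 ** bits - 1
    q = round(-math.log2(x))          # x must be > 0; log2(0) is undefined
    return max(0, min(qmax, q))       # clip to the representable codes


def log2_dequantize(q):
    """Recover the power-of-two approximation of the original value."""
    return 2.0 ** (-q)


def shifted_log2_quantize(x, eta, bits=4):
    """Hypothetical shifted variant (an assumption, not the paper's formula).

    Shift x in [0, 1] by eta > 0, take -log2, then quantize the log-domain
    value uniformly between its extremes at x = 1 and x = 0.
    """
    qmax = 2 ** bits - 1
    lo = -math.log2(1.0 + eta)        # log-domain value at x = 1
    hi = -math.log2(eta)              # log-domain value at x = 0
    scale = (hi - lo) / qmax          # uniform step size in the log domain
    q = round((-math.log2(x + eta) - lo) / scale)
    return max(0, min(qmax, q)), scale, lo


def shifted_log2_dequantize(q, scale, lo, eta):
    """Invert the uniform step, then the log2 transform, then the shift."""
    return 2.0 ** (-(q * scale + lo)) - eta


if __name__ == "__main__":
    for value in (1.0, 0.3, 0.01):
        code = log2_quantize(value)
        print(value, "->", code, "->", log2_dequantize(code))

    eta = 0.01  # illustrative shift value, chosen arbitrarily
    for value in (0.0, 0.3, 1.0):
        code, scale, lo = shifted_log2_quantize(value, eta)
        print(value, "->", code, "->", shifted_log2_dequantize(code, scale, lo, eta))
```

In this sketch the plain quantizer cannot encode an input of exactly 0 and reconstructs only powers of two, whereas the shifted variant maps both endpoints of [0, 1] onto quantization codes. Whether this matches the exact behavior of SULQ depends on details not given in the glossary.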