PTQ: Post-Training Quantization—compressing a model using a small calibration set without full retraining
LayerNorm: Layer Normalization, a technique that normalizes neuron activations across the feature dimension; in Transformers, the activations it processes often contain extreme outliers
Softmax: Activation function converting scores to probabilities; in Transformers, its outputs follow a roughly power-law distribution in which a few values dominate
GPTQ: A state-of-the-art weight quantization method that quantizes weights layer by layer, adjusting them to minimize each layer's output error
Scale Reparameterization: Mathematically transforming model weights/biases to change the quantization scale requirements without altering the output
Channel-wise quantization: Using a separate quantization scale for each channel (accurate but expensive)
Layer-wise quantization: Using a single quantization scale for an entire layer (efficient but less accurate)
Log2 quantization: Quantization where levels are powers of 2, allowing multiplication to be replaced by bit-shifts
mAP: Mean Average Precision—a key metric for object detection performance
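
As a rough illustration of the Layer-wise, Channel-wise, and Log2 quantization entries above, the following Python sketch compares the three on toy data. It is not taken from any particular method: the function names are hypothetical, the symmetric max-abs choice of scale is an assumption, and each row of the toy matrix is assumed to stand for one channel.

```python
import math

def quantize_uniform(values, scale, bits=8):
    """Symmetric uniform quantization: round to the integer grid, clip, dequantize."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def max_abs_scale(values, bits=8):
    """Assumed scale rule: map the largest magnitude onto the largest integer level."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def layer_wise(matrix, bits=8):
    """One scale shared by every channel (row) in the layer."""
    scale = max_abs_scale([v for row in matrix for v in row], bits)
    return [quantize_uniform(row, scale, bits) for row in matrix]

def channel_wise(matrix, bits=8):
    """A separate scale for each channel (row)."""
    return [quantize_uniform(row, max_abs_scale(row, bits), bits) for row in matrix]

def mean_abs_error(original, approx):
    diffs = [abs(a - b) for ro, ra in zip(original, approx) for a, b in zip(ro, ra)]
    return sum(diffs) / len(diffs)

def log2_quantize(p, bits=4):
    """Log2 quantization of a value in (0, 1]: store q so that p is roughly 2 ** -q."""
    qmax = 2 ** bits - 1
    if p <= 0:
        return qmax  # smallest representable level
    return max(0, min(qmax, round(-math.log2(p))))

def multiply_by_shift(x_int, q):
    """Multiplying an integer by 2 ** -q reduces to a right bit-shift."""
    return x_int >> q

# Toy layer: channel 0 holds tiny values, channel 1 holds an outlier-sized range.
weights = [[0.01, -0.02, 0.015],
           [10.0, -8.0, 5.0]]

err_layer = mean_abs_error(weights, layer_wise(weights))
err_channel = mean_abs_error(weights, channel_wise(weights))
print("layer-wise error:  ", err_layer)
print("channel-wise error:", err_channel)

# A probability of 0.25 is stored as q = 2, so multiplying by it is a shift by 2.
q = log2_quantize(0.25)
print("q =", q, "| 64 * 0.25 via shift =", multiply_by_shift(64, q))
```

With a single shared scale, the large channel dictates the step size and the small channel rounds to zero, which is why the layer-wise error comes out higher here. The per-channel variant avoids this at the cost of storing and applying one scale per channel.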