PTQ: Post-Training Quantization—compressing a model after training without a full retraining process
Layer-wise quantization: Quantizing a model one layer at a time, minimizing each layer's output error independently; far cheaper computationally than jointly optimizing all layers at once
Hessian: Second-order derivative matrix used in optimization; in layer-wise PTQ it is usually approximated as H ≈ XXᵀ (up to a constant factor), where X holds the layer's inputs collected from calibration data, and it measures how sensitive the layer's output is to each weight
Calibration dataset: A small set of real data used to guide the quantization process and estimate statistics, avoiding the need for the full training set
GPTQ: Generative Pre-trained Transformer Quantization—a popular layer-wise PTQ method that quantizes weights column by column, using inverse-Hessian information to compensate each column's rounding error in the not-yet-quantized weights
AWQ: Activation-aware Weight Quantization—a method that protects salient weights by scaling them before quantization
QuIP: Quantization with Incoherence Processing—a method that multiplies weights (and the Hessian) by random orthogonal matrices, making them incoherent and suppressing outliers before quantization
MLP: Multilayer Perceptron—fully connected layers within Transformer blocks, often containing the bulk of parameters
Overfitting: In this context, fitting quantized weights so closely to the small calibration dataset that accuracy degrades on unseen data
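To show how several of these terms fit together—layer-wise quantization, the Hessian approximation XXᵀ, and a calibration dataset—here is a minimal GPTQ-style sketch in NumPy. It is a simplified illustration under stated assumptions, not any library's implementation: it uses a single per-tensor scale, no grouping or clipping, and a plain matrix inverse instead of the Cholesky-based updates real implementations use; all function names are made up for this example.

```python
import numpy as np

def quantize_rtn(w, scale):
    """Round-to-nearest onto a uniform grid with step size `scale`."""
    return np.round(w / scale) * scale

def gptq_quantize_layer(W, X, scale, damp=0.01):
    """Quantize one layer's weights W (rows x cols) column by column.

    X (cols x n_samples) holds calibration inputs; it builds the
    Hessian approximation H = X @ X.T, whose inverse weights how each
    column's rounding error is spread onto later, not-yet-quantized
    columns (the GPTQ-style error-compensation update).
    """
    H = X @ X.T
    # Dampen the diagonal so the inverse is well-conditioned.
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])
    Hinv = np.linalg.inv(H)
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    for j in range(W.shape[1]):
        q = quantize_rtn(W[:, j], scale)
        Q[:, j] = q
        # Rounding error of column j, scaled by the inverse Hessian
        # diagonal, then propagated to the remaining columns.
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q

# Tiny run on random data standing in for real calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(16, 64))   # small "calibration dataset"
scale = 0.25
Q = gptq_quantize_layer(W, X, scale)
```

Comparing `np.linalg.norm(W @ X - Q @ X)` against the same error for plain round-to-nearest (`quantize_rtn(W, scale)`) shows the point of the inverse-Hessian update: it minimizes the layer's *output* error on the calibration inputs, not the weight error itself.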