PTQ: Post-Training Quantization—compressing a model after training without full fine-tuning, usually using a small calibration dataset
Hessian: A matrix of second-order derivatives used to measure the curvature of the loss function; in quantization, it indicates how sensitive the loss is to changes in specific weights
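A minimal sketch of how Hessian curvature measures weight sensitivity, using a toy least-squares loss (all values here are illustrative, not from any quantization method): for L(w) = ||Xw - y||², the Hessian is exactly 2XᵀX, and at the optimum the second-order loss increase from perturbing weight i by ε is ½·H_ii·ε².

```python
import numpy as np

# Toy least-squares loss L(w) = ||X w - y||^2; its Hessian is 2 X^T X.
# Weights with large Hessian diagonal entries are more sensitive to
# quantization perturbations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

H = 2 * X.T @ X                 # exact Hessian of this quadratic loss
sensitivity = np.diag(H)        # per-weight curvature

# At the optimum the gradient is zero, so the loss increase from
# perturbing weight i by eps is approximately 0.5 * H_ii * eps^2.
eps = 0.01
delta_est = 0.5 * sensitivity * eps**2
print(delta_est)                # larger entry => more sensitive weight
```

In PTQ methods like OPTQ, this per-weight curvature is what drives which weights can be rounded aggressively and which cannot.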
BRECQ: Block Reconstruction Quantization—a state-of-the-art PTQ method that optimizes weights to reconstruct the output of a full neural network block (e.g., a Transformer block)
AdaRound: Adaptive Rounding—a method that learns whether to round weights up or down to minimize quantization error, rather than just rounding to the nearest integer
Kronecker product: An operation on two matrices that results in a block matrix; used here to decompose large Hessian matrices into smaller, manageable components
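A short sketch of why the Kronecker product helps here: `kron(A, B)` of an (m, n) and a (p, q) matrix is an (mp, nq) block matrix, so a large Hessian approximated as A ⊗ B can be stored and applied via the two small factors alone, using the identity (A ⊗ B) vec(X) = vec(B X Aᵀ).

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.eye(2)
K = np.kron(A, B)                # block matrix of shape (4, 4)
print(K.shape)

# (A ⊗ B) vec(X) = vec(B X A^T) with column-major (Fortran) vectorization,
# so the large matrix never needs to be materialized.
X = np.arange(4.0).reshape(2, 2)
lhs = K @ X.flatten(order="F")
rhs = (B @ X @ A.T).flatten(order="F")
assert np.allclose(lhs, rhs)
```

For a Hessian of size d² × d², the factored form reduces storage from d⁴ to 2d² entries, which is what makes second-order PTQ tractable at scale.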
INT2: 2-bit Integer Quantization—representing weights using only 2 bits (4 possible values), a highly aggressive compression level
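A minimal INT2 sketch, assuming a symmetric per-tensor scheme for illustration (actual methods typically use per-channel or per-group scales): floats are mapped to one of 4 integer levels in [-2, 1], then dequantized.

```python
import numpy as np

w = np.array([-1.0, -0.3, 0.1, 0.7, 1.2])   # illustrative weights
scale = np.abs(w).max() / 2                  # map the extreme value to level -2/+2
q = np.clip(np.round(w / scale), -2, 1).astype(np.int8)  # signed 2-bit range
w_hat = q * scale                            # dequantized approximation
print(q, w_hat)
```

With only 4 representable values, the reconstruction error is large, which is why INT2 generally requires reconstruction-based methods (AdaRound, BRECQ) rather than plain rounding.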
PPL: Perplexity—a measurement of how well a probability model predicts a sample; lower is better
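A worked sketch of the perplexity formula: PPL is the exponential of the average negative log-likelihood per token (equivalently, the inverse geometric mean of the token probabilities). The probabilities below are made up.

```python
import math

token_probs = [0.25, 0.5, 0.1, 0.8]   # model's probability for each true token
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(nll)
print(round(ppl, 3))                   # ~3.162, i.e. (0.25*0.5*0.1*0.8)**-0.25
```

A perfectly confident model (all probabilities 1.0) has PPL 1; quantization is judged by how little PPL rises relative to the full-precision model.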
OPTQ: A popular one-shot PTQ method that quantizes weights layer-by-layer using second-order information
Z-FOLD: A technique to effectively merge (fold) normalization parameters into weights to improve quantization resilience
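A generic sketch of the folding idea (this shows the basic algebra only, not Z-FOLD's specific procedure): a per-channel scale g applied before a linear layer can be absorbed into the weight columns, since W(g ⊙ x) = (W·diag(g))x, leaving one fewer source of quantization error.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 4))          # linear layer weights
g = rng.uniform(0.5, 2.0, size=4)    # e.g. a normalization scale parameter
x = rng.normal(size=4)

y_ref = W @ (g * x)      # scale applied to the input, then the layer
W_folded = W * g         # broadcast g over columns: W @ diag(g)
y_fold = W_folded @ x    # identical output with the scale folded in
assert np.allclose(y_ref, y_fold)
```

After folding, only `W_folded` needs to be quantized, and its per-channel dynamic range can be friendlier to low-bit grids.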
Zero-shot task: Evaluating a model on a task it wasn't explicitly trained for; used here to verify that reasoning capabilities survive quantization