PTQ: Post-Training Quantization, i.e., converting a trained neural network to a lower-precision (e.g., integer) representation without full retraining, using only a small calibration dataset.
Mixed Precision: Assigning different bit-widths (e.g., 4-bit, 8-bit, 16-bit) to different layers of a network based on their sensitivity to noise.
SQNR: Signal-to-Quantization-Noise Ratio, a metric used here to estimate a layer's sensitivity by comparing the power of the signal to the power of the error introduced by quantization. A higher SQNR means quantization adds relatively less noise.
BOPs: Bit Operations, a metric for model efficiency calculated by summing, over all layers, the number of MAC operations multiplied by the weight and activation bit-widths. It correlates with power consumption.
Quantizer Group: A set of weights and activations connected by shared operations (like element-wise add) that must share quantization parameters due to hardware constraints.
AdaRound: A specific PTQ algorithm that optimizes the rounding of weights (up or down) rather than just rounding to nearest, improving low-bit performance.
Pareto Frontier: The set of optimal solutions where no improvement can be made in one objective (e.g., accuracy) without sacrificing another (e.g., efficiency).
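
To make the SQNR and BOPs entries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular paper or library: the function names (quantize_uniform, sqnr_db, bops) are hypothetical, the quantizer is a simple symmetric round-to-nearest scheme, SQNR is reported in decibels, and each layer's BOPs are taken as MACs times weight bit-width times activation bit-width.

```python
import math

def quantize_uniform(values, num_bits):
    """Symmetric round-to-nearest quantization, returned in dequantized form.
    (Assumed scheme for illustration; real PTQ tools offer many variants.)"""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / qmax
    # Round to the integer grid, clamp to the signed range, map back to floats.
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]

def sqnr_db(signal, quantized):
    """Signal power divided by quantization-error power, in decibels."""
    signal_power = sum(v * v for v in signal)
    noise_power = sum((v - q) ** 2 for v, q in zip(signal, quantized))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)

def bops(layers):
    """Sum of MACs * weight_bits * activation_bits over all layers.
    `layers` is an iterable of (macs, weight_bits, act_bits) tuples."""
    return sum(macs * w_bits * a_bits for macs, w_bits, a_bits in layers)

# Toy "layer" values: a smooth synthetic signal, not real network weights.
weights = [math.sin(0.1 * i) for i in range(200)]
sqnr_8bit = sqnr_db(weights, quantize_uniform(weights, 8))
sqnr_4bit = sqnr_db(weights, quantize_uniform(weights, 4))

# Two hypothetical layers: one at W8A8, one at W4A8.
total_bops = bops([(1000, 8, 8), (500, 4, 8)])  # 64000 + 16000 = 80000
```

In this toy setup the 8-bit SQNR comes out higher than the 4-bit one, which is the signal a mixed-precision search would use: layers whose SQNR drops sharply at low bit-widths are kept at higher precision, while the BOPs total tracks the efficiency cost of each assignment.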