A Practical Mixed Precision Algorithm for Post-Training Quantization

📝 Paper Summary

Post-Training Quantization (PTQ) Model Compression

A post-training quantization algorithm that automatically assigns layer-specific bit-widths by measuring sensitivity via Signal-to-Quantization-Noise-Ratio (SQNR) and performing a greedy search to maximize accuracy under efficiency budgets.

Core Problem

Assigning the same bit-width to all layers (homogeneous quantization) is inefficient because layers vary in sensitivity, but existing mixed-precision methods require expensive retraining or complex hyperparameter tuning.

Why it matters:

Standard quantization (e.g., 8-bit) often degrades accuracy significantly for compact networks like MobileNetV3 or Transformers like BERT/ViT.
Existing mixed-precision solutions often require access to full labeled training datasets, which may not be available during deployment due to privacy or storage constraints.
Manual selection of bit-widths for each layer is intractable given the vast search space of deep neural networks.

Concrete Example: In a Vision Transformer (ViT), standard W8A8 (8-bit weights/activations) quantization causes accuracy to crash to 18.83% due to outliers. The proposed mixed precision algorithm identifies sensitive layers, keeps them at higher precision, and recovers accuracy to 80.58%.

Key Novelty

SQNR-based Greedy Pareto Search

Uses Signal-to-Quantization-Noise-Ratio (SQNR) as a fast, label-free proxy to measure how much each layer's output is corrupted by quantization noise.
Employs a greedy strategy that starts with the model at maximum precision and iteratively lowers the bit-width of the least sensitive layer (highest SQNR) until an efficiency or accuracy budget is reached.
Speeds up the search with binary and interpolation search, which locate the target configuration on the Pareto frontier in a logarithmic number of model evaluations rather than one evaluation per candidate.
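The search speed-up can be illustrated with a minimal binary search sketch. It assumes the greedy ordering has already produced a sequence of configurations indexed by k (the number of precision reductions applied), and that accuracy does not increase as k grows. The function name and the `evaluate` callback are hypothetical and are not taken from the paper or from AIMET; interpolation search is not shown.

```python
from typing import Callable, Tuple


def max_flips_within_budget(num_steps: int,
                            evaluate: Callable[[int], float],
                            target_accuracy: float) -> Tuple[int, int]:
    """Return (k, calls): the largest k in [0, num_steps] whose accuracy
    still meets the target, and the number of model evaluations used.

    Assumption: evaluate(k) is non-increasing in k and evaluate(0) meets
    the target (k = 0 is the all-high-precision model).
    """
    calls = 0
    lo, hi = 0, num_steps  # invariant: configuration `lo` meets the target
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the interval always shrinks
        calls += 1
        if evaluate(mid) >= target_accuracy:
            lo = mid       # mid is acceptable, try more reductions
        else:
            hi = mid - 1   # mid is too aggressive, back off
    return lo, calls
```

With 100 candidate steps, a sequential scan may need up to 100 evaluations, while this loop needs at most 7. That gap is the kind of saving the summary attributes to the faster search.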

Architecture

Pseudocode of the two-phase algorithm: Sensitivity Analysis followed by Greedy Search.
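The two phases named above can be sketched as follows. This is an illustrative reading of the summary, not the paper's Algorithm 1: the candidate set uses the three bit-widths listed under Key Hyperparameters, and `measure_sqnr`, `sensitivity_analysis`, `greedy_search`, and the `macs` table are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Tuple

# Candidate name -> (weight bits, activation bits), highest precision first.
CANDIDATES: Dict[str, Tuple[int, int]] = {
    "W8A16": (8, 16), "W8A8": (8, 8), "W4A8": (4, 8),
}


def bops(macs: Dict[str, int], assignment: Dict[str, str]) -> int:
    """Sum of MACs * weight bits * activation bits over all groups."""
    return sum(macs[g] * CANDIDATES[c][0] * CANDIDATES[c][1]
               for g, c in assignment.items())


def sensitivity_analysis(groups: List[str],
                         measure_sqnr: Callable[[str, str], float]
                         ) -> Dict[Tuple[str, str], float]:
    """Phase 1: SQNR (dB) for each group at each lower-precision candidate.

    measure_sqnr(group, candidate) is assumed to quantize only that group
    and compare the network output against the full-precision output.
    """
    lower = list(CANDIDATES)[1:]
    return {(g, c): measure_sqnr(g, c) for g in groups for c in lower}


def greedy_search(macs: Dict[str, int],
                  sqnr: Dict[Tuple[str, str], float],
                  relative_budget: float) -> Dict[str, str]:
    """Phase 2: lower precision in order of decreasing SQNR until the
    BOPs budget (relative to the all-highest-precision model) is met."""
    order = list(CANDIDATES)
    assignment = {g: order[0] for g in macs}
    full = bops(macs, assignment)
    ranked = sorted(sqnr.items(), key=lambda kv: kv[1], reverse=True)
    for (group, cand), _ in ranked:
        if bops(macs, assignment) <= relative_budget * full:
            break
        # Only ever move a group towards lower precision.
        if order.index(cand) > order.index(assignment[group]):
            assignment[group] = cand
    return assignment
```

The loop can finish without meeting the budget if every group is already at the lowest candidate, so a caller would typically re-check `bops` on the returned assignment.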

Evaluation Highlights

+2.90 percentage points of Top-1 accuracy on MobileNetV3 compared to standard W8A8 (8-bit weights/activations) fixed precision.
Recovers BERT (MNLI) accuracy from 74.13% (W8A8) to 82.97% (Mixed Precision), nearly matching the FP32 baseline of 84.40%.
Reduces search time for mixed precision configurations from 14.1 hours (sequential search) to 0.4 hours (binary + interpolation search) for MobileNetV3.

Breakthrough Assessment

8/10

Provides a highly practical, data-efficient, and hyperparameter-free solution to a critical deployment problem. While not a new architecture, its ability to fix broken quantized models (like ViT/BERT) without retraining is significant for industry adoption.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) where a pre-trained full-precision network is converted to mixed bit-widths using a small calibration set.

Inputs: Full precision network weights W and a small unlabeled calibration dataset (e.g., 256 images).

Outputs: A mixed-precision quantized network with bit-width assignments {b_l} for each layer l satisfying a performance or efficiency budget.

Modeling

Base Model: Evaluated on ResNet18/50, MobileNetV2/V3, EfficientNet-lite/b0, DeepLabV3, BERT, ViT.

Training Method: Greedy Sensitivity-based Bit-width Selection (PTQ)

Objective Functions:

Purpose: Measure sensitivity of a quantizer q at bit-width b.

Formally: SQNR_{q,b} = 10 * log10( E[F(x)^2] / E[(F(x) - Q(F(x)))^2] )

Purpose: Efficiency budget metric.

Formally: BOPs = Sum( MAC(op_i) * bits(weight_i) * bits(act_i) )
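Both formulas can be checked numerically with a small sketch. Two simplifications are assumed here: the quantizer is a plain symmetric uniform quantizer applied directly to a list of values, whereas the summary's formula compares the network output F(x) with and without quantization, and all function names are hypothetical.

```python
import math
from typing import Iterable, List, Sequence, Tuple


def quantize_symmetric(x: Sequence[float], bits: int) -> List[float]:
    """Quantize-dequantize with a symmetric uniform grid scaled to max |x|."""
    max_abs = max(abs(v) for v in x)
    if max_abs == 0:
        return list(x)
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]


def sqnr_db(signal: Sequence[float], quantized: Sequence[float]) -> float:
    """10 * log10(signal power / noise power).

    Sums are used instead of means; the 1/N factors cancel in the ratio.
    """
    signal_power = sum(s * s for s in signal)
    noise_power = sum((s - q) ** 2 for s, q in zip(signal, quantized))
    if noise_power == 0:
        return math.inf
    return 10 * math.log10(signal_power / noise_power)


def bops(ops: Iterable[Tuple[int, int, int]]) -> int:
    """ops holds (MAC count, weight bits, activation bits) per operation."""
    return sum(m * w * a for m, w, a in ops)
```

A higher SQNR means less quantization noise relative to the signal, so a quantizer whose SQNR stays high at a low bit-width is a good candidate for reduced precision. For BOPs, moving one operation from W8A16 to W8A8 halves its contribution, since only the activation bit-width factor changes from 16 to 8.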

Adaptation: Post-training quantization only (no gradient updates to weights, only bit-width selection and scale calibration)

Training Data:

Calibration: 256 samples from ImageNet (CV) or GLUE (NLP)
Ablation studies use MS-COCO as out-of-domain calibration data

Key Hyperparameters:

calibration_size: 256 samples
candidate_bit_widths: W4A8, W8A8, W8A16 (or expanded set W4A4...W8A16)
quantization_scheme: Per-channel (weights), Symmetric/Asymmetric (activations)
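To make the quantization_scheme entry concrete, here is a minimal sketch of per-channel symmetric weight quantization and asymmetric activation quantization. It is a generic textbook formulation under simple min/max range setting, not the AIMET implementation, and the function names are assumptions made for the example.

```python
from typing import List, Sequence


def symmetric_scale(values: Sequence[float], bits: int) -> float:
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quant_symmetric(values: Sequence[float], bits: int,
                         scale: float) -> List[float]:
    """Quantize-dequantize on a signed grid centred at zero."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale
            for v in values]


def quantize_per_channel(weight: Sequence[Sequence[float]],
                         bits: int) -> List[List[float]]:
    """Each output channel (row) gets its own scale."""
    return [fake_quant_symmetric(ch, bits, symmetric_scale(ch, bits))
            for ch in weight]


def fake_quant_asymmetric(values: Sequence[float], bits: int) -> List[float]:
    """Quantize-dequantize on an unsigned grid with a zero-point offset."""
    lo = min(min(values), 0.0)  # the range must contain zero
    hi = max(max(values), 0.0)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return [(max(0, min(levels, round(v / scale) + zero_point)) - zero_point)
            * scale for v in values]
```

Per-channel scales matter when channels have very different ranges: with a single per-tensor scale, a channel whose weights are around 0.01 can collapse to zero if another channel reaches 2.0, while per-channel scales keep both resolvable.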

Compute: Search time is roughly 0.3 to 1.6 hours per model using binary + interpolation search. The exact GPU is not specified.

Comparison to Prior Work

vs. HAWQ: Uses SQNR (evaluation-based) instead of Hessian approximations (computationally expensive second-order info).
vs. DNAS/HAQ: Purely post-training search without reinforcement learning or retraining loops.
vs. Fixed Precision: Allocates bits dynamically to improve the Pareto trade-off.
vs. FracBits [cited]: Uses discrete hardware-friendly bit-widths rather than fractional bit-widths.

Limitations

The greedy approach assumes layer independence (mostly handled by Quantizer Groups, but interaction effects might be missed).
Requires hardware that supports mixed precision execution to realize efficiency gains.
Sensitivity metric (SQNR) is a proxy and might not perfectly correlate with task loss in all edge cases.

Reproducibility

Code: https://github.com/quic/aimet

Method is part of the open-source AIMET toolkit (https://github.com/quic/aimet). The paper provides detailed algorithms (Algorithm 1) and search space definitions. Standard datasets (ImageNet, GLUE) and models are used.

📊 Experiments & Results

Evaluation Setup

Post-training quantization on pre-trained models using a small calibration set.

Benchmarks:

ImageNet-1K (Image Classification)
GLUE (MNLI, RTE, etc.) (Natural Language Understanding)
Pascal VOC (Semantic Segmentation)

Metrics:

Top-1 Accuracy
Relative BOPs (Bit Operations relative to W8A16)
mIoU (for segmentation)

Statistical methodology: Kendall-Tau correlation is used to compare sensitivity metrics.
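As a reminder of what the Kendall-Tau correlation measures, the sketch below computes the tau-a variant (no tie correction) for two score lists over the same items, for example two sensitivity metrics scored on the same set of layers. How exactly the paper applies the statistic is not described in this summary, so the usage is an assumption and the function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs.

    A pair (i, j) is concordant when both lists order i and j the same
    way, discordant when they disagree. Tied pairs count as neither.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the two metrics rank every pair of layers identically, -1 means the rankings are exactly reversed, and values near 0 mean the rankings are unrelated.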

Key Results

Comparison between Fixed Precision (W8A8) and Mixed Precision (MP) shows recovery of accuracy in sensitive models.

Benchmark	Model	Metric	Baseline (W8A8)	This Paper (MP)	Δ
ImageNet	MobileNetV3	Top-1 Accuracy (%)	68.75	71.65	+2.90
GLUE (MNLI)	BERT	Accuracy (%)	74.13	82.97	+8.84
ImageNet	ViT	Top-1 Accuracy (%)	18.83	80.58	+61.75
ImageNet	(not stated)	Top-1 Accuracy (%)	9.55	61.81	+52.26

Run-time analysis of the search algorithm.

Benchmark	Metric	Baseline (sequential search)	This Paper (binary + interpolation search)	Δ
MobileNetV3 Search	Hours	14.1	0.4	-13.7

Experiment Figures

Pareto curves comparing how robust different sensitivity metrics (accuracy, SQNR, FIT) are to the choice of data subset.

Box plots of SQNR values for W8A8 quantizers across different networks.

Main Takeaways

Mixed precision is critical for networks with 'outlier' layers (ViT, MobileNetV3, BERT), where fixed quantization fails catastrophically.
For robust networks like ResNet18/50, mixed precision offers negligible gains over fixed W8A8, as layer sensitivities are uniform.
The SQNR sensitivity metric is robust: using out-of-domain data (MS-COCO) for ImageNet models produces similar Pareto curves to using in-domain data.
Integrating AdaRound into the mixed precision search further improves low-bit performance (e.g., W4A8 regimes).

📚 Prerequisite Knowledge

Prerequisites

Neural Network Quantization (symmetric vs asymmetric)
Post-Training Quantization (PTQ) vs Quantization-Aware Training (QAT)
Pareto Frontier

Key Terms

PTQ: Post-Training Quantization—converting a neural network to lower precision (e.g., integer) representation without full retraining, using only a small calibration dataset.

Mixed Precision: Assigning different bit-widths (e.g., 4-bit, 8-bit, 16-bit) to different layers of a network based on their sensitivity to noise.

SQNR: Signal-to-Quantization-Noise-Ratio—a metric used here to estimate a layer's sensitivity by comparing the power of the signal to the error introduced by quantization.

BOPs: Bit Operations—a metric for model efficiency calculated as the sum of MAC operations multiplied by their bit-widths, correlating with power consumption.

Quantizer Group: A set of weights and activations connected by shared operations (like element-wise add) that must share quantization parameters due to hardware constraints.

AdaRound: A specific PTQ algorithm that optimizes the rounding of weights (up or down) rather than just rounding to nearest, improving low-bit performance.

Pareto Frontier: The set of optimal solutions where no improvement can be made in one objective (e.g., accuracy) without sacrificing another (e.g., efficiency).
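The Pareto frontier definition above translates into a short filter. This sketch assumes each configuration is summarized as a (cost, accuracy) pair, where cost could be relative BOPs and lower is better; the function name and the sample numbers in the usage note are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def pareto_frontier(points: Sequence[Tuple[float, float]]
                    ) -> List[Tuple[float, float]]:
    """Keep the points that no other point beats on both objectives.

    Each point is (cost, accuracy): lower cost and higher accuracy are
    better. Sorting by cost (ties broken by higher accuracy first) lets a
    single pass keep only points that improve on the best accuracy so far.
    """
    frontier: List[Tuple[float, float]] = []
    best_accuracy = -math.inf
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier
```

For example, given made-up points (1.0, 80.0), (0.5, 79.0), (0.6, 75.0), (0.3, 70.0), and (0.5, 78.0), the point (0.6, 75.0) is dropped because (0.5, 79.0) is both cheaper and more accurate.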