COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

📝 Paper Summary

Post-Training Quantization (PTQ), Model Compression

COMQ is a backpropagation-free quantization algorithm that iteratively minimizes layer-wise reconstruction error by updating single weights or scaling factors in a greedy coordinate-descent manner.

Core Problem

Post-training quantization often suffers significant accuracy degradation, especially for Vision Transformers (ViTs), and existing remedies either require computationally expensive backpropagation (as in QAT and gradient-based PTQ) or rely on complex Hessian inversions.

Why it matters:

Deploying large models (ViTs, LLMs) on resource-constrained devices requires reducing memory footprint via low-bit quantization without retraining.
Existing PTQ methods struggle to balance high accuracy with computational efficiency, often failing to preserve performance in sensitive architectures like ViTs at low bit-widths (e.g., 4-bit).
Backpropagation-based methods are slow and memory-intensive, while simple round-to-nearest schemes leave large reconstruction error at low bit-widths.

Concrete Example: Round-to-nearest quantization of a 4-bit Vision Transformer might drop accuracy by over 50% because it rounds each weight independently and ignores the interplay between weights. COMQ instead adjusts one weight at a time to compensate for the errors introduced by quantizing the others, recovering nearly full accuracy.

Key Novelty

Coordinate Minimization for Quantization (COMQ)

Reformulates quantization as a coordinate descent problem where every quantized weight (integer code) and scaling factor is a variable to be optimized.
Updates one variable at a time (greedy selection) to minimize local reconstruction error while keeping others fixed, using a closed-form solution that requires only dot products and rounding.
Does not use backpropagation or Hessian inverses, making it computationally lightweight and hyperparameter-free.
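The closed-form single-variable update described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single output column `w`, a scalar scale `delta`, calibration inputs `X` of shape (samples, features) with no all-zero column, and the hypothetical function name `update_bitcode`. With every other code fixed, the objective is a one-dimensional quadratic in the chosen code, so its minimizer over the integer grid is the real-valued minimizer rounded and clipped.

```python
import numpy as np

def update_bitcode(X, w, q, delta, i, qmin, qmax):
    """Return the best integer code for coordinate i with all other codes fixed.

    Minimizes ||X w - delta * X q||^2 over q[i] only (illustrative sketch).
    """
    x_i = X[:, i]
    # Part of the full-precision output that coordinate i alone must explain:
    # X w - delta * sum_{j != i} q_j x_j
    target = X @ w - delta * (X @ q) + delta * q[i] * x_i
    # Real-valued minimizer of ||target - delta * t * x_i||^2 (needs only dot products)
    t = (x_i @ target) / (delta * (x_i @ x_i))
    # Project onto the integer grid: round, then clip to the representable range
    return int(np.clip(np.rint(t), qmin, qmax))
```

Only two dot products and one rounding are needed per update, which is consistent with the summary's description of the method as free of backpropagation and Hessian inverses.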

Architecture

Illustration of the coordinate-wise update process. It shows a weight matrix W and input X. The algorithm selects one element of the quantized weight matrix Q (an integer bit-code) or the scaling factor delta to update while fixing all others.

Evaluation Highlights

Achieves 4-bit quantization of Vision Transformers (DeiT-S, DeiT-B, ViT-B) with <1% Top-1 accuracy loss compared to full precision.
Maintains near-lossless accuracy (a Top-1 drop as small as 0.05 percentage points) for 4-bit quantization of ResNet-18 and ResNet-50 on ImageNet.
Outperforms state-of-the-art PTQ methods (like BRECQ, QDrop, PD-Quant) on ViTs with per-channel quantization.

Breakthrough Assessment

8/10

Offers a highly efficient, mathematically grounded alternative to backpropagation-based PTQ. It achieves SOTA results on difficult ViT architectures without the complexity of Hessian-based methods or the cost of QAT.

⚙️ Technical Details

Problem Definition

Setting: Layer-wise weight quantization minimizing the squared Frobenius norm of the reconstruction error between full-precision and quantized outputs.

Inputs: Pre-trained floating-point weight matrix W and layer input calibration data X.

Outputs: Quantized weight matrix W_q decomposed into scaling factor(s) delta and integer bit-codes Q.

Pipeline Flow

Initialization (Initialize scaling factor delta and bit-codes Q)
Iterative Update Loop (Update bit-codes Q and delta)
Final Quantized Model Generation
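The three pipeline stages can be sketched for a single layer as below. This is a simplified illustration under stated assumptions, not the released implementation: it uses one scale per layer, a plain cyclic update order instead of the greedy one, and an initializer that divides the average column-wise infinity norm by the largest positive code (the exact initialization formula is an assumption). The name `comq_like_layer` is hypothetical.

```python
import numpy as np

def comq_like_layer(X, W, bits=4, sweeps=2):
    """Quantize one layer: initialize, run coordinate sweeps, return (delta, Q)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Stage 1 (assumed form): scale from the average infinity norm of the columns
    delta = float(np.mean(np.max(np.abs(W), axis=0)) / qmax)
    Q = np.clip(np.rint(W / delta), qmin, qmax)
    Y = X @ W                        # full-precision output, the fixed target
    col_sq = np.sum(X * X, axis=0)   # ||x_i||^2 for every input feature
    # Stage 2: alternate bit-code sweeps and a scale refit
    for _ in range(sweeps):
        R = Y - delta * (X @ Q)      # residual for the current codes
        for i in range(W.shape[0]):
            x_i = X[:, i]
            # Take row i's contribution out, re-solve it in closed form, put it back
            R += delta * np.outer(x_i, Q[i])
            Q[i] = np.clip(np.rint((x_i @ R) / (delta * col_sq[i])), qmin, qmax)
            R -= delta * np.outer(x_i, Q[i])
        # Scale refit: one-dimensional least squares with the codes held fixed
        XQ = X @ Q
        denom = np.sum(XQ * XQ)
        if denom > 0:
            delta = float(np.sum(XQ * Y) / denom)
    # Stage 3: the quantized weights are delta * Q
    return delta, Q.astype(np.int64)
```

Because every step is an exact minimization over one variable (or one row of independent variables), the reconstruction error never increases from its round-to-nearest starting point.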

System Modules

Initializer

Sets the initial scaling factor from the average infinity norm of the weights, which damps the influence of outliers.

Coordinate Solver (Bit-code) (Iterative Update)

Updates a single element (or column) of the bit-code matrix Q to minimize reconstruction error.

Coordinate Solver (Scale) (Iterative Update)

Updates the floating-point scaling factor delta given fixed bit-codes.
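With the bit-codes fixed, the objective is quadratic in the scale, so the update has a least-squares closed form. The sketch below shows both a per-layer and a per-channel variant; the function names are hypothetical and the code assumes `X` has shape (samples, features) and `W`, `Q` have shape (features, outputs).

```python
import numpy as np

def update_scale(X, W, Q):
    """Per-layer scale minimizing ||X W - delta * X Q||_F^2 for fixed Q."""
    XQ = X @ Q
    denom = np.sum(XQ * XQ)
    if denom == 0:
        raise ValueError("X @ Q is all zeros; the scale is undetermined")
    return float(np.sum(XQ * (X @ W)) / denom)

def update_scale_per_channel(X, W, Q):
    """One scale per output column, each solved independently."""
    XQ = X @ Q
    denom = np.sum(XQ * XQ, axis=0)
    if np.any(denom == 0):
        raise ValueError("some column of X @ Q is all zeros")
    return np.sum(XQ * (X @ W), axis=0) / denom
```

The per-channel variant can only match or lower the error of the per-layer one, since each column is free to pick its own optimum.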

Greedy Selector (Iterative Update)

Decides the order in which to update variables (bit-codes vs. scale) to maximize error reduction.
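One way to read "maximize error reduction" is to evaluate the best closed-form move for every coordinate and apply only the one that lowers the objective most. The sketch below implements that reading for a single output column with a fixed scale; it is an assumption about the selection rule, and the paper's exact ordering criterion may differ. The name `greedy_step` is hypothetical.

```python
import numpy as np

def greedy_step(X, w, q, delta, qmin, qmax, tol=1e-12):
    """Apply the single bit-code change that most reduces ||X w - delta * X q||^2.

    Returns (new_q, index), or (q, None) when no single move improves the error.
    """
    r = X @ w - delta * (X @ q)        # current residual
    g = X.T @ r                        # <x_i, r> for every coordinate
    col_sq = np.sum(X * X, axis=0)     # ||x_i||^2
    # Closed-form optimum for each coordinate taken on its own
    cand = np.clip(np.rint(g / (delta * col_sq) + q), qmin, qmax)
    step = cand - q
    # Error change if only coordinate i moves: ||r - delta*step_i*x_i||^2 - ||r||^2
    change = -2.0 * delta * step * g + (delta * step) ** 2 * col_sq
    i = int(np.argmin(change))
    if change[i] >= -tol:
        return q, None
    q_new = q.copy()
    q_new[i] = cand[i]
    return q_new, i
```

Scoring every coordinate before each move is what makes this more expensive than a cyclic sweep, where the next coordinate is known in advance.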

Novel Architectural Elements

Greedy coordinate-descent strategy applied specifically to the discrete problem of weight quantization.
Joint optimization of integer bit-codes and floating-point scalars without gradient descent or Hessian approximation.

Modeling

Base Model: Evaluated on ResNet-18/50, MobileNetV2, RegNetX (CNNs) and ViT-B, DeiT-S/B, Swin-S/B (Transformers).

Training Method: Coordinate Descent (COMQ)

Objective Functions:

Purpose: Minimize the difference between the output of the full-precision layer and the quantized layer.

Formally: min over (delta, Q) of || X*W_q - X*W ||_F^2, where W_q = delta * Q.
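The objective is straightforward to evaluate numerically. The sketch below computes it for a round-to-nearest baseline and also shows that the squared Frobenius norm splits into a sum of independent per-output-column errors, which is why each output channel can be solved on its own once its scale is given. Function names and shapes are illustrative assumptions: `X` is (samples, features), `W` and `Q` are (features, outputs), and `delta` is a scalar or a vector with one entry per output column.

```python
import numpy as np

def round_to_nearest(W, delta, qmin, qmax):
    """Baseline bit-codes: round each weight independently, then clip."""
    return np.clip(np.rint(W / delta), qmin, qmax)

def reconstruction_error(X, W, delta, Q):
    """Squared Frobenius norm of X @ W_q - X @ W with W_q = delta * Q."""
    return float(np.sum((X @ (delta * Q) - X @ W) ** 2))

def per_column_error(X, W, delta, Q):
    """The same objective, kept separate for each output column."""
    return np.sum((X @ (delta * Q) - X @ W) ** 2, axis=0)
```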

Adaptation: Post-training calibration only (no fine-tuning of original weights)

Training Data:

1024 randomly sampled images from ImageNet training set for calibration.

Key Hyperparameters:

iterations: Typically 1-2 rounds (epochs) over all weights is sufficient.
batch_size: 32 (for calibration data processing)
learning_rate: None (analytical solution)

Compute: Wall-clock time is not explicitly reported in the paper; the method is described as faster than backpropagation-based methods.

Comparison to Prior Work

vs. AdaRound/BRECQ/QDrop: COMQ is backpropagation-free and does not require hyperparameter tuning (LR, epochs, regularizers).
vs. OBC: COMQ does not require computing or inverting the Hessian matrix, reducing memory and complexity.
vs. QuantEase [not cited in paper]: Similar coordinate descent approach, but COMQ specifically introduces a greedy update order and learnable scaling factors tailored for ViTs.

Limitations

The greedy selection strategy adds computational overhead relative to cyclic updates because all candidates must be checked.
Requires access to calibration data (not data-free).
Main focus is weight quantization; activation quantization relies on existing techniques.

Reproducibility

Code: https://github.com/AozhongZhang/COMQ

Uses the standard ImageNet dataset for calibration. The method is hyperparameter-free (no learning rate or weight decay to tune).

📊 Experiments & Results

Evaluation Setup

Image Classification on ImageNet (ILSVRC-2012).

Benchmarks:

ImageNet (Image Classification)

Metrics:

Top-1 Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison on Convolutional Neural Networks (ResNet/MobileNet/RegNet) shows COMQ matches or exceeds SOTA backprop-based methods.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
ImageNet	Top-1 Accuracy	71.47	71.42	-0.05
ImageNet	Top-1 Accuracy	71.11	71.42	+0.31
ImageNet	Top-1 Accuracy	71.36	71.42	+0.06

Comparison on Vision Transformers (ViT/DeiT/Swin) demonstrates significant robustness at 4-bit settings.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
ImageNet	Top-1 Accuracy	84.54	83.84	-0.70
ImageNet	Top-1 Accuracy	81.55	83.84	+2.29
ImageNet	Top-1 Accuracy	82.57	83.84	+1.27

Experiment Figures

Comparison of different update orders (Cyclic vs. Greedy) on quantization error reduction.

Main Takeaways

COMQ consistently matches or beats gradient-based and Hessian-based PTQ methods across both CNNs and ViTs.
The greedy coordinate update strategy is particularly effective for lower bit-widths (e.g., 3-bit, 4-bit) where sensitivity to individual weight errors is higher.
The method is robust to different architectures (CNN vs Transformer) without needing architecture-specific hyperparameter tuning.

📚 Prerequisite Knowledge

Prerequisites

Post-Training Quantization (PTQ) concepts
Coordinate Descent optimization
Matrix factorization / linear algebra
Basic understanding of Neural Network layers (Linear, Conv)

Key Terms

PTQ: Post-Training Quantization—compressing a pre-trained model to lower precision (e.g., 8-bit, 4-bit) without full retraining.

QAT: Quantization-Aware Training. Simulates quantization during training so the model can adapt, usually yielding better accuracy but requiring full training resources.

Coordinate Descent: An optimization algorithm that minimizes a function by successively minimizing along coordinate directions (one variable at a time).

Hessian: A square matrix of second-order partial derivatives of a scalar-valued function; commonly used in optimization to determine curvature but expensive to compute/invert.

ViT: Vision Transformer—a model architecture based on the Transformer mechanism applied to image patches.

Bit-code: The integer representation of a weight in a quantized model (e.g., an integer from -8 to 7 for 4-bit signed quantization).

Calibration data: A small set of real data samples used to statistically adjust quantization parameters post-training, without using the full training set.

Greedy selection: Choosing the next variable to update based on which one offers the maximum immediate reduction in the objective function.
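The bit-code range quoted under Key Terms follows directly from the bit-width. A small helper (hypothetical name `bitcode_range`, assuming two's-complement style signed codes) makes the arithmetic explicit:

```python
def bitcode_range(bits, signed=True):
    """Return (min_code, max_code) for an integer bit-code of the given width."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    if signed:
        # Half of the 2**bits codes are negative, the rest cover 0 and the positives
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1

print(bitcode_range(4))                # (-8, 7)
print(bitcode_range(4, signed=False))  # (0, 15)
```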