RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization

📝 Paper Summary

Post-Training Quantization (PTQ), Model Compression, Transformer Quantization

RepQuant decouples quantization from inference by employing complex, distribution-aware quantizers during calibration and converting them into efficient, hardware-friendly quantizers for inference via mathematically equivalent scale reparameterization.

Core Problem

Large transformer models exhibit extreme activation distributions (severe inter-channel variations in LayerNorm, power-law characteristics in Softmax) that cause significant accuracy loss when quantized with simple hardware-friendly quantizers.

Why it matters:

Existing Post-Training Quantization (PTQ) methods over-prioritize hardware compatibility during the calibration phase, leading to poor representation of outliers
Low-bit quantization (e.g., 4-bit) of large transformers typically fails or degrades performance catastrophically without retraining
Large models (ViT, LLaMA) are computationally expensive; efficient inference on standard hardware requires simplified quantization schemes (layer-wise, log2) which normally sacrifice accuracy

Concrete Example: In DeiT-S, LayerNorm activation ranges vary by 33x across channels (0.8 vs 26.5). Forcing a single layer-wise quantization scale (hardware-friendly) lumps these distinct distributions together, causing massive error. RepQuant uses channel-wise quantization first to capture this, then mathematically folds the scales into weights to allow layer-wise inference.
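
The effect described in this example can be illustrated with a minimal NumPy sketch. The activations are synthetic (two channels whose ranges are roughly 0.8 and 26.5, matching the figures quoted above), and quantize_uniform and minmax_params are illustrative helper names, not functions from the paper. The sketch only compares per-channel error under a single layer-wise scale against per-channel scales at 4 bits.

```python
import numpy as np

def quantize_uniform(x, scale, zero_point, bits=4):
    """Asymmetric uniform quantize-dequantize; scale/zero_point broadcast over x."""
    qmax = 2 ** bits - 1
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def minmax_params(x, axis=None, bits=4):
    """Scale and zero-point from the min/max of x along `axis` (None = whole tensor)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    zero_point = np.round(-lo / scale)
    return scale, zero_point

rng = np.random.default_rng(0)
# Synthetic activations: 256 tokens x 2 channels with ranges near 0.8 and 26.5.
half_range = np.array([0.4, 13.25])
x = rng.uniform(-1.0, 1.0, size=(256, 2)) * half_range

s_layer, z_layer = minmax_params(x, axis=None)  # one scale for the whole tensor
s_chan, z_chan = minmax_params(x, axis=0)       # one scale per channel

err_layer = np.mean((x - quantize_uniform(x, s_layer, z_layer)) ** 2, axis=0)
err_chan = np.mean((x - quantize_uniform(x, s_chan, z_chan)) ** 2, axis=0)
print("layer-wise MSE per channel:  ", err_layer)
print("channel-wise MSE per channel:", err_chan)
```

With a single layer-wise scale, the step size is set by the wide channel, so the narrow channel collapses onto very few quantization levels; per-channel scales avoid this, which is the gap the reparameterization is meant to close at inference time.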

Key Novelty

Quantization-Inference Decoupling via Scale Reparameterization

Use complex quantizers (channel-wise for LayerNorm, Log-sqrt2 for Softmax) during the calibration phase to accurately capture extreme distributions.
Transform these complex quantizers into simplified ones (layer-wise, Log2) for the inference phase through mathematically equivalent reparameterization of weights and biases.
Introduce learnable per-channel dual clipping to identify fine-grained upper and lower bounds for outliers, minimizing quantization error in the integer space.

Architecture

Overview of the RepQuant framework, contrasting the Quantization Process with the Inference Process.

Evaluation Highlights

+30.7 percentage points of top-1 accuracy on ImageNet (ViT-S, W4/A4) over the PTQ4ViT baseline (73.28% vs 42.57%).
+30.3 mAP improvement on COCO Object Detection (Mask R-CNN with Swin-T, W4/A4) compared to PTQ4ViT (37.2 vs 6.9).
Achieves <1% accuracy drop compared to full-precision models for most Vision Transformers in W6/A6 settings.

Breakthrough Assessment

8/10

Significantly outperforms existing PTQ methods in low-bit settings by resolving the conflict between accurate outlier representation and efficient hardware inference.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of pre-trained Transformer models (Vision and Language)

Inputs: Pre-trained full-precision model weights and a small calibration dataset

Outputs: Quantized model with low-bit weights and activations (e.g., 4-bit, 6-bit) optimized for efficient inference

Pipeline Flow

LayerNorm Activations: Channel-wise Quantization → Learn Dual Clipping → Scale Reparameterization → Layer-wise Quantization
Softmax Activations: Log-sqrt2 Quantization → Scale Reparameterization → Log2 Quantization
Weights: GPTQ (Quantized Weight Reconstruction)
Other Activations: Uniform Quantization

System Modules

LayerNorm Activation Quantizer

Capture inter-channel variations using channel-wise scales and learnable dual clipping bounds

Model or implementation: Channel-wise Uniform Quantizer + Per-channel Dual Clipping

LayerNorm Reparameterizer (Transformation Phase)

Convert channel-wise scales to layer-wise by adjusting LayerNorm affine factors and next-layer weights

Model or implementation: Mathematical Transformation
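
The paper's exact formulas (Eq 9-17) are not reproduced in this summary, so the following is only a sketch of one reparameterization with the stated property. It assumes the layer-wise target scale and zero-point are the channel means, and the names r1, r2, gamma_new, beta_new, W_new and b_new are illustrative. The per-channel ratios are folded into the LayerNorm affine factors and compensated in the next linear layer, and the sketch checks that the layer output is unchanged and that layer-wise quantization of the new activations yields the same integer codes as channel-wise quantization of the old ones.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_out = 8, 4, 3

# Hypothetical LayerNorm affine parameters and next-layer linear weights.
gamma = rng.uniform(0.5, 2.0, size=d_in)
beta = rng.normal(size=d_in)
W = rng.normal(size=(d_in, d_out))
b = rng.normal(size=d_out)

# Channel-wise scales and zero-points from calibration (made-up values).
s = np.array([0.05, 0.2, 0.9, 1.7])
z = np.array([3.0, 7.0, 8.0, 6.0])

# Layer-wise targets: the channel means (an assumption of this sketch).
s_layer, z_layer = s.mean(), np.round(z.mean())
r1 = s / s_layer   # per-channel scale ratio
r2 = z - z_layer   # per-channel zero-point shift (integer-valued)

# Fold the ratios into LayerNorm so its output becomes (X + s * r2) / r1 ...
gamma_new = gamma / r1
beta_new = (beta + s * r2) / r1
# ... and compensate in the next linear layer so its output is unchanged.
W_new = r1[:, None] * W
b_new = b - (s * r2) @ W

x_hat = rng.normal(size=(tokens, d_in))   # normalized input to the affine step
x = x_hat * gamma + beta                  # original LayerNorm output
x_new = x_hat * gamma_new + beta_new      # reparameterized LayerNorm output

y = x @ W + b
y_new = x_new @ W_new + b_new

# Integer codes (clipping omitted): channel-wise on x vs layer-wise on x_new.
q_chan = np.round(x / s) + z
q_layer = np.round(x_new / s_layer) + z_layer
print(np.allclose(y, y_new), np.array_equal(q_chan, q_layer))
```

The weight update in this sketch is the reason the pipeline runs weight reconstruction after reparameterization: the next layer's weights have already changed by the time they are quantized.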

Softmax Reparameterizer (Transformation Phase)

Convert Log-sqrt2 quantization (dense) to Log2 (hardware friendly)

Model or implementation: Base Change Transformation
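
The base change rests on the identity 2^(-q/2) = 2^floor(-q/2) * (sqrt(2) if q is odd, else 1). The sketch below checks that identity numerically on synthetic softmax outputs; the function names are illustrative, and how the paper absorbs the parity-dependent factor into the quantization scale is not shown here.

```python
import numpy as np

def log_sqrt2_quantize(a, bits=4):
    """Integer codes q with a ~= 2**(-q/2), i.e. levels are powers of 1/sqrt(2)."""
    q = np.round(-2.0 * np.log2(a))
    return np.clip(q, 0, 2 ** bits - 1).astype(np.int64)

def dequant_direct(q):
    """Reference dequantization in base sqrt(2)."""
    return np.power(2.0, -q / 2.0)

def dequant_log2_form(q):
    """Same value written as a power of two times a parity-dependent constant."""
    shift = np.floor_divide(-q, 2)                    # floor(-q / 2), integer exponent
    factor = np.where(q % 2 == 1, np.sqrt(2.0), 1.0)  # sqrt(2) for odd codes only
    return np.power(2.0, shift) * factor

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16)) * 3.0
a = np.exp(logits - logits.max(axis=-1, keepdims=True))
a /= a.sum(axis=-1, keepdims=True)                    # synthetic softmax rows

q = log_sqrt2_quantize(a)
print(np.allclose(dequant_direct(q), dequant_log2_form(q)))
```

Because the exponent in the second form is an integer, the power-of-two part can be applied as a bit shift, while the finer sqrt(2) spacing is retained for representing small attention probabilities.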

Novel Architectural Elements

Quantization-Inference Decoupling Paradigm: Using different quantizers for calibration vs. inference
Sequential Pipeline integration where LayerNorm reparameterization updates next-layer weights BEFORE GPTQ weight reconstruction

Modeling

Base Model: Evaluated on ViT (S/B), DeiT (T/S/B), Swin (S/B), LLaMA, OPT, CLIP

Training Method: Per-channel dual clipping optimization (activation) + GPTQ (weights)

Objective Functions:

Purpose: Find optimal clipping bounds for LayerNorm activations.

Formally: argmin_{alpha1, alpha2} ||X' - X'_quant(alpha1, alpha2)||^2, where alpha1 and alpha2 are the two per-channel clipping bounds (lower and upper).
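
To make the objective concrete, the sketch below evaluates it per channel on synthetic heavy-tailed data. Note the substitution: a coarse grid search over shrink ratios stands in for the learned, Adam-optimized clipping described in this summary, purely to keep the example short. The function names, the ratio grid, and the assumption that every channel spans zero are all choices of this sketch.

```python
import numpy as np

def fake_quant(x, lo, hi, bits=4):
    """Clip x to [lo, hi], then uniformly quantize and dequantize."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    q = np.round((np.clip(x, lo, hi) - lo) / scale)
    return q * scale + lo

def search_dual_clipping(x, bits=4, ratios=np.linspace(0.5, 1.0, 11)):
    """Per-channel grid search over independent lower/upper clipping ratios."""
    n_channels = x.shape[1]
    best = np.zeros((n_channels, 2))
    for c in range(n_channels):
        col = x[:, c]
        lo0, hi0 = col.min(), col.max()   # assumes lo0 < 0 < hi0
        best_err = np.inf
        for a1 in ratios:                 # shrink factor for the lower bound
            for a2 in ratios:             # shrink factor for the upper bound
                lo, hi = a1 * lo0, a2 * hi0
                err = np.sum((col - fake_quant(col, lo, hi, bits)) ** 2)
                if err < best_err:
                    best_err, best[c] = err, (lo, hi)
    return best

rng = np.random.default_rng(0)
# Heavy-tailed synthetic activations with very different channel ranges.
x = rng.standard_t(df=3, size=(512, 3)) * np.array([0.2, 1.0, 5.0])

bounds = search_dual_clipping(x)
err_clip = sum(np.sum((x[:, c] - fake_quant(x[:, c], *bounds[c])) ** 2) for c in range(3))
err_full = sum(np.sum((x[:, c] - fake_quant(x[:, c], x[:, c].min(), x[:, c].max())) ** 2)
               for c in range(3))
print(bounds, err_clip, err_full)
```

The lower and upper bounds are searched independently, which is the point of dual clipping: an asymmetric distribution can give up a long tail on one side without sacrificing resolution on the other.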

Adaptation: Scale reparameterization (weight/bias adjustment)

Key Hyperparameters:

clipping_optimization_steps: 100
clipping_learning_rate: 0.01
optimizer: Adam
calibration_samples_vision: 1024
calibration_samples_language: 128

Compute: Single NVIDIA A6000 GPU for calibration

Comparison to Prior Work

vs. SmoothQuant: SmoothQuant adjusts distributions in FP space; RepQuant quantizes first (channel-wise) then aligns via reparameterization, minimizing bias in the actual quantization space.
vs. PTQ4ViT/FQ-ViT: These use simple hardware-friendly quantizers during calibration; RepQuant uses complex quantizers during calibration and simplifies them for inference.
vs. RepQ-ViT: RepQuant adds learnable dual clipping and integrates GPTQ sequentially (reconstructing weights after reparameterization) rather than re-calibrating.

Limitations

The provided paper text contains numeric results only for vision models (ViT/DeiT/Swin), although the methodology claims applicability to LLaMA/OPT.
Requires reparameterization of weights, which modifies the model parameters (unlike pure activation quantization approaches).
Calibration process involves a lightweight optimization step (dual clipping), slightly slower than analytical clipping methods.

Reproducibility

Code availability: not provided. The paper describes the algorithms and reparameterization formulas in detail (Eq 9-17) and uses standard datasets (ImageNet, COCO).

📊 Experiments & Results

Evaluation Setup

Post-Training Quantization on Image Classification and Object Detection

Benchmarks:

ImageNet (Image Classification)
COCO (Object Detection & Instance Segmentation)

Metrics:

Top-1 Accuracy
Box AP (Average Precision)
Mask AP
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet	Top-1 Accuracy	42.57	73.28	+30.71
ImageNet	Top-1 Accuracy	69.03	75.21	+6.18
ImageNet	Top-1 Accuracy	67.48	78.46	+10.98
COCO	Box AP	6.9	37.2	+30.3
COCO	Box AP	47.7	50.4	+2.7
ImageNet	Top-1 Accuracy	81.39	80.51	-0.88

Experiment Figures

Boxplots of LayerNorm activation ranges across channels for DeiT-S and LLaMA-7B.

Main Takeaways

Decoupling quantization (complex) from inference (simple) via reparameterization yields large gains in low-bit (4-bit) settings, where baseline accuracy often collapses (e.g., 42.57% for PTQ4ViT on ViT-S at W4/A4).
LayerNorm and Softmax are the primary bottlenecks; treating them with specialized quantizers (channel-wise/log-sqrt2) is critical.
The method is robust across varying transformer architectures (ViT, DeiT, Swin) and tasks (Classification, Detection).
Sequential pipeline integration with GPTQ allows weight reconstruction to adapt to the reparameterized activations.

📚 Prerequisite Knowledge

Prerequisites

Principles of model quantization (scales, zero-points, bit-width)
Transformer architecture (LayerNorm, Softmax, Multi-Head Attention)
Basic linear algebra (affine transformations)

Key Terms

PTQ: Post-Training Quantization—compressing a model using a small calibration set without full retraining

LayerNorm: Layer Normalization (normalizes each token's activations across the feature dimension); in Transformers, its output often shows severe inter-channel variation and extreme outliers

Softmax: Activation function converting scores to probabilities; in Transformers, it exhibits a power-law distribution where a few values dominate

GPTQ: A state-of-the-art weight quantization method that reconstructs weights to minimize error layer-by-layer

Scale Reparameterization: Mathematically transforming model weights/biases to change the quantization scale requirements without altering the output

Channel-wise quantization: Using a separate quantization scale for each channel (accurate but expensive)

Layer-wise quantization: Using a single quantization scale for an entire layer (efficient but less accurate)

Log2 quantization: Quantization where levels are powers of 2, allowing multiplication to be replaced by bit-shifts
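
A small illustration of the bit-shift property, using made-up non-negative integer values: multiplying by 2^(-q) and flooring gives the same result as shifting right by q bits.

```python
import numpy as np

v = np.array([96, 40, 200, 16], dtype=np.int64)  # integer operands (illustrative)
q = np.array([0, 1, 3, 4], dtype=np.int64)       # log2 codes: factor is 2**(-q)

by_multiply = np.floor(v * np.power(2.0, -q)).astype(np.int64)
by_shift = np.right_shift(v, q)                  # same result without a multiplier
print(by_multiply, by_shift)
```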

mAP: Mean Average Precision—a key metric for object detection performance