APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers

📝 Paper Summary

Model Compression | Efficient Deep Learning | Vision Transformers (ViT)

APHQ-ViT improves low-bit vision transformer quantization by replacing inaccurate Hessian approximations with a direct perturbation-based metric and substituting hard-to-quantize GELU activations with ReLU during reconstruction.

Core Problem

Vision Transformers suffer severe accuracy drops under ultra-low bit post-training quantization due to inaccurate output importance estimation and the difficult distribution of post-GELU activations.

Why it matters:

Standard MSE loss treats all tokens equally, ignoring the critical class token and channel variations essential for ViT performance
Existing Hessian-based metrics rely on Fisher Information Matrix approximations that fail when the model doesn't fit the data distribution well
Post-GELU activations have highly imbalanced distributions (dense negative values vs. sparse positive values) and large ranges, causing massive quantization errors

Concrete Example: In post-GELU activations, negative values cluster densely in [-0.17, 0] while positive values are sparse and reach up to 40. Standard uniform quantizers waste bins on empty regions or clip important outliers, degrading accuracy significantly.
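
The effect described in the concrete example can be reproduced on synthetic data. The sketch below is only an illustration: the activations are randomly generated stand-ins (not real ViT activations), the 99:1 split between negative and positive values is an assumption, and `uniform_quantize` is a hypothetical min-max quantizer written for this example. It shows that with 4 bits the step size is far larger than the whole negative range, so every negative value collapses onto a single level.

```python
import random

def uniform_quantize(values, n_bits):
    """Min-max asymmetric uniform quantizer; returns dequantized values and the step size."""
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1
    step = (hi - lo) / levels
    dequantized = []
    for v in values:
        index = min(max(round((v - lo) / step), 0), levels)
        dequantized.append(lo + index * step)
    return dequantized, step

random.seed(0)
# Synthetic stand-in for post-GELU activations (assumed 99:1 split, not measured data).
negatives = [random.uniform(-0.17, 0.0) for _ in range(9900)]
positives = [random.uniform(0.0, 40.0) for _ in range(100)]
activations = negatives + positives

dequantized, step = uniform_quantize(activations, n_bits=4)
# The first len(negatives) entries of `dequantized` correspond to the negative cluster.
negative_levels = {dequantized[i] for i in range(len(negatives))}

print(f"step size: {step:.3f}")
print(f"distinct levels used by the negative cluster: {len(negative_levels)}")
```

Because the step is set by the rare large positives, the dense negative cluster carries no information after quantization, which is the imbalance the summary attributes to post-GELU activations.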

Key Novelty

Average Perturbation Hessian (APH) & MLP Reconstruction (MR)

Calculates output sensitivity (Hessian) directly by perturbing outputs and measuring loss changes, removing errors from Fisher Information approximations used in prior work
Replaces the quantization-unfriendly GELU activation with ReLU during an MLP reconstruction stage, utilizing knowledge distillation to align the ReLU network's behavior with the original GELU network
Applies a clamping loss to restrict activation ranges, making the substituted ReLU activations much easier to quantize linearly
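
A clamping penalty of the kind described in the last point can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `clamp_penalty`, the clipping interval `[0, upper]`, and the toy values are assumptions; only the general form (a weighted squared distance between each activation and its clipped value) and the weight of 2 come from this summary.

```python
def clamp_penalty(activations, upper, alpha=2.0):
    """alpha times the sum of squared distances between activations and their clipped values."""
    total = 0.0
    for a in activations:
        clipped = min(max(a, 0.0), upper)  # assumed clipping interval [0, upper]
        total += (clipped - a) ** 2
    return alpha * total

post_relu = [0.0, 0.5, 1.2, 3.0, 9.0]  # toy post-ReLU activations
penalty = clamp_penalty(post_relu, upper=4.0)
print(penalty)
```

Only the value 9.0 lies outside the interval, so it alone contributes to the penalty; activations already inside the range cost nothing, which pushes the reconstructed MLP toward a compact activation range.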

Architecture

The overall framework of APHQ-ViT, illustrating the block-wise quantization pipeline.

Evaluation Highlights

Outperforms state-of-the-art PTQ4ViT by roughly 1 to 30 percentage points of accuracy on ImageNet classification across various ViT architectures (ViT-S/B, DeiT-S/B, Swin-S/B) under 3-bit and 4-bit settings
Achieves 78.43% top-1 accuracy on ViT-B with 4-bit quantization, surpassing the previous best method (PTQ4ViT) by 3.65 percentage points
Demonstrates robust generalization to object detection (COCO) and instance segmentation tasks, where prior Hessian approximations often fail

Breakthrough Assessment

8/10

Significantly advances ultra-low bit (3-bit/4-bit) quantization for ViTs by fixing fundamental theoretical flaws in Hessian estimation and proposing a practical architectural substitution (ReLU for GELU) that simplifies the quantization landscape.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Vision Transformers using a small unlabeled calibration dataset

Inputs: Pre-trained floating-point Vision Transformer model and a small calibration set (e.g., 1024 images)

Outputs: Quantized model with low-bit weights and activations (e.g., W3A3, W4A4)

Pipeline Flow

MLP Reconstruction: Replace GELU with ReLU and retrain MLP weights via distillation
Quantization Reconstruction: Optimize weights/activation quantization parameters block-by-block

System Modules

MLP Reconstruction (MR)

Modify the network architecture to be more quantization-friendly by swapping activation functions

Model or implementation: MLP blocks within ViT
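
To make the activation swap concrete, the sketch below compares the output range of the two functions on a grid of inputs. It uses the exact erf-based GELU definition; the grid bounds and resolution are arbitrary choices for this illustration.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(x, 0.0)

# Evaluate both activations on a grid over [-6, 6] with step 0.001.
grid = [i / 1000.0 for i in range(-6000, 6001)]
gelu_min = min(gelu(x) for x in grid)
relu_min = min(relu(x) for x in grid)

print(f"minimum GELU output: {gelu_min:.4f}")
print(f"minimum ReLU output: {relu_min:.4f}")
```

GELU produces a narrow band of negative outputs bounded at about -0.17, while ReLU outputs are never negative, so a ReLU-based MLP needs no quantization levels below zero.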

Average Perturbation Hessian (APH) Estimator

Calculate the importance of each output neuron to guide reconstruction

Model or implementation: Mathematical operator

Quantization Reconstruction

Fine-tune rounding and quantization parameters (step sizes)

Model or implementation: Linear Quantizers

Novel Architectural Elements

Permanent replacement of GELU layers with ReLU layers during the PTQ process to facilitate lower-bit quantization
Integration of a finite-difference Hessian estimator directly into the block reconstruction loop

Modeling

Base Model: Various Vision Transformers (ViT-S/B, DeiT-S/B/T, Swin-S/B)

Training Method: Block-wise reconstruction with APH loss

Objective Functions:

Purpose: Estimate the curvature (Hessian) of the loss landscape to weight quantization errors.

Formally: H_ii ≈ (∇L(O + ΔO) - ∇L(O - ΔO)) / (2ΔO)

Purpose: Align the new ReLU-based MLP output with the original GELU-based output.

Formally: L_MLP = ||O_ReLU - O_GELU||^2_H + α * ||Clip(O_ReLU) - O_ReLU||^2

Purpose: Minimize quantization error weighted by importance.

Formally: L_rec = (O - O_quant)^T * H_bar * (O - O_quant)
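
The first and third objectives can be illustrated on a toy problem. The sketch below is not the paper's estimator: it assumes a quadratic toy loss with a known diagonal Hessian, perturbs one output coordinate at a time, and does not reproduce the averaging over samples that gives APH its name. The names `grad_toy_loss`, `diag_hessian_central`, and `weighted_reconstruction_loss` are hypothetical.

```python
def grad_toy_loss(output, curvature):
    """Gradient of the toy loss L(o) = 0.5 * sum_i curvature[i] * o[i]**2."""
    return [c * o for c, o in zip(curvature, output)]

def diag_hessian_central(grad_fn, output, eps=1e-6):
    """Estimate H_ii with a central difference of the gradient along coordinate i."""
    diag = []
    for i in range(len(output)):
        plus = list(output)
        minus = list(output)
        plus[i] += eps
        minus[i] -= eps
        diag.append((grad_fn(plus)[i] - grad_fn(minus)[i]) / (2 * eps))
    return diag

def weighted_reconstruction_loss(output, output_quant, diag_h):
    """(O - O_quant)^T H (O - O_quant) for a diagonal importance matrix H."""
    return sum(h * (o - q) ** 2 for h, o, q in zip(diag_h, output, output_quant))

curvature = [10.0, 1.0, 0.1]  # true diagonal Hessian of the toy loss
output = [1.0, -2.0, 0.5]
diag_h = diag_hessian_central(lambda o: grad_toy_loss(o, curvature), output)

# The same absolute error (0.1) placed on a sensitive and on an insensitive output.
loss_sensitive = weighted_reconstruction_loss(output, [1.1, -2.0, 0.5], diag_h)
loss_insensitive = weighted_reconstruction_loss(output, [1.0, -2.0, 0.6], diag_h)

print(diag_h)
print(loss_sensitive, loss_insensitive)
```

The estimate recovers the known curvatures, and the weighted loss penalizes an error on the high-curvature output about a hundred times more than the same error on the low-curvature output, which is the behavior a plain MSE loss cannot express.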

Training Data:

Calibration set: 1024 images randomly sampled from ImageNet training set

Key Hyperparameters:

perturbation_magnitude_epsilon: 1e-6
optimization_iterations: 1000 per block
clamp_loss_weight_alpha: 2
percentile_p: 99.9
batch_size: 32
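
This summary does not say what `percentile_p` controls. Assuming it selects a clipping bound from the activation distribution (an assumption, not something confirmed here), the sketch below shows why a 99.9th percentile gives a much tighter bound than the maximum when a single outlier is present. The helper `percentile_nearest_rank` and the toy data are hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest value with at least p percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Toy data: 1000 small activations plus one large outlier.
activations = [0.01 * i for i in range(1000)] + [40.0]

bound_max = max(activations)
bound_p999 = percentile_nearest_rank(activations, 99.9)

print(f"max-based bound:        {bound_max}")
print(f"99.9th-percentile bound: {bound_p999:.2f}")
```

With the percentile bound, one outlier no longer dictates the quantization range for the remaining values.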

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. PTQ4ViT: APHQ-ViT uses direct perturbation for Hessian estimation instead of Fisher approximation; replaces GELU with ReLU instead of using specialized twin-uniform quantizers.
vs. BRECQ: APHQ-ViT adapts the Hessian metric to be robust for ViTs (removing the gradient-squared approximation) and generalizes it to detection/segmentation tasks via distillation loss.
vs. AdaLog: APHQ-ViT achieves high performance using standard linear quantizers, avoiding specialized hardware requirements like logarithmic quantizers.

Limitations

Replacing GELU with ReLU alters the architecture, which might theoretically affect the model's expressivity (though empirical results are good)
Requires an extra forward/backward pass per sample to compute the perturbation Hessian compared to gradient-squared methods (though complexity class is similar)
Effectiveness primarily demonstrated on Vision Transformers; applicability to LLMs or other architectures not explored

Reproducibility

Code: https://github.com/GoatWu/APHQ-ViT

The code is publicly available at the repository listed above. The paper specifies calibration set size (1024), perturbation magnitude, and block reconstruction iterations.

📊 Experiments & Results

Evaluation Setup

Post-training quantization on ImageNet classification, COCO detection, and ADE20k segmentation

Benchmarks:

ImageNet (ILSVRC-2012) (Image Classification)
COCO (Object Detection & Instance Segmentation)
ADE20k (Semantic Segmentation)

Metrics:

Top-1 Accuracy
mAP (mean Average Precision)

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative results on ImageNet classification showing APHQ-ViT's dominance in ultra-low bit settings (W3A3, W4A4).

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet	Top-1 Accuracy	74.78	78.43	+3.65
ImageNet	Top-1 Accuracy	42.57	72.47	+29.90
ImageNet	Top-1 Accuracy	30.69	63.78	+33.09
COCO (Detection)	mAP	40.3	42.0	+1.7

Ablation studies validating the individual contributions of Average Perturbation Hessian (APH) and MLP Reconstruction (MR).

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet	Top-1 Accuracy	1.34	63.78	+62.44

Experiment Figures

Comparison of activation distributions for Post-GELU (original) vs. Post-ReLU (reconstructed) and their impact on quantization error.

Main Takeaways

Standard Hessian metrics (like in BRECQ/PTQ4ViT) degrade performance for ViTs because the Fisher approximation assumptions do not hold; direct perturbation (APH) is robust.
GELU activation is a primary bottleneck for ViT quantization; replacing it with ReLU via reconstruction (MR) drastically simplifies the task, enabling linear quantizers to work well even at 3-bit.
The method generalizes beyond classification to dense prediction tasks (Detection, Segmentation) where standard Hessian metrics often fail completely.
Improvements are most dramatic in lower bit-widths (3-bit), where sensitivity to outliers and distribution mismatch is highest.

📚 Prerequisite Knowledge

Prerequisites

Post-Training Quantization (PTQ)
Vision Transformer (ViT) architecture
Hessian matrix and Taylor expansion
Knowledge Distillation

Key Terms

PTQ: Post-Training Quantization—converting a pre-trained model to lower precision without full retraining, using only a small calibration dataset

Hessian Matrix: A square matrix of second-order partial derivatives of a scalar-valued function, describing the local curvature of the loss landscape

Fisher Information Matrix (FIM): A matrix approximating the Hessian, often used in quantization metrics but reliant on assumptions that may not hold for ViTs

GELU: Gaussian Error Linear Unit—an activation function used in ViTs that is smooth but produces a distribution difficult to quantize due to its negative tail and heavy positive skew

Block-Reconstruction: A PTQ strategy that optimizes quantization parameters block-by-block to minimize the error between the quantized block's output and the original block's output

AdaRound: Adaptive Rounding—a method to determine whether to round weights up or down to minimize task loss, rather than just rounding to the nearest integer

BRECQ: Block Reconstruction Quantization—a state-of-the-art PTQ method for CNNs that uses Hessian-guided metrics

Jacobian Matrix: The matrix of all first-order partial derivatives of a vector-valued function
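
The idea behind the AdaRound entry above can be seen in a two-weight toy case. The sketch below is an illustration only: it searches all round-up and round-down combinations exhaustively, whereas AdaRound itself learns the rounding through a continuous relaxation. The weights are expressed in units of the quantization step, and all names are hypothetical.

```python
import math
from itertools import product

def output_error(weights, quantized, inputs):
    """Absolute error of the dot product after replacing weights with quantized values."""
    exact = sum(w * x for w, x in zip(weights, inputs))
    approx = sum(q * x for q, x in zip(quantized, inputs))
    return abs(exact - approx)

weights = [0.4, 0.4]  # in units of the quantization step
inputs = [1.0, 1.0]

# Round-to-nearest treats each weight independently.
nearest = [round(w) for w in weights]

# Enumerate every floor/ceil choice and keep the one with the smallest output error.
candidates = product(*[(math.floor(w), math.ceil(w)) for w in weights])
best = min(candidates, key=lambda q: output_error(weights, q, inputs))

print("nearest:", nearest, "error:", output_error(weights, nearest, inputs))
print("best:   ", list(best), "error:", output_error(weights, best, inputs))
```

Rounding both weights to the nearest integer loses the whole output, while rounding exactly one of them up leaves a much smaller error, so the best rounding depends on how the individual errors combine at the output.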