TSPTQ-ViT: Two-Scaled Post-Training Quantization for Vision Transformer

📝 Paper Summary

Model Compression | Post-Training Quantization (PTQ) | Efficient Inference

TSPTQ-ViT enables fully quantized vision transformers by using dual scaling factors to handle the extreme value distributions in Softmax, GeLU, and LayerNorm without retraining.

Core Problem

Standard post-training quantization fails on Vision Transformers because activation functions (Softmax, GeLU) produce non-normal distributions and LayerNorm exhibits high channel-wise variance, causing severe accuracy loss.

Why it matters:

Vision Transformers (ViTs) are computationally heavy (e.g., ViT-L has 64G FLOPs), making them unsuitable for resource-constrained edge devices without compression
Existing quantization methods often skip sensitive layers (Softmax/LayerNorm) to preserve accuracy, preventing efficient fully integer-only inference
Prior fully quantized attempts (like FQ-ViT) introduce high memory overhead or quantization errors for large values

Concrete Example: In LayerNorm inputs, the maximum value can be 40 times larger than the median. A single uniform scaling factor dominated by this outlier causes small values—which contain most of the information—to be quantized to zero, destroying accuracy.
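The effect described in this example can be reproduced on synthetic data. The sketch below is illustrative only: the tensor is made up (a dense region in [-1, 1] plus one outlier at 40 times the median magnitude), the 6-bit width is chosen to make the effect easy to see, and `fake_quantize` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def fake_quantize(x, scale, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

bits = 6
qmax = 2 ** (bits - 1) - 1

# Synthetic stand-in for LayerNorm inputs (not data from the paper):
# a dense region in [-1, 1] whose median magnitude is 0.5 ...
dense = np.linspace(-1.0, 1.0, 1001)
# ... plus one outlier 40 times larger than that median.
outlier = 40.0 * np.median(np.abs(dense))
x = np.append(dense, outlier)

# One scale for everything: the outlier sets the quantization step.
single_scale = np.max(np.abs(x)) / qmax
zero_frac_single = np.mean(fake_quantize(dense, single_scale, bits) == 0.0)

# A scale fitted to the dense region only, for comparison.
dense_scale = np.max(np.abs(dense)) / qmax
zero_frac_dense = np.mean(fake_quantize(dense, dense_scale, bits) == 0.0)

print(f"dense values rounded to zero, single scale: {zero_frac_single:.1%}")
print(f"dense values rounded to zero, dense-only scale: {zero_frac_dense:.1%}")
```

With the outlier-dominated scale, a much larger share of the dense region collapses to zero than with a scale fitted to the dense region alone, which is the gap the two-scale schemes aim to close.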

Key Novelty

Two-Scaled Post-Training Quantization (TSPTQ)

V-2SF (Value-Aware): Splits activations into two regions based on magnitude. Large values (outliers) keep the most significant bits, while small values (dense region) keep the least significant bits, effectively increasing precision.
O-2SF (Outlier-Aware): Detects specific channels in LayerNorm that contain extreme values and assigns them a dedicated scaling factor, preventing them from distorting the quantization range of normal channels.
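A minimal sketch of the value-aware idea, under stated assumptions: values are non-negative (as after Softmax), small values use a fine scale, and large values use that scale multiplied by 2**m, which is one way to read "keep the most significant bits" (the low m bits are dropped). The function names, the region test, and the way the region flag is returned are hypothetical; how the paper stores the flag is not modeled here.

```python
import numpy as np

def v2sf_quantize(x, scale_small, m, bits=8):
    """Two-scale quantization of non-negative values.

    Assumption: the large-value scale is scale_small * 2**m.
    Returns unsigned integer codes and a per-element region flag.
    """
    qmax = 2 ** bits - 1
    scale_large = scale_small * (2 ** m)
    # A value belongs to the large region when the fine scale cannot hold it.
    is_large = x > qmax * scale_small
    scale = np.where(is_large, scale_large, scale_small)
    codes = np.clip(np.round(x / scale), 0, qmax).astype(np.int64)
    return codes, is_large

def v2sf_dequantize(codes, is_large, scale_small, m):
    """Map codes back to real values using the scale of their region."""
    scale = np.where(is_large, scale_small * (2 ** m), scale_small)
    return codes * scale

# Softmax-like values: many tiny probabilities and one dominant entry.
x = np.array([0.001, 0.002, 0.01, 0.05, 0.9])
bits, m = 8, 4
scale_small = 1.0 / ((2 ** bits - 1) * 2 ** m)  # large region then spans [0, 1]

codes, is_large = v2sf_quantize(x, scale_small, m, bits)
err_two_scale = np.abs(v2sf_dequantize(codes, is_large, scale_small, m) - x)

# Baseline: a single scale covering [0, 1].
single_scale = 1.0 / (2 ** bits - 1)
err_single = np.abs(np.round(x / single_scale) * single_scale - x)
```

In this toy case the small probabilities are reconstructed far more accurately with the fine scale, while the dominant entry is quantized exactly as it would be with a single scale.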

Architecture

Overview of the TSPTQ-ViT scaling mechanisms: V-2SF for activations and O-2SF for LayerNorm inputs.

Evaluation Highlights

Achieves <0.5% accuracy drop on ImageNet classification with 8-bit fully quantized ViT models compared to full-precision baselines
Outperforms state-of-the-art FQ-ViT by +2.14% top-1 accuracy on Swin-B (8-bit quantization)
Surpasses PTQ4ViT by ~1% accuracy on DeiT-T and DeiT-S under 6-bit quantization settings

Breakthrough Assessment

7/10

Strong engineering solution for fully quantized ViTs. Effectively addresses specific distribution bottlenecks (Softmax/GeLU/LayerNorm) with low overhead, outperforming prior PTQ baselines (up to +2.14% top-1 over FQ-ViT on Swin-B at 8 bits).

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Vision Transformers for Image Classification

Inputs: Input images (ImageNet dataset)

Outputs: Classification labels (Top-1 Accuracy)

Pipeline Flow

Input Processing
Transformer Encoder Block (LayerNorm -> MSA -> LayerNorm -> MLP)
Output Head

System Modules

LayerNorm Quantizer (O-2SF)

Quantize LayerNorm inputs while handling high channel-wise variance

Model or implementation: Outlier-Aware Two-Scaled Scaling Factors

Softmax Quantizer (V-2SF) (Activation)

Quantize post-Softmax values

Model or implementation: Value-Aware Two-Scaled Scaling Factors

GeLU Quantizer (V-2SF) (Activation)

Quantize post-GeLU values

Model or implementation: Value-Aware Two-Scaled Scaling Factors

Novel Architectural Elements

V-2SF: Dual scaling factor mechanism for activations that switches interpretation of bits (MSB vs LSB) based on value magnitude
O-2SF: Channel-mask-based scaling for LayerNorm that applies different scales to outlier channels vs normal channels
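The channel-mask mechanism behind O-2SF can be sketched as follows. This is an illustration under assumptions, not the paper's procedure: the outlier detection rule (peak magnitude above `outlier_ratio` times the median channel peak), the parameter `outlier_ratio`, and the function name are all hypothetical, and the scales are set by simple max-abs rather than by the Hessian guided search.

```python
import numpy as np

def o2sf_fake_quantize(x, outlier_ratio=8.0, bits=8):
    """Quantize and dequantize x of shape (tokens, channels) with two scales.

    Returns the dequantized tensor and the 1-bit-per-channel outlier mask.
    Assumes outlier_ratio >= 1, so at least one channel stays "normal".
    """
    qmax = 2 ** (bits - 1) - 1
    ch_max = np.max(np.abs(x), axis=0)
    # Hypothetical detection rule: a channel is an outlier when its peak
    # magnitude exceeds outlier_ratio times the median channel peak.
    mask = ch_max > outlier_ratio * np.median(ch_max)
    tiny = 1e-12  # guards against division by zero for all-zero inputs
    scale_normal = max(ch_max[~mask].max(), tiny) / qmax
    scale_outlier = max(ch_max[mask].max(), tiny) / qmax if mask.any() else scale_normal
    # One scale per channel, selected by the mask and broadcast over tokens.
    scale = np.where(mask, scale_outlier, scale_normal)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, mask

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
x[:, 3] *= 40.0  # synthetic outlier channel, not data from the paper
x_hat, mask = o2sf_fake_quantize(x)
```

Only two scaling factors are stored regardless of the channel count; the per-channel cost is the single mask bit.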

Modeling

Base Model: Pre-trained ViT, DeiT, and Swin Transformer variants (from timm library)

Training Method: Hessian guided scaling factor search (calibration only, no fine-tuning)

Objective Functions:

Purpose: Minimize quantization error impact on task loss.

Formally: Hessian guided metric (from PTQ4ViT) to search for optimal scaling factors.
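For orientation, the Hessian guided metric in PTQ4ViT is usually written as a gradient-weighted squared error on a layer's output. The notation below is assumed for this note and does not appear in the summary: O^l is the full-precision output of layer l, \hat{O}^l its quantized counterpart, \Delta the scaling factors being searched, and L the task loss.

```latex
\min_{\Delta}\; \mathbb{E}\!\left[
  (\hat{O}^{l} - O^{l})^{\top}
  \operatorname{diag}\!\left(
    \left(\frac{\partial L}{\partial O^{l}_{1}}\right)^{2},
    \ldots,
    \left(\frac{\partial L}{\partial O^{l}_{|O^{l}|}}\right)^{2}
  \right)
  (\hat{O}^{l} - O^{l})
\right]
```

Output elements with larger loss gradients therefore contribute more to the error, so the search favors scaling factors that preserve the outputs the task loss is most sensitive to.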

Key Hyperparameters:

calibration_batch_size: 32
search_rounds: 3
candidate_size_N: 100
candidate_size_N_prime: 6
search_space_max: 1.2 * max(abs(tensor))
m_softmax: 4 (shift amount for scaling)
m_gelu: 3 (shift amount for scaling)
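To show how the search-space settings above fit together, here is a simplified single-tensor search. It is a sketch under assumptions: candidates are spaced evenly up to 1.2 * max(abs(tensor)), and plain mean squared error stands in for the Hessian guided metric. The function name is hypothetical, and the alternating search rounds and `candidate_size_N_prime` are not modeled.

```python
import numpy as np

def search_scale(x, bits=8, n_candidates=100, space_max_ratio=1.2):
    """Pick the scale with the lowest quantization MSE among the candidates.

    Assumes x contains at least one non-zero value.
    """
    qmax = 2 ** (bits - 1) - 1
    upper = space_max_ratio * np.max(np.abs(x))
    # Candidate clipping ranges evenly spaced in (0, upper]; zero is skipped.
    ranges = upper * np.arange(1, n_candidates + 1) / n_candidates
    best_scale, best_err = None, np.inf
    for r in ranges:
        scale = r / qmax
        x_hat = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        err = np.mean((x_hat - x) ** 2)  # stand-in for the Hessian guided metric
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

rng = np.random.default_rng(0)
x = rng.normal(size=4096)  # synthetic tensor, not data from the paper
scale, err = search_scale(x, bits=4)
```

At low bit-widths the selected range is typically tighter than the tensor's maximum, trading a little clipping error for a finer step in the dense region.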

Compute: Not reported in the paper

Comparison to Prior Work

vs. PTQ4ViT: TSPTQ achieves fully quantized inference (integer Softmax/LayerNorm) whereas PTQ4ViT keeps them in FP; TSPTQ uses flexible scaling alignment vs fixed powers-of-2
vs. FQ-ViT: TSPTQ uses O-2SF (1-bit overhead/channel) vs FQ-ViT's channel-wise scaling (2-bit overhead/channel); TSPTQ avoids Log2 quantization error for large values

Limitations

Requires a Hessian guided search, which adds computational cost during calibration compared to simple Min-Max calibration
Memory overhead of 1 bit per channel for LayerNorm (though less than FQ-ViT's 2 bits)
Evaluated only on ImageNet classification; transferability to detection/segmentation not tested

Reproducibility

No code release is indicated. The method relies on pre-trained models from the 'timm' library. Implementation details for the Hessian guided search (rounds, candidate sizes) and shift parameters (m values) are explicitly provided.

📊 Experiments & Results

Evaluation Setup

Image Classification on ImageNet (ILSVRC 2012)

Benchmarks:

ImageNet (Image Classification)

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparisons of 8-bit fully quantized models showing TSPTQ-ViT recovering near-FP accuracy and outperforming FQ-ViT.
ImageNet	Top-1 Accuracy	81.39	81.20	-0.19
ImageNet	Top-1 Accuracy	82.97	85.11	+2.14
ImageNet	Top-1 Accuracy	79.85	79.56	-0.29
6-bit quantization comparisons demonstrating robustness at lower bit-widths.
ImageNet	Top-1 Accuracy	78.63	79.34	+0.71
ImageNet	Top-1 Accuracy	76.28	77.27	+0.99
Ablation study showing the impact of the O-2SF component.
ImageNet	Top-1 Accuracy	70.74	81.20	+10.46

Experiment Figures

Post-quantization distribution histograms comparing PTQ4ViT and Proposed V-2SF against the original distribution.

Main Takeaways

Proposed method enables near-lossless (<0.5% drop) 8-bit fully quantized ViTs, significantly closing the gap with full precision models.
Outlier-Aware scaling (O-2SF) is critical for ViT-S, recovering over 10% accuracy compared to using Value-Aware scaling (V-2SF) alone.
Flexible scaling design in V-2SF avoids redundant integer bins found in PTQ4ViT's fixed scaling, improving granularity for small values.
Method outperforms FQ-ViT (SOTA fully quantized baseline) across all tested models (ViT, DeiT, Swin), with margins up to ~2.1% on Swin-B.

📚 Prerequisite Knowledge

Prerequisites

Principles of quantization (scaling factors, zero-points)
Vision Transformer (ViT) architecture components (Self-Attention, MLP, LayerNorm)
Activation functions (Softmax, GeLU)

Key Terms

PTQ: Post-Training Quantization—converting a pre-trained model to lower bit-width integers without full retraining, using only a small calibration dataset

ViT: Vision Transformer—a model architecture applying Transformer self-attention mechanisms directly to sequences of image patches

LayerNorm: Layer Normalization—a technique to normalize neuron activities, known in ViTs for having high variance across channels

GeLU: Gaussian Error Linear Unit—an activation function used in ViTs that has an asymmetric distribution (positive range wider than negative)

Hessian guided metric: A method to determine optimal quantization parameters by considering the curvature of the loss function (using Hessian info) to minimize impact on final loss

Bit sparsity: The observation that in non-normal distributions, certain bits (MSB or LSB) are often unused or redundant, allowing for compression or specialized scaling

Fully quantized: A model where all operations, including complex non-linearities like Softmax and Normalization, are executed using integer arithmetic