DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers

📝 Paper Summary

Model Compression, Post-Training Quantization (PTQ)

DopQ-ViT improves low-bit vision transformer quantization by using a tangent-based function to preserve extreme activation values and a statistical method to identify optimal scaling factors for outlier channels.

Core Problem

Post-training quantization for Vision Transformers degrades significantly at low bit-widths (e.g., 3-bit) due to imbalanced activation distributions and outliers.

Why it matters:

Vision Transformers have high computational costs, limiting deployment on edge devices with constrained memory and power.
Existing quantizers concentrate precision on values near 0 and neglect the values near 1 in Softmax outputs, even though the model depends heavily on them.
Standard reparameterization techniques for LayerNorm fail because they use mean scaling factors that are skewed by abnormal outlier channels.

Concrete Example: In a ViT-B model, removing just the top 1% of activation values near 1 causes accuracy to drop to 2.20%, yet standard Log2 quantizers allocate almost no precision to this region. Similarly, on ViT-S, replacing channel-wise scaling factors with a simple mean lowers accuracy by about 15 percentage points (55.70% to 40.88%).
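To make the precision gap near 1 concrete, the sketch below lists the values a 3-bit Log2 quantizer can represent on (0, 1] and counts how many lie above 0.5. It assumes the common Log2 form in which code k represents 2^-k; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def log2_levels(bits):
    """Values representable by a Log2 quantizer on (0, 1]: 2^-k for k = 0 .. 2^bits - 1."""
    k = np.arange(2 ** bits)
    return 2.0 ** (-k)

def uniform_levels(bits):
    """Values representable by a uniform quantizer on [0, 1]."""
    return np.linspace(0.0, 1.0, 2 ** bits)

levels_log2 = log2_levels(3)        # 1, 1/2, 1/4, ..., 1/128
levels_uniform = uniform_levels(3)  # 0, 1/7, 2/7, ..., 1

# Count how many representable values sit in the upper half of the range.
upper_log2 = int(np.sum(levels_log2 > 0.5))
upper_uniform = int(np.sum(levels_uniform > 0.5))
print(upper_log2, upper_uniform)
```

Under these assumptions only one Log2 level (1.0 itself) lies above 0.5, against four for the uniform grid, so any Softmax output between 0.5 and 1 is rounded to either 0.5 or 1.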

Key Novelty

DopQ-ViT (Distribution-friendly and Outlier-aware Post-training Quantization)

Introduces Tan Quantizer (TanQ), which uses a tangent function to map quantization intervals. This allocates high precision to both the dense values near 0 and the sparse but critical values near 1.
Proposes MAD-guided Optimal Scaling Factor (MOSF), a search-free method that uses Mean Absolute Deviation to select a scaling factor robust to outliers in LayerNorm activations.

Architecture

Comparison of activation distributions and quantization functions. (a) Histogram of post-Softmax activations. (b) Curves for Log2, Uniform, and Tan Quantizers. (c) Channel-wise variations in post-LayerNorm activations.

Evaluation Highlights

Outperforms state-of-the-art PTQ methods (like RepQ-ViT and I&S-ViT) on ImageNet classification and COCO detection tasks.
Achieves significant accuracy recovery under aggressive low-bit settings (e.g., W3A3), where prior methods suffer major degradation.
MOSF prevents the ~15% accuracy drop observed when transitioning from channel-wise to layer-wise quantization in ViT-S models.

Breakthrough Assessment

7/10

Solid incremental improvement for low-bit ViT quantization. Identifies specific failure modes of previous log-quantizers (ignoring values near 1) and provides a mathematically grounded, efficient fix.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Vision Transformers, specifically targeting low-bit settings (e.g., 3-bit and 4-bit) for weights and activations.

Inputs: Pre-trained Vision Transformer model (e.g., ViT, DeiT, Swin) and a small calibration dataset.

Outputs: Quantized model with integer weights and activations.
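For orientation, the following is a minimal sketch of a generic asymmetric uniform quantizer at 3 bits, the kind of baseline that the low-bit setting stresses. It is a common textbook formulation, not necessarily the exact one used in the paper, and the names are hypothetical.

```python
import numpy as np

def uniform_quantize(x, bits):
    """Asymmetric uniform quantization: returns integer codes, scale, and zero point."""
    levels = 2 ** bits - 1
    scale = (x.max() - x.min()) / levels
    zero_point = np.round(-x.min() / scale)
    codes = np.clip(np.round(x / scale) + zero_point, 0, levels).astype(np.int64)
    return codes, scale, zero_point

def uniform_dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return scale * (codes - zero_point)

rng = np.random.default_rng(0)
activations = rng.normal(size=1000)  # stand-in for calibration activations

codes, scale, zero_point = uniform_quantize(activations, bits=3)
reconstructed = uniform_dequantize(codes, scale, zero_point)
max_error = np.abs(activations - reconstructed).max()
```

With 3 bits there are only 8 codes, so a single step covers one seventh of the full range; a few extreme values therefore inflate the step size for every other value.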

Pipeline Flow

Input Patches
Tan Quantizer (for Post-Softmax Activations)
MOSF Reparameterization (for Post-LayerNorm Activations)
Standard Quantization (for other layers)
Output

System Modules

Tan Quantizer (TanQ) (Activation Quantization)

Quantize post-Softmax activations using a tangent-based mapping to preserve values near 0 and 1

Model or implementation: Non-linear mapping function: tan(a(x-b))
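The sketch below illustrates how a tangent-shaped mapping can place quantization levels densely at both ends of [0, 1]. It is an assumption-laden illustration, not the paper's implementation: the normalization to integer codes, the inverse via arctan, and the values a = 2.8 and b = 0.5 are all chosen here for demonstration, whereas the paper tunes a and b on calibration data.

```python
import numpy as np

def tanq_quantize(x, bits, a, b):
    """Map x in [0, 1] to integer codes through tan(a * (x - b)) (illustrative only)."""
    lo, hi = np.tan(a * (0.0 - b)), np.tan(a * (1.0 - b))
    levels = 2 ** bits - 1
    t = np.tan(a * (x - b))
    codes = np.round((t - lo) / (hi - lo) * levels)
    return np.clip(codes, 0, levels).astype(np.int64)

def tanq_dequantize(codes, bits, a, b):
    """Invert the mapping with arctan to recover approximate values in [0, 1]."""
    lo, hi = np.tan(a * (0.0 - b)), np.tan(a * (1.0 - b))
    levels = 2 ** bits - 1
    t = codes / levels * (hi - lo) + lo
    return np.arctan(t) / a + b

# Assumed hyperparameters; a * (x - b) must stay inside (-pi/2, pi/2) for x in [0, 1].
bits, a, b = 3, 2.8, 0.5
codes = np.arange(2 ** bits)
grid = tanq_dequantize(codes, bits, a, b)  # the 8 representable values
print(np.round(grid, 3))
```

With these assumed settings, three of the eight levels fall below 0.1 and three fall above 0.9, while the middle of the range is covered coarsely. A Log2 grid at the same bit-width has a single level above 0.5.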

MOSF Reparameterization (Activation Quantization)

Determine optimal layer-wise scaling factor to replace channel-wise factors

Model or implementation: Statistical selection of the scaling factor that minimizes the MAD metric
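One plausible reading of MAD-guided selection is sketched below: among the channel-wise scaling factors, pick the one whose mean absolute deviation to all the others is smallest. This is an assumption made for illustration; the summary does not give the paper's exact procedure, and the function name and the sample scales are hypothetical.

```python
import numpy as np

def mad_guided_scale(channel_scales):
    """Return the channel scale with the smallest mean absolute deviation to all scales."""
    s = np.asarray(channel_scales, dtype=np.float64)
    # mad[i] = mean over j of |s[j] - s[i]|, evaluated for every candidate s[i].
    mad = np.abs(s[None, :] - s[:, None]).mean(axis=1)
    return s[np.argmin(mad)]

# Hypothetical channel-wise scales with one outlier channel.
channel_scales = np.array([0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 1.0, 25.0])

mean_scale = channel_scales.mean()                  # pulled upward by the outlier
mosf_like_scale = mad_guided_scale(channel_scales)  # stays with the typical channels
print(mean_scale, mosf_like_scale)
```

Here the mean is 4.0, four times larger than any typical channel, while the MAD-minimizing choice is 1.0. This matches the general fact that the point minimizing mean absolute deviation to a set of numbers is their median, which is insensitive to a few extreme values.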

Novel Architectural Elements

TanQ module replacing standard Log2 or Uniform quantizers for Softmax outputs
MOSF module modifying LayerNorm parameters before quantization
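The reason a layer-wise scaling factor can stand in for channel-wise ones is that the per-channel ratio can be folded into the LayerNorm parameters and the following linear layer. The sketch below checks this numerically for a scale-only simplification of RepQ-ViT-style reparameterization (zero-point shifts are omitted). The tensor shapes, the range-based channel scales, and the use of the median as the layer-wise scale are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """LayerNorm over the last axis with affine parameters gamma and beta."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
tokens, dim, out_dim = 4, 8, 5
x = rng.normal(size=(tokens, dim))
gamma = rng.uniform(0.5, 1.5, size=dim)
gamma[3] = 20.0                      # one outlier channel
beta = rng.normal(size=dim)
W = rng.normal(size=(dim, out_dim))  # next linear layer: y = h @ W + bias
bias = rng.normal(size=out_dim)

# Hypothetical range-based per-channel scales and a single layer-wise target.
channel_scales = np.abs(layer_norm(x, gamma, beta)).max(axis=0)
layer_scale = np.median(channel_scales)  # stand-in for the factor MOSF would select
r = channel_scales / layer_scale

# Fold the ratio r into LayerNorm, and compensate in the next layer's weights.
gamma_r, beta_r = gamma / r, beta / r
W_r = r[:, None] * W

y_ref = layer_norm(x, gamma, beta) @ W + bias
y_rep = layer_norm(x, gamma_r, beta_r) @ W_r + bias
ranges_after = np.abs(layer_norm(x, gamma_r, beta_r)).max(axis=0)
```

The two outputs agree up to floating-point error, and after folding every channel has the same range, so a single scale fits all channels. The choice of the layer-wise scale still matters in practice, because it determines how much of the per-channel ratio is pushed into the weights of the next layer, which are themselves quantized.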

Modeling

Base Model: ViT (ViT-S, ViT-B), DeiT (DeiT-T, DeiT-S), Swin Transformer (Swin-S, Swin-B)

Training Method: Post-Training Quantization (Calibration only)

Objective Functions:

Purpose: Select optimal scaling factor for LayerNorm reparameterization.

Formally: Minimize MAD error metric on calibration data.

Compute: Not reported in the paper

Comparison to Prior Work

vs. FQ-ViT: TanQ preserves values near 1, whereas Log2 focuses only on values near 0.
vs. RepQ-ViT: MOSF selects a robust scaling factor rather than a simple mean, preventing performance drops from outlier channels.
vs. I&S-ViT: TanQ uses a monotonic smooth function (tangent) rather than shifting and combining uniform/log logic.
vs. APQ-ViT [not cited in paper]: APQ-ViT uses Matthew-effect Preserving Quantization; DopQ-ViT similarly targets distribution preservation but via a tangent function.

Limitations

TanQ introduces non-linear operations (tangent/arctan) which may require lookup tables (LUTs) on some hardware, though the paper claims efficiency.
Focus is primarily on classification and detection; impact on dense prediction tasks like segmentation is less explored.
Depends on calibration data to tune hyperparameters 'a' and 'b' for TanQ.

Reproducibility

No code URL provided in the paper text. Method relies on standard ViT backbones and calibration datasets (ImageNet-1k, COCO).

📊 Experiments & Results

Evaluation Setup

Image Classification on ImageNet-1k and Object Detection on COCO.

Benchmarks:

ImageNet-1k (Image Classification)
COCO (Object Detection)

Metrics:

Top-1 Accuracy (%)
mAP (mean Average Precision)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Ablation study showing the impact of removing activation values near 1 vs near 0.
ImageNet-1k (ViT-B)	Top-1 Accuracy	70.73	2.20	-68.53
Comparison of different quantization strategies for Post-LayerNorm activations.
ImageNet-1k (ViT-S)	Top-1 Accuracy	55.70	40.88	-14.82

Main Takeaways

Values near 1 in post-Softmax activations are critical; removing them destroys model performance (e.g., ViT-B accuracy drops to 2.2%).
Standard reparameterization (using mean scaling) fails for post-LayerNorm activations because outlier channels skew the mean, costing about 15 points of accuracy on ViT-S relative to channel-wise quantization.
DopQ-ViT effectively bridges the gap between channel-wise and layer-wise quantization performance without the hardware cost of channel-wise operations.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT) architecture (MSA, MLP, LayerNorm)
Quantization fundamentals (uniform vs. log quantization, scaling factors)
Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)

Key Terms

ViT: Vision Transformer—a model architecture applying the Transformer mechanism directly to sequences of image patches

PTQ: Post-Training Quantization—quantizing a pre-trained model using only a small calibration set without full retraining

TanQ: Tan Quantizer—the proposed quantization function based on the tangent function to fit power-law distributions

MOSF: MAD-guided Optimal Scaling Factor—the proposed method to select scaling factors by minimizing Mean Absolute Deviation

MSA: Multi-Head Self-Attention—mechanism in Transformers that captures correlations between different input patches

LayerNorm: Layer Normalization—a technique to normalize neuron activities within a layer to stabilize training

MAD: Mean Absolute Deviation—a robust measure of variability used here to detect and handle outliers in scaling factors

RepQ-ViT: A prior PTQ method for ViTs using scale reparameterization and log-sqrt2 quantization

Log2 Quantizer: A quantization scheme that allocates intervals logarithmically, giving more precision to small values

W4A4: 4-bit quantization for both Weights (W) and Activations (A)

W3A3: 3-bit quantization for both Weights (W) and Activations (A)