AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

📝 Paper Summary

Post-Training Quantization (PTQ) | Vision Transformers (ViT) | Model Compression

AdaLog improves low-bit quantization of Vision Transformers by using an adaptive logarithmic base for power-law activations and a fast progressive search to optimize parameters.

Core Problem

Existing PTQ methods for ViT activations use fixed logarithmic bases (like base-2) that cannot adapt to varying power-law distributions across layers, leading to high error at low bit-widths.

Why it matters:

Vision Transformers are computationally expensive and slow on edge devices, necessitating compression.
Current fixed-base log quantizers suffer from either large rounding errors for large values or truncation errors for small values when bit-width is low (e.g., 4-bit).
Standard grid search for quantization parameters is either too coarse (missing optima) or too slow (brute force).

Concrete Example: The Log2 quantizer incurs substantial rounding errors for large activations at 4 bits, while the Log-Sqrt(2) quantizer suffers from truncation errors for small activations at 3 bits. Additionally, Log-Sqrt(2) requires floating-point multiplication during inference, which is not hardware-friendly.
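To make the rounding and truncation effects concrete, the short Python sketch below lists the representable levels of a fixed-base log quantizer. It assumes activations normalized to (0, 1] with a scale of 1 and levels of the form base**(-q); the helper name log_levels is hypothetical and not taken from the paper.

```python
import math

def log_levels(base, bits):
    """Representable magnitudes base**(-q) for codes q = 0 .. 2**bits - 1."""
    return [base ** (-q) for q in range(2 ** bits)]

log2_4bit = log_levels(2.0, 4)
logsqrt2_3bit = log_levels(math.sqrt(2.0), 3)

# Log2: the two largest levels are 1.0 and 0.5, so a value such as 0.7
# has to round to one of them (coarse resolution for large activations).
print(log2_4bit[:3])  # [1.0, 0.5, 0.25]

# Log-Sqrt(2) at 3 bits: the smallest level is sqrt(2)**-7 (about 0.088),
# so every smaller activation is clamped to it (truncation).
print(min(logsqrt2_3bit))
```

A smaller base gives finer steps near 1 but covers less dynamic range for a given bit-width, which is the trade-off an adaptive base is meant to balance.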

Key Novelty

Adaptive Logarithm (AdaLog) Quantizer with Fast Progressive Combining Search (FPCS)

Proposes a non-uniform quantizer that optimizes the logarithmic base per layer instead of using a fixed base (like 2), better fitting the power-law distribution of post-Softmax/GELU activations.
Implements a hardware-friendly de-quantization mechanism using look-up tables and integer-only arithmetic, avoiding floating-point operations despite the arbitrary base.
Introduces a search strategy (FPCS) that progressively refines the hyperparameter search space (coarse-to-fine) to find optimal quantization parameters efficiently.

Architecture

Comparison of de-quantization flows for standard Log2, Log-Sqrt(2), and the proposed AdaLog.

Evaluation Highlights

Reported to significantly outperform state-of-the-art PTQ methods on ImageNet classification and on COCO object detection and instance segmentation.
Achieves higher accuracy at low bit-widths (e.g., 4-bit and 3-bit) compared to fixed-base log quantizers.
FPCS locates optimal hyperparameters more precisely at linear complexity, whereas brute-force search has quadratic cost and alternating search can get stuck in local optima.

Breakthrough Assessment

7/10

Strong practical improvement for low-bit ViT quantization, addressing specific distribution mismatches in prior work. The hardware-friendly implementation of arbitrary bases is a clever engineering contribution.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Vision Transformers, specifically targeting activations with power-law distributions (post-Softmax, post-GELU).

Inputs: Full-precision pre-trained Vision Transformer model and a small calibration dataset.

Outputs: Quantized model with integer weights and activations suitable for efficient inference.

Pipeline Flow

Calibration (collect activation statistics)
Bias Reparameterization (shift post-GELU activations to be non-negative)
FPCS Search (find optimal log base and scaling factors)
Quantization (map values to integers using adaptive log base)
Inference (use look-up tables for efficient de-quantization/computation)

System Modules

AdaLog Quantizer

Quantize activations using a searchable, per-layer logarithmic base 'b'

Model or implementation: Non-uniform quantizer defined by base b and scale s
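A minimal sketch of a log quantizer with an arbitrary base b and scale s, assuming the common form x ≈ s · b^(-q) with q clamped to the code range. The function names and the handling of non-positive inputs are illustrative assumptions, not the paper's exact definition.

```python
import math

def log_quantize(x, base, scale, bits):
    """Map a positive activation to an integer code q with x ~ scale * base**(-q)."""
    qmax = 2 ** bits - 1
    if x <= 0.0:
        return qmax  # assumption: non-positive inputs map to the smallest level
    q = round(-math.log(x / scale, base))
    return max(0, min(qmax, q))

def log_dequantize(q, base, scale):
    """Recover the approximate real value for code q."""
    return scale * base ** (-q)

x = 0.3
for base in (2.0, 1.5):
    q = log_quantize(x, base, scale=1.0, bits=4)
    x_hat = log_dequantize(q, base, scale=1.0)
    print(f"base={base}: code={q}, reconstructed={x_hat:.4f}, error={abs(x_hat - x):.4f}")
```

For this single value, base 1.5 reconstructs 0.3 more closely than base 2. Which base is best for a whole layer depends on its activation distribution, which is why the base is selected per layer.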

Bias Reparameterizer

Shift post-GELU activations to positive range to allow log quantization

Model or implementation: Linear shift
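The linear shift can be illustrated as follows, assuming a constant offset is added to the post-GELU activations and compensated in the bias of the next linear layer so that the layer output is unchanged. The offset 0.17 (just above the magnitude of GELU's minimum, about 0.16997) and the helper names are assumptions made for this sketch.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

SHIFT = 0.17  # assumption: just above |min GELU| (about 0.16997)

def linear(weights, bias, a):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, a)) + b for row, b in zip(weights, bias)]

def fold_shift_into_bias(weights, bias, shift):
    """Bias b' such that W(a + shift) + b' equals W a + b."""
    return [b - shift * sum(row) for row, b in zip(weights, bias)]

pre_act = [-3.0, -0.75, 0.0, 2.0]
act = [gelu(v) for v in pre_act]
shifted = [a + SHIFT for a in act]  # non-negative, so a log quantizer applies

W = [[0.5, -1.0, 2.0, 0.25], [1.5, 0.0, -0.5, 1.0]]
b = [0.1, -0.2]
reference = linear(W, b, act)
reparam = linear(W, fold_shift_into_bias(W, b, SHIFT), shifted)
print(reference, reparam)  # equal up to floating-point rounding
```

The identity W(a + c) + (b - Wc) = Wa + b is what makes the shift free at inference time: only the stored bias changes.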

FPCS Searcher

Efficiently find optimal hyperparameters (base, scale) for each layer

Model or implementation: Iterative coarse-to-fine search algorithm
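The coarse-to-fine idea can be sketched generically as below. This is not the paper's FPCS algorithm: it is a plain progressive grid refinement over (base, scale) that keeps the best few candidates each round and searches smaller boxes around them. The search ranges, grid size, number of rounds, and synthetic data are all assumptions.

```python
import math

def quant_mse(xs, base, scale, bits):
    """Mean squared error of a log quantizer with the given base and scale."""
    qmax = 2 ** bits - 1
    total = 0.0
    for x in xs:
        q = max(0, min(qmax, round(-math.log(x / scale, base))))
        total += (scale * base ** (-q) - x) ** 2
    return total / len(xs)

def coarse_to_fine_search(xs, bits, rounds=3, grid=5, keep=2):
    """Return (mse, base, scale) found by repeatedly refining a 2-D grid."""
    top = max(xs)
    # Initial search box: assumed ranges chosen for illustration only.
    boxes = [((1.2, 2.0), (0.5 * top, 1.5 * top))]
    best = None
    for _ in range(rounds):
        scored = []
        for (b_lo, b_hi), (s_lo, s_hi) in boxes:
            db = (b_hi - b_lo) / (grid - 1)
            ds = (s_hi - s_lo) / (grid - 1)
            for i in range(grid):
                for j in range(grid):
                    b, s = b_lo + i * db, s_lo + j * ds
                    scored.append((quant_mse(xs, b, s, bits), b, s, db, ds))
        scored.sort(key=lambda t: t[0])
        if best is None or scored[0][0] < best[0]:
            best = scored[0][:3]
        # Next round: smaller boxes centred on the best few candidates.
        boxes = [((max(1.01, b - db), b + db), (max(1e-6, s - ds), s + ds))
                 for _, b, s, db, ds in scored[:keep]]
    return best

# Synthetic, skewed positive samples standing in for calibration activations.
samples = [(i / 200) ** 3 for i in range(1, 201)]
mse, base, scale = coarse_to_fine_search(samples, bits=4)
print(f"base={base:.3f} scale={scale:.3f} mse={mse:.2e}")
```

Keeping more than one candidate per round reduces the chance of committing early to a poor region, at the cost of more evaluations per round.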

Novel Architectural Elements

Look-up table based de-quantization mechanism for arbitrary logarithmic bases that avoids floating point operations.
Application of bias reparameterization specifically to enable log-quantization of GELU layers (handling negative tails).
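One way to picture table-based de-quantization is the sketch below, which assumes the simplest possible layout: one fixed-point integer per code, built offline, so that inference needs only an integer multiply and a shift. The paper's actual table layout may differ. FRAC_BITS, build_lut, and int_scale are hypothetical names.

```python
FRAC_BITS = 16  # assumed fixed-point precision of the table entries

def build_lut(base, bits, frac_bits=FRAC_BITS):
    """Offline step: store round(base**(-q) * 2**frac_bits) for every code q."""
    return [round(base ** (-q) * (1 << frac_bits)) for q in range(2 ** bits)]

def int_scale(value, code, lut, frac_bits=FRAC_BITS):
    """Inference step: approximate value * base**(-code) using integers only."""
    return (value * lut[code]) >> frac_bits

lut = build_lut(base=1.5, bits=4)
approx = int_scale(1000, 3, lut)   # integer approximation of 1000 * 1.5**-3
exact = 1000 * 1.5 ** -3
print(approx, round(exact, 3))
```

The table has 2**bits entries per quantizer, so its memory cost stays small at low bit-widths.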

Modeling

Base Model: Various Vision Transformers (ViT, DeiT, Swin Transformer)

Training Method: Post-Training Quantization (calibration only)

Objective Functions:

Purpose: Minimize reconstruction error between quantized and full-precision activations.

Formally: MSE(Quantized(X), X)

Adaptation: Not applicable (no training, just calibration)

Training Data:

Small calibration set (typically 32-128 images) sampled from training data (e.g., ImageNet)

Key Hyperparameters:

search_steps: Not explicitly reported in the paper
calibration_size: Not explicitly reported in the paper (PTQ methods typically use 32-128 images)

Compute: Not reported in the paper

Comparison to Prior Work

vs. FQ-ViT: AdaLog uses adaptive base instead of fixed base-2, reducing quantization error.
vs. RepQ-ViT: AdaLog supports arbitrary bases while remaining hardware friendly (RepQ uses floating point for Log-Sqrt(2)); AdaLog handles GELU via bias reparameterization.
vs. PTQ4ViT: FPCS search is more efficient and finer-grained than the grid/alternating search used in PTQ4ViT.

Limitations

Requires look-up tables (LUTs) for de-quantization, which adds a small, usually negligible, memory overhead.
Focuses primarily on activation quantization; weight quantization is treated with standard methods.
Effectiveness depends on the 'power-law-like' assumption; if distributions shift significantly from this, benefits might diminish.

Reproducibility

Code: https://github.com/GoatWu/AdaLog

The paper describes both algorithms (AdaLog and FPCS) mathematically. The calibration set size is not explicitly stated in the summarized text, although small calibration sets are standard for PTQ.

📊 Experiments & Results

Evaluation Setup

Quantization of pre-trained ViT models on standard vision tasks.

Benchmarks:

ImageNet (Image Classification)
COCO (Object Detection and Instance Segmentation)

Metrics:

Top-1 Accuracy
mAP (mean Average Precision)
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper claims significant improvements over state-of-the-art PTQ methods, "especially in low-bit quantization", and states that extensive experimental results demonstrate the method's effectiveness. The excerpt summarized here does not include the result tables, so no numeric values (baseline, this paper, Δ) can be reported for any benchmark.

Experiment Figures

Histograms of post-Softmax activations with quantization levels of Log2 and Log-Sqrt(2) overlaid.

Distribution of post-GELU activations across different layers.

Main Takeaways

AdaLog successfully addresses the limitations of fixed-base log quantizers for power-law distributions in ViTs.
The method is effective across different ViT architectures (ViT, DeiT, Swin) and tasks (Classification, Detection, Segmentation).
The FPCS search strategy allows for finer hyperparameter tuning without the computational cost of brute-force search.
Hardware-friendly design is maintained despite using arbitrary logarithmic bases.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformer (ViT) architecture (Self-Attention, GELU, Softmax)
Model Quantization basics (Uniform vs. Non-uniform, Scaling factors)
Logarithmic number systems

Key Terms

PTQ: Post-Training Quantization—compressing a model after training using only a small calibration set, without full retraining.

AdaLog: Adaptive Logarithm Quantizer—the proposed method that learns an optimal base for logarithmic quantization.

FPCS: Fast Progressive Combining Search—a search strategy that iteratively refines the search grid for quantization parameters.

Bias Reparameterization: A technique to absorb quantization errors or shift distributions (like making GELU outputs non-negative) by adjusting bias terms.

Power-law distribution: A distribution where frequency decreases as a power of the value; common in Softmax/GELU outputs, having 'long tails'.

Softmax: Activation function that converts logits to probabilities; in ViTs, these outputs often have a power-law distribution.

GELU: Gaussian Error Linear Unit—activation function used in ViTs; outputs are mostly non-negative but have a small negative tail.

De-quantization: The process of mapping integer indices back to approximate real values (or performing operations that simulate this).