PTQ4SAM: Post-Training Quantization for Segment Anything

📝 Paper Summary

Model Compression, Post-Training Quantization (PTQ), Vision Transformers

PTQ4SAM is a post-training quantization framework for the Segment Anything Model that resolves activation outliers by mathematically transforming bimodal distributions into normal ones and using adaptive granularity for attention scores.

Core Problem

Directly quantizing SAM (Segment Anything Model) fails because of two unique activation issues: bimodal distributions in key-linear outputs and drastically different post-Softmax distributions across attention types.

Why it matters:

SAM is computationally expensive, hindering deployment on edge devices; standard quantization causes severe accuracy drops due to these unique distribution shifts.
Existing Vision Transformer quantization methods do not account for the specific bimodal outliers found in SAM's key projections.
Diverse attention mechanisms in SAM (self-attention vs. two-way cross-attention) require specialized handling rather than a one-size-fits-all quantization approach.

Concrete Example: In SAM's key-linear layer, activations cluster around two distinct peaks (e.g., -8 and +8) with a void in between. A uniform quantizer spends many of its levels on the empty gap, leaving few levels for the values that actually occur and causing large errors. Additionally, image-to-token attention scores are mostly >0.01, while token-to-image scores are mostly near zero, yet prior methods quantize them identically.
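
The cost of the empty gap can be illustrated with a small synthetic experiment. The sketch below is a toy under stated assumptions, not the paper's procedure: it draws two Gaussian clusters at -8 and +8 (the illustrative values from the example above, not measurements from SAM), applies a plain 4-bit min-max uniform quantizer, and compares the error with the same data after the negative cluster is mirrored onto the positive one. The function name `uniform_quantize` and the use of `np.abs` as a stand-in for a per-channel sign flip are hypothetical simplifications.

```python
import numpy as np

def uniform_quantize(x, n_bits):
    """Asymmetric min-max uniform fake-quantization (quantize, then dequantize)."""
    levels = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels
    q = np.clip(np.round((x - lo) / scale), 0, levels)
    return q * scale + lo

rng = np.random.default_rng(0)
# Toy bimodal activations: half near -8, half near +8 (illustrative values only).
bimodal = np.concatenate([rng.normal(-8.0, 0.5, 5000), rng.normal(8.0, 0.5, 5000)])
# Every sample of the negative cluster is negative here, so np.abs acts as
# a sign flip of that cluster and leaves a single peak near +8.
merged = np.abs(bimodal)

mse_bimodal = np.mean((bimodal - uniform_quantize(bimodal, 4)) ** 2)
mse_merged = np.mean((merged - uniform_quantize(merged, 4)) ** 2)
print(f"4-bit MSE, bimodal: {mse_bimodal:.4f}")
print(f"4-bit MSE, merged:  {mse_merged:.4f}")
```

In this toy setting the merged data spans a much narrower range, so the same 16 levels are spaced more finely and the reconstruction error drops.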

Key Novelty

Bimodal Integration & Adaptive Granularity

Bimodal Integration (BIG): Detects channels with bimodal peaks and flips the sign of negative-peak channels (and corresponding weights) to merge two peaks into one normal distribution.
Adaptive Granularity Quantization (AGQ): Dynamically adjusts the base of the logarithmic quantizer for softmax outputs, allowing higher precision for small values or large values depending on the specific attention type (cross vs. self).

Architecture

The overall PTQ4SAM framework illustrating the Bimodal Integration (BIG) and Adaptive Granularity Quantization (AGQ) modules.

Evaluation Highlights

Achieves lossless accuracy relative to the full-precision model on instance segmentation with 6-bit quantization for the SAM-L (Large) and SAM-H (Huge) variants.
Reduces computational cost (FLOPs) by 3.9x and storage by 4.9x for the SAM-L model while maintaining performance within ~0.5% of the original.
Outperforms state-of-the-art PTQ (Post-Training Quantization) methods like PD-Quant and Q-Drop by significant margins across varying bit-widths (e.g., 4-bit, 6-bit).

Breakthrough Assessment

8/10

First dedicated PTQ solution for SAM. The Bimodal Integration strategy is a clever, mathematically equivalent transformation that solves a specific structural issue in SAM, enabling low-bit inference where generic methods fail.

⚙️ Technical Details

Problem Definition

Setting: Post-training quantization of Transformer-based segmentation models (SAM) for efficient inference.

Inputs: Pre-trained Segment Anything Model (SAM) weights and a small unlabeled calibration dataset.

Outputs: Quantized model weights and activation quantization parameters (scale factors, zero points).
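
As a rough illustration of what "scale factors, zero points" mean for activations, the following sketch calibrates an asymmetric uniform quantizer from a handful of values using plain min-max statistics. It is a generic example, not the calibration used in PTQ4SAM; the function names `calibrate_minmax`, `quantize`, and `dequantize` are hypothetical.

```python
import numpy as np

def calibrate_minmax(samples, n_bits):
    """Derive an asymmetric scale and zero point from calibration activations."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(samples.min()), float(samples.max())
    scale = max(hi - lo, 1e-8) / qmax  # guard against a constant tensor
    zero_point = int(np.clip(round(-lo / scale), 0, qmax))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits):
    """Map floats to integer codes in [0, 2**n_bits - 1]."""
    qmax = 2 ** n_bits - 1
    return np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for activations collected on a small unlabeled calibration set.
calib = np.array([-1.0, -0.25, 0.0, 0.5, 3.0], dtype=np.float32)
scale, zp = calibrate_minmax(calib, n_bits=8)
recon = dequantize(quantize(calib, scale, zp, 8), scale, zp)
print(scale, zp, np.max(np.abs(recon - calib)))
```

The scale and zero point are computed once from calibration data and then reused unchanged at inference time, which is why no labels or gradient updates are needed.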

Pipeline Flow

Bimodal Integration (Offline Weight Transformation)
Quantized Inference (Activations & Weights)
Adaptive Granularity Softmax

System Modules

Bimodal Integration (BIG)

Detects bimodal channels in the outputs of the Key linear layers and multiplies the corresponding weights by a per-channel sign factor gamma so that the activations follow a unimodal distribution.

Model or implementation: Mathematical transformation (Sign Flip)
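
Why a sign flip can be mathematically equivalent is easiest to see numerically. The sketch below assumes the same per-channel sign vector gamma is folded into both the query and key projections (the summary mentions Key/Query projection layers), so each channel's contribution to the attention logits Q K^T is multiplied by gamma squared, which is 1. The detection rule (flip channels with a negative mean) and all tensor shapes are hypothetical stand-ins; the paper's actual detection criterion is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_in, d_head = 6, 16, 8

x = rng.normal(size=(tokens, d_in))
w_q = rng.normal(size=(d_in, d_head))
b_q = rng.normal(size=d_head)
w_k = 0.1 * rng.normal(size=(d_in, d_head))
# Large per-channel offsets of either sign mimic the two activation peaks.
b_k = rng.choice([-8.0, 8.0], size=d_head)

q = x @ w_q + b_q
k = x @ w_k + b_k

# Hypothetical detection rule: flip channels whose mean key activation is negative.
gamma = np.where(k.mean(axis=0) < 0, -1.0, 1.0)

# Offline step: fold gamma into the static weights and biases of both projections.
q_f = x @ (w_q * gamma) + b_q * gamma
k_f = x @ (w_k * gamma) + b_k * gamma

logits = q @ k.T
logits_f = q_f @ k_f.T
print("max logit difference:", np.max(np.abs(logits - logits_f)))
print("key channel means after flip:", np.round(k_f.mean(axis=0), 2))
```

Because the flip is absorbed into the weights before deployment, it adds no runtime cost in this sketch; only the distribution seen by the key quantizer changes.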

Adaptive Granularity Quantization (AGQ)

Quantizes post-softmax attention scores using a search-based power-of-two base.

Model or implementation: Logarithmic Quantizer
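
The summary does not spell out the quantizer formula, so the following is only one plausible way to write a variable-base logarithmic quantizer. It assumes the base is 2^(1/tau), so tau = 1 recovers an ordinary log2 quantizer and larger tau places levels more densely near 1 at the expense of dynamic range. The names `log_quantize` and `log_dequantize` are hypothetical.

```python
import numpy as np

def log_quantize(a, n_bits, tau):
    """Log quantizer with assumed base 2**(1/tau); returns integer codes."""
    qmax = 2 ** n_bits - 1
    a = np.maximum(a, 1e-12)  # avoid log(0) for exactly-zero scores
    return np.clip(np.round(-tau * np.log2(a)), 0, qmax).astype(np.int32)

def log_dequantize(q, tau):
    """Reconstruct approximate attention scores from integer codes."""
    return 2.0 ** (-q / tau)

# Example post-softmax scores covering large and very small values.
scores = np.array([0.9, 0.5, 0.3, 0.01, 0.001])
coarse = log_dequantize(log_quantize(scores, 4, tau=1), tau=1)
fine = log_dequantize(log_quantize(scores, 4, tau=4), tau=4)
print("tau=1:", np.round(coarse, 4))
print("tau=4:", np.round(fine, 4))
```

In this toy run with 4 bits, tau = 4 reconstructs 0.9 more closely, while tau = 1 keeps 0.001 within range instead of clipping it. That trade-off is the reason a single fixed base cannot suit attention maps whose scores concentrate in different regions.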

Novel Architectural Elements

Bimodal Integration mechanism: Modifies the static weights of Key/Query projection layers to normalize activation distributions while preserving mathematical equivalence.
Adaptive Granularity Quantizer: Replaces standard uniform or fixed-log softmax quantization with a variable-base log quantizer optimized for matrix-multiplication error.

Modeling

Base Model: Segment Anything Model (SAM) variants: ViT-B, ViT-L, ViT-H

Training Method: Post-Training Quantization (Calibration only)

Objective Functions:

Purpose: Select optimal base tau for AGQ.

Formally: minimize || (A_quant * V) - (A_float * V) ||_F^2 over candidate values of tau, where A is the attention map, V is the value matrix, and * denotes matrix multiplication.
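
The objective above can be sketched as a brute-force search over a few candidate values of tau. This is a minimal illustration under assumptions: the quantizer uses base 2^(1/tau) (an assumed parameterization), the attention map and value matrix are random stand-ins rather than SAM tensors, and the helper names `fake_log_quant`, `search_tau`, and `softmax` are hypothetical.

```python
import numpy as np

def fake_log_quant(a, n_bits, tau):
    """Quantize and dequantize with an assumed base of 2**(1/tau)."""
    qmax = 2 ** n_bits - 1
    q = np.clip(np.round(-tau * np.log2(np.maximum(a, 1e-12))), 0, qmax)
    return 2.0 ** (-q / tau)

def search_tau(attn, value, n_bits, candidates):
    """Pick the tau minimizing the squared Frobenius error of attn @ value."""
    ref = attn @ value
    errors = {
        t: float(np.sum((fake_log_quant(attn, n_bits, t) @ value - ref) ** 2))
        for t in candidates
    }
    best = min(errors, key=errors.get)
    return best, errors

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(8, 32)))   # stand-in attention map
value = rng.normal(size=(32, 16))          # stand-in value matrix
best_tau, errors = search_tau(attn, value, n_bits=4, candidates=[1, 2, 4])
print(best_tau, errors)
```

Measuring the error after the multiplication with V, rather than on the attention scores alone, ties the choice of tau to the quantity the next layer actually consumes.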

Adaptation: Quantization parameters calibrated on small unlabeled set

Trainable Parameters: None by default (weights are frozen). In learning-based PTQ settings such as AdaRound, only the weight rounding is optimized.

Training Data:

Calibration set: Small number of samples (e.g., 32 images) from COCO or SA-1B
Standard PTQ calibration

Key Hyperparameters:

n_bits_weight: 4, 6, 8
n_bits_activation: 4, 6, 8
tau_search_space: {2^0, 2^1, ..., 2^n}

Comparison to Prior Work

vs. MinMax/Percentile: PTQ4SAM specifically targets SAM's bimodal activation outliers which break standard uniform quantization assumptions.
vs. PD-Quant/Q-Drop: PTQ4SAM acts as a plug-and-play enhancement that can be combined with these methods, rather than just competing with them. It adds specific handling for SAM's diverse attention mechanisms.
vs. Reparameterization methods (e.g. for ViT): PTQ4SAM addresses post-Key-Linear bimodality, whereas prior works focused on LayerNorm or other components [not cited in paper].

Limitations

Effectiveness relies on the assumption that bimodal distributions are channel-wise separable; if peaks are mixed within channels, BIG may fail.
Requires hardware support for lookup tables (LUT) to implement AGQ efficiently, though most NPUs support this.
Mainly evaluated on standard vision tasks; performance in highly specialized domains (e.g., medical imaging) has received only minimal testing.

Reproducibility

Code: https://github.com/chengtao-lv/PTQ4SAM

The code is publicly available at the repository above. Experiments use the standard COCO and LVIS datasets, and implementation details for calibration (e.g., 32 samples) are provided.

📊 Experiments & Results

Evaluation Setup

Instance segmentation on COCO/LVIS using SAM backbones.

Benchmarks:

COCO 2017 (Instance Segmentation)
LVIS v1.0 (Instance Segmentation)

Metrics:

mAP (Mean Average Precision)
mIoU (Mean Intersection over Union)
Statistical methodology: Not explicitly reported in the paper
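
For readers unfamiliar with the overlap metric, the sketch below computes IoU for binary masks and averages it over pairs. It is a generic illustration: the averaging convention (per mask here) and the handling of two empty masks vary between benchmarks, and the function names `mask_iou` and `mean_iou` are hypothetical rather than taken from the paper's evaluation code.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over Union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: counted as a perfect match (a convention choice)
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets):
    """Average IoU over paired predicted and ground-truth masks."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))

pred = np.zeros((4, 4), dtype=np.uint8)
pred[:2, :2] = 1          # predicted mask: 4 pixels
target = np.zeros((4, 4), dtype=np.uint8)
target[:2, :3] = 1        # ground truth: 6 pixels, fully containing the prediction

iou_pair = mask_iou(pred, target)                      # 4 / 6
miou = mean_iou([pred, target], [target, target])      # (4/6 + 1) / 2
print(iou_pair, miou)
```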

Key Results

Performance on Instance Segmentation (COCO 2017) using SAM-B with W4A4 (4-bit weights, 4-bit activations) and W6A6 (6-bit weights, 6-bit activations) settings shows PTQ4SAM outperforming standard PTQ baselines.
Benchmark	Metric	Baseline	This Paper	Δ
COCO 2017	mAP	0.1	39.4	+39.3
COCO 2017	mAP	4.2	39.4	+35.2
COCO 2017	mAP	35.3	42.5	+7.2
COCO 2017	mAP	4.3	44.6	+40.3

Ablation study showing the individual contributions of Bimodal Integration (BIG) and Adaptive Granularity Quantization (AGQ) on SAM-B (W6A6).
Benchmark	Metric	Baseline	This Paper	Δ
COCO 2017	mAP	4.3	42.7	+38.4
COCO 2017	mAP	42.7	44.6	+1.9

Experiment Figures

Visual analysis of SAM activation distributions highlighting the challenges for quantization.

Channel-wise visualization of the bimodal phenomenon.

Main Takeaways

Naive quantization (MinMax, Percentile) fails catastrophically on SAM due to bimodal activations, yielding near-zero mAP.
Bimodal Integration (BIG) is the primary driver of performance recovery, fixing the distribution shape for the quantizer.
Adaptive Granularity Quantization (AGQ) provides further refinement, particularly important for handling the diverse attention mechanisms (cross vs self) in SAM.
The method is plug-and-play: it significantly boosts performance when added to existing advanced PTQ methods like PD-Quant and Q-Drop.

📚 Prerequisite Knowledge

Prerequisites

Basics of Post-Training Quantization (uniform vs. logarithmic)
Transformer architecture (Self-Attention, Cross-Attention)
Segment Anything Model (SAM) structure

Key Terms

SAM: Segment Anything Model—a foundation model for image segmentation capable of zero-shot transfer via prompting.

PTQ: Post-Training Quantization—reducing the precision of a pre-trained model (e.g., 32-bit float to 8-bit integer) without full re-training.

QAT: Quantization-Aware Training—re-training a model with simulated quantization errors to adapt weights.

Bimodal Distribution: A probability distribution with two distinct peaks (modes), separated by a sparse region.

Post-Key-Linear: The activations resulting from the linear projection that produces the 'Key' vectors in a Transformer attention block.

BIG: Bimodal Integration—the proposed method to merge bimodal activation peaks into a single unimodal distribution via sign flipping.

AGQ: Adaptive Granularity Quantization—the proposed method to adjust quantization precision for softmax outputs using a logarithmic quantizer whose power-of-two base is selected by search.

mIoU: Mean Intersection over Union—a standard metric for segmentation accuracy measuring overlap between predicted and ground truth masks.

FLOPs: Floating Point Operations—a measure of computational cost.