Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

📝 Paper Summary

Post-Training Quantization (PTQ) · Large Language Model (LLM) Compression

aespa is a post-training quantization method that achieves accuracy comparable to block-wise optimization while running roughly 10x faster, by optimizing attention-output reconstruction with Hessians that are pre-computed once from calibration data.

Core Problem

Existing Post-Training Quantization (PTQ) methods face a dilemma: layer-wise methods (like AdaRound) are fast but inaccurate for LLMs because they ignore cross-layer dependencies, while block-wise methods (like BRECQ) are accurate but computationally prohibitive for billion-parameter models.

Why it matters:

Deploying hyper-scale models (LLMs) on edge devices requires aggressive compression (quantization) to reduce memory and compute costs
Standard block-wise optimization (BRECQ) takes ~20 GPU hours for a small 2.7B model, making it impractical for frequent updates or larger models
Existing fast methods (RTN, GPTQ) often fail at low bit-widths (e.g., INT2) or require unstable training processes

Concrete Example: When quantizing the 6.7B parameter OPT model to INT2, standard methods like OmniQuant result in perplexity > 1000 (collapse), while block-wise methods like BRECQ run out of memory on a single A100 GPU.

Key Novelty

Attention-centric Efficient and Scalable Post-training Quantization Algorithm (aespa)

Quantizes layers individually (for speed) but optimizes them to minimize the reconstruction error of the *entire attention module output* (for accuracy), capturing cross-layer dependencies
Derives a refined objective function that allows the heavy computation (Hessians involving Attention matrices) to be pre-computed once using calibration data
Optimizes Query, Key, and Value projections separately using upper-bound surrogates for the attention error, avoiding repeated softmax calculations during optimization

Architecture

Overview of aespa's quantization strategy compared to existing methods (Layer-wise vs Block-wise)

Evaluation Highlights

Achieves INT2 quantization on LLaMA-7B with 11.94 perplexity, significantly outperforming OmniQuant (18.18) and AffineQuant (18.83)
Reduces quantization time for OPT-1.3B to 1.24 hours, compared to >10 hours for the baseline BRECQ method (approx. 10x speedup)
Uniformly outperforms RTN, OPTQ, and Z-FOLD across OPT, BLOOM, and LLaMA families in INT2 precision settings

Breakthrough Assessment

8/10

Successfully bridges the gap between fast layer-wise and accurate block-wise quantization. The mathematical derivation allowing pre-computation makes advanced optimization feasible for very large models.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Transformer weights W to minimize attention output reconstruction error

Inputs: Pre-trained Transformer model weights W, calibration dataset X

Outputs: Quantized integer weights W_int and quantization parameters (scale s, zero-point z)
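
As a rough illustration of the output format above, the following NumPy sketch maps a float weight matrix to integer codes plus a scale and zero-point using plain round-to-nearest. It is not aespa's procedure, which selects the parameters via Z-FOLD and learns the rounding; the function names, the per-row min-max grouping, and the toy shapes are assumptions made for the example.

```python
import numpy as np

def quantize_weights(W, bits=2):
    """Asymmetric uniform quantization with one (scale, zero-point) per row.

    Hypothetical helper for illustration: uses min-max ranges and
    round-to-nearest, not the learned rounding described in the summary.
    """
    qmax = 2 ** bits - 1                          # INT2 -> integer codes 0..3
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    s = np.maximum(w_max - w_min, 1e-8) / qmax    # scale s
    z = np.round(-w_min / s)                      # zero-point z
    W_int = np.clip(np.round(W / s) + z, 0, qmax).astype(np.int64)
    return W_int, s, z

def dequantize_weights(W_int, s, z):
    """Map integer codes back to approximate float weights."""
    return s * (W_int - z)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))                      # toy weight matrix
W_int, s, z = quantize_weights(W, bits=2)
W_hat = dequantize_weights(W_int, s, z)
```

With `bits=2` every weight is stored as one of four codes, so the reconstruction `W_hat` differs from `W` by an amount on the order of the scale `s`.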

Pipeline Flow

Pre-computation of Hessians and Correlation Matrices (E[XX^T], E[K^TK], etc.)
Quantization Parameter Determination (Scale/Zero-point via Z-FOLD + proposed Hessian)
Integer Weight Optimization (Adaptive Rounding using proposed pre-computed objectives)
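
The first pipeline step can be sketched as a single pass over the calibration data that accumulates second-moment matrices. This is a minimal single-head illustration that assumes inputs X of shape (d, L) and a bias-free key projection; `precompute_statistics` and the toy sizes are hypothetical and not taken from the released code.

```python
import numpy as np

def precompute_statistics(batches, W_K):
    """Average X X^T and K^T K over calibration segments.

    batches: iterable of arrays X with shape (d, L), columns are tokens.
    W_K:     key projection of shape (d_h, d); bias omitted (assumption).
    """
    d_h, d = W_K.shape
    H_x = np.zeros((d, d))
    H_k = np.zeros((d_h, d_h))
    n = 0
    for X in batches:
        K = (W_K @ X).T          # keys, shape (L, d_h)
        H_x += X @ X.T           # accumulates E[X X^T]
        H_k += K.T @ K           # accumulates E[K^T K]
        n += 1
    return H_x / n, H_k / n

rng = np.random.default_rng(0)
W_K = rng.normal(size=(4, 8))
batches = [rng.normal(size=(8, 16)) for _ in range(3)]
H_x, H_k = precompute_statistics(batches, W_K)
```

Once these matrices are stored, the later optimization steps can evaluate their objectives without touching the calibration data again, which is the property the summary credits for the speedup.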

System Modules

Value Projection Quantizer (Quantization Optimization)

Optimize W_V to minimize attention output error

Model or implementation: Layer-wise optimization with attention-aware objective

Query Projection Quantizer (Quantization Optimization)

Optimize W_Q using a surrogate objective to avoid softmax re-computation

Model or implementation: Layer-wise optimization with upper-bound surrogate

Key Projection Quantizer (Quantization Optimization)

Optimize W_K using a surrogate objective

Model or implementation: Layer-wise optimization with upper-bound surrogate

Novel Architectural Elements

Use of attention-output reconstruction error as the objective for individual layer optimization
Modified objective functions for Q, K, V layers that allow total separation of weight updates while retaining cross-layer dependency information via pre-computed statistics
Approximation of attention softmax Jacobian using upper bounds to avoid storing massive L x L x L tensors
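
To give a sense of scale for the L x L x L tensors mentioned above, a back-of-the-envelope calculation helps. The sketch below assumes one such tensor per head and per calibration sample, stored in 16-bit floats, with L = 2048 matching the calibration segment length reported later in this summary; these storage assumptions are illustrative only.

```python
def jacobian_bytes(seq_len, bytes_per_element=2):
    """Bytes needed to hold a dense seq_len x seq_len x seq_len tensor."""
    return seq_len ** 3 * bytes_per_element

L = 2048
gib = jacobian_bytes(L) / 2 ** 30   # 16.0 GiB for a single tensor in fp16
```

A single tensor of this size already occupies a large share of an 80GB GPU, which is consistent with the motivation for replacing it with an upper bound.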

Modeling

Base Model: Evaluated on OPT (125M-6.7B), BLOOM (560M-7.1B), LLaMA (7B-30B), LLaMA2 (7B-13B)

Training Method: Post-Training Quantization (Weight Only)

Objective Functions:

Purpose: Optimize Value projection weights.

Formally: min_{ΔW_V} E[||ΔW_V X A^T||_F^2] = tr(ΔW_V E[X A^T A X^T] ΔW_V^T)

Purpose: Optimize Query projection weights (Surrogate).

Formally: min_{ΔW_Q} E[||K ΔW_Q X||_F^2] ≈ Δw_Q^T (E[XX^T] ⊗ E[K^TK]) Δw_Q, where Δw_Q is ΔW_Q flattened into a vector

Purpose: Optimize Key projection weights (Surrogate).

Formally: min_{ΔW_K} E[||Q ΔW_K X||_F^2] ≈ Δw_K^T (E[XX^T] ⊗ E[Q^TQ]) Δw_K, where Δw_K is ΔW_K flattened into a vector
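
The identities behind the objectives above can be checked numerically on a single toy sample. The sketch below assumes X has shape (d, L), A is an L x L row-stochastic attention map, K has shape (L, d_h), and the lowercase Δw is the column-major vectorization of ΔW; all sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h, L = 6, 4, 5                       # toy sizes (hypothetical)

X = rng.normal(size=(d, L))               # layer input, columns are tokens
A = rng.random(size=(L, L))
A /= A.sum(axis=1, keepdims=True)         # rows sum to 1, like a softmax output
K = rng.normal(size=(L, d_h))             # keys
dW_V = rng.normal(size=(d_h, d))          # weight perturbation for W_V
dW_Q = rng.normal(size=(d_h, d))          # weight perturbation for W_Q

# Value objective: ||dW_V X A^T||_F^2 = tr(dW_V (X A^T A X^T) dW_V^T)
lhs_v = np.linalg.norm(dW_V @ X @ A.T, "fro") ** 2
H_v = X @ A.T @ A @ X.T                   # attention-aware Hessian, shape (d, d)
rhs_v = np.trace(dW_V @ H_v @ dW_V.T)

# Query surrogate: ||K dW_Q X||_F^2 = dw^T (X X^T kron K^T K) dw
lhs_q = np.linalg.norm(K @ dW_Q @ X, "fro") ** 2
dw_q = dW_Q.reshape(-1, order="F")        # column-major vec(dW_Q)
rhs_q = dw_q @ np.kron(X @ X.T, K.T @ K) @ dw_q
```

For one sample both equalities hold exactly up to floating-point error. The approximation sign in the Query and Key objectives enters only when the expectation of the Kronecker product is replaced by the Kronecker product of the two separate expectations.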

Training Data:

128 segments of 2048 tokens randomly sampled from C4 dataset

Key Hyperparameters:

weight_rounding_iterations: 2000
learning_rate: 0.015
rounding_loss_weight: 1.5
batch_size: Full calibration set (via pre-computation)
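
The hyperparameters above resemble those of AdaRound-style learned rounding. As a hedged sketch (forward pass only, no optimizer), the rectified-sigmoid relaxation and rounding regularizer described in the AdaRound paper look as follows. Mapping `rounding_loss_weight` to the regularizer coefficient is an assumption, as are the fixed values of zeta, gamma, and beta used here; AdaRound itself anneals beta during training.

```python
import numpy as np

def soft_round_offset(V, zeta=1.1, gamma=-0.1):
    """Rectified sigmoid: a learnable rounding offset in [0, 1] per weight."""
    sig = 1.0 / (1.0 + np.exp(-V))
    return np.clip(sig * (zeta - gamma) + gamma, 0.0, 1.0)

def rounding_regularizer(V, beta=2.0):
    """Pushes each offset towards 0 or 1 so the final rounding is binary."""
    h = soft_round_offset(V)
    return np.sum(1.0 - np.abs(2.0 * h - 1.0) ** beta)

def soft_quantize(W, s, z, V, bits=2):
    """Dequantized weights using floor plus the learnable offset."""
    qmax = 2 ** bits - 1
    W_int = np.clip(np.floor(W / s) + soft_round_offset(V) + z, 0, qmax)
    return s * (W_int - z)

def total_loss(recon_error, V, rounding_loss_weight=1.5):
    """Reconstruction term plus weighted regularizer (weighting is assumed)."""
    return recon_error + rounding_loss_weight * rounding_regularizer(V)

V_undecided = np.zeros(3)                 # offsets sit at 0.5
V_decided = np.array([20.0, -20.0])       # offsets saturate at 1 and 0
```

An undecided offset (0.5) incurs the maximum penalty of 1 per weight, while a saturated offset incurs none, so the regularizer gradually turns the soft rounding into a hard up-or-down decision.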

Compute: Single NVIDIA A100 (80GB) or H100 GPU. Processing time ~1.24 hours for OPT-1.3B (vs >10 hours for BRECQ).

Comparison to Prior Work

vs. BRECQ: aespa quantizes layer-wise but targets attention reconstruction, achieving similar accuracy roughly 10x faster by avoiding repeated attention map computation
vs. AdaRound: aespa considers Q-K-V dependencies via the attention error objective, whereas AdaRound treats layers independently
vs. OmniQuant: aespa optimizes discrete weight rounding (like AdaRound) rather than approximating gradients for quantization parameters, leading to higher stability in INT2

Limitations

Currently focuses on weight-only quantization; does not address activation quantization
Approximations for Query/Key objectives (upper bounds) might be loose compared to exact block reconstruction
Pre-computation requires memory on the order of d_h * d^2 (quadratic in the hidden size d), though less than storing full Jacobians
Zero-shot evaluation limited to common reasoning tasks (ARC, HellaSwag, MMLU)

Reproducibility

Code: https://github.com/SamsungLabs/aespa

The code is publicly available at the repository above. The method uses a standard calibration dataset (C4). Comparison methods (OmniQuant, AffineQuant, BRECQ) were run using their official implementations and configurations.

📊 Experiments & Results

Evaluation Setup

Weight-only quantization (W3A16, W2A16) evaluated on perplexity and zero-shot reasoning

Benchmarks:

WikiText-2 (language modeling, perplexity)
C4 (language modeling, perplexity)
ARC/HellaSwag/MMLU (zero-shot reasoning)

Metrics:

Perplexity (PPL)
Zero-shot Accuracy
Statistical methodology: Not explicitly reported in the paper
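
For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below uses the standard definition with natural logarithms; the `perplexity` helper and the toy input are illustrative and unrelated to the paper's evaluation code.

```python
import numpy as np

def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to each true token."""
    return float(np.exp(-np.mean(token_log_probs)))

# A model that assigns probability 0.25 to every correct token
# is as uncertain as a uniform choice among 4 options.
log_probs = np.log(np.full(10, 0.25))
ppl = perplexity(log_probs)   # approximately 4
```

Under this reading, a drop from 18.18 to 11.94 means the quantized model is, on average, choosing among noticeably fewer equally likely next tokens.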

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Comparative analysis on LLaMA-7B INT2 quantization shows aespa significantly outperforming recent learnable quantization methods in perplexity.
WikiText-2	Perplexity (INT2)	18.18 (OmniQuant)	11.94	-6.24
WikiText-2	Perplexity (INT2)	18.83 (AffineQuant)	11.94	-6.89
Scalability results on OPT-125M INT2 quantization demonstrate aespa's stability where standard methods fail.
WikiText-2	Perplexity (INT2)	143.9	71.18	-72.72
Zero-shot accuracy evaluation on LLaMA-13B (INT2) confirms perplexity gains translate to downstream tasks.
Average (ARC, HellaSwag, MMLU)	Accuracy (INT2)	43.77	46.91	+3.14
Comparison against OPTQ on BLOOM-7.1B INT2 shows the baseline collapsing while aespa remains stable.
WikiText-2	Perplexity (INT2)	194.2 (OPTQ)	16.31	-177.89

Main Takeaways

aespa enables functional INT2 quantization where baselines often diverge (collapse to PPL > 1000) or degrade significantly
Computational cost is comparable to layer-wise methods, O(d_h d^2), rather than the O(B d_h L^2) of block-wise methods, while accuracy remains comparable to block-wise methods
Proposed Hessian metric (considering cross-layer dependency) consistently yields better quantization parameters than the standard Hessian H=E[XX^T]
The method is robust across model families (OPT, BLOOM, LLaMA) and scales effectively to large parameters (up to 30B tested)

📚 Prerequisite Knowledge

Prerequisites

Post-Training Quantization (PTQ) formulations
Transformer architecture (specifically Self-Attention)
Taylor series approximation (for Hessian estimation)
Kronecker product properties

Key Terms

PTQ: Post-Training Quantization—compressing a model after training without full fine-tuning, usually using a small calibration dataset

Hessian: A matrix of second-order derivatives used to measure the curvature of the loss function; in quantization, it indicates how sensitive the error is to changes in specific weights

BRECQ: Block Reconstruction Quantization—a state-of-the-art PTQ method that optimizes weights to reconstruct the output of a full neural network block (e.g., a Transformer block)

AdaRound: Adaptive Rounding—a method that learns whether to round weights up or down to minimize quantization error, rather than just rounding to the nearest integer

Kronecker product: An operation on two matrices that results in a block matrix; used here to decompose large Hessian matrices into smaller, manageable components

INT2: 2-bit Integer Quantization—representing weights using only 2 bits (4 possible values), a highly aggressive compression level

PPL: Perplexity—a measurement of how well a probability model predicts a sample; lower is better

OPTQ: A popular one-shot PTQ method, also known as GPTQ, that quantizes weights layer-by-layer using second-order information

Z-FOLD: A technique to effectively merge (fold) normalization parameters into weights to improve quantization resilience

Zero-shot task: Evaluating a model on a task it wasn't explicitly trained for, used here to verify reasoning capabilities after quantization