Junhan Kim, Ho-Young Kim, Eulrang Cho, Chung-Kuei Lee, Joonyoung Kim, Yongkweon Jeon
Samsung Labs
International Conference on Machine Learning
(2024)
Pretraining · Memory
📝 Paper Summary
Model Compression · Post-Training Quantization (PTQ) · Large Language Models (LLMs)
BoA is a post-training quantization algorithm that optimizes weights by minimizing the attention reconstruction error rather than just layer-wise error, using a relaxed Hessian approximation to avoid backpropagation.
Core Problem
Existing quantization methods for LLMs either rely on slow backpropagation (impractical for billions of parameters) or assume layer independence (like GPTQ), which neglects how errors propagate through the attention mechanism.
Why it matters:
Neglecting inter-layer dependencies, specifically within the attention module, leads to significant performance degradation in quantized Transformers.
Gradient-based optimization methods (e.g., AdaRound) are too computationally expensive for hyper-scale LLMs.
Computing the exact Hessian for attention layers involves a massive Jacobian of the softmax function, requiring prohibitive memory (e.g., >400GB for a small 125M model).
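As an illustrative back-of-the-envelope check on why exact attention-aware Hessians are infeasible (this assumes OPT-125M's hidden size of 768; the paper's exact ">400GB" figure depends on details not in this summary):

```python
# Illustrative arithmetic only: why an exact attention-aware Hessian is
# infeasible to store. Assumes OPT-125M's hidden size (768); the paper's
# exact ">400GB" figure depends on details not given in this summary.
hidden = 768                      # OPT-125M hidden size (an assumption here)
n_params = hidden * hidden        # one projection matrix, e.g. the Query weights
hessian_entries = n_params ** 2   # a dense Hessian couples all d^2 weights
bytes_fp32 = hessian_entries * 4  # fp32 storage
print(f"{bytes_fp32 / 1e9:.0f} GB")  # ~1392 GB for a single 768x768 layer
```

Even one projection matrix blows past commodity memory, which is why GPTQ-style methods keep only small per-row Hessians and why BoA needs a relaxation.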
Concrete Example: When quantizing the Query projection matrix, GPTQ minimizes the error of the projection output itself. However, this ignores that the output subsequently passes through a softmax and is multiplied by the Value matrix. A small projection error might be amplified or suppressed by this softmax/Value interaction, which GPTQ fails to capture.
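A toy numpy sketch of this gap (illustrative only, not the paper's algorithm): two perturbations of the Query weights constructed to have equal layer-wise error generally induce different attention-block error, which a layer-wise objective cannot tell apart.

```python
import numpy as np

# Toy sketch (not the paper's algorithm): two perturbations of the Query
# weights with equal layer-wise error generally induce *different*
# attention-block error, which a layer-wise objective cannot distinguish.
rng = np.random.default_rng(0)
d, L = 8, 6                              # toy head dim and sequence length
X = rng.normal(size=(L, d))              # calibration activations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(Wq_):                           # output of the whole attention block
    Q, K, V = X @ Wq_, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

E1 = rng.normal(size=(d, d))             # two candidate quantization errors
E2 = rng.normal(size=(d, d))
E2 *= np.linalg.norm(X @ E1) / np.linalg.norm(X @ E2)  # equalize layer-wise error

lw1, lw2 = np.linalg.norm(X @ E1), np.linalg.norm(X @ E2)
a1 = np.linalg.norm(attn(Wq + E1) - attn(Wq))
a2 = np.linalg.norm(attn(Wq + E2) - attn(Wq))
print(f"layer-wise: {lw1:.4f} vs {lw2:.4f}; attention: {a1:.4f} vs {a2:.4f}")
```

To GPTQ's objective the two errors look identical; BoA's attention-reconstruction objective can rank them.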
Key Novelty
Attention-aware Relaxed Hessian Optimization
Approximates the Hessian using the 'attention reconstruction error' (output of the entire attention block) instead of layer-wise error, capturing dependencies between Query, Key, and Value layers.
Introduces a 'Relaxed Hessian' that uses a surrogate upper bound for the error, eliminating the need to compute the memory-intensive Jacobian of the softmax function.
Utilizes properties of the Kronecker product to invert the larger, dependency-aware Hessian matrices efficiently without increasing computational complexity order.
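The Kronecker identity in question can be checked directly; this is a small numpy demo of the algebraic property, not the paper's code:

```python
import numpy as np

# Checking the identity BoA relies on: inv(A kron B) = inv(A) kron inv(B),
# so a large Kronecker-structured matrix is inverted via its small factors.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # diagonal shifts keep the
B = rng.normal(size=(5, 5)) + 5 * np.eye(5)   # factors well-conditioned

direct = np.linalg.inv(np.kron(A, B))                    # one 20x20 inversion
factored = np.kron(np.linalg.inv(A), np.linalg.inv(B))   # two small inversions
print(np.allclose(direct, factored))  # True
```

Inverting the factors costs O(m³ + n³) instead of O(m³n³) for the full Kronecker product, which is why the dependency-aware Hessian adds no complexity order.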
Architecture
Figure: Conceptual illustration of the head-wise simultaneous quantization strategy.
Evaluation Highlights
Achieves a >40x reduction in quantization processing time on a 30B model by using head-wise simultaneous quantization instead of sequential row updates.
Claims to outperform existing backpropagation-free methods (like GPTQ) by a significant margin, particularly in low-bit precision settings (e.g., INT2) (Qualitative claim; exact accuracy numbers not in provided text).
Claims state-of-the-art performance for weight-activation quantization when combined with outlier suppression methods like QuaRot (Qualitative claim).
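The head-wise parallelism can be sketched as follows; the per-head round-to-nearest here is a hypothetical stand-in for BoA's actual update rule, and only illustrates why per-head blocks can be processed independently:

```python
import numpy as np

# Sketch of head-wise splitting (per-head round-to-nearest is a hypothetical
# stand-in for BoA's actual update): row blocks belonging to different heads
# do not interact, so they can be quantized independently -- hence in parallel.
rng = np.random.default_rng(0)
n_heads, hidden = 4, 32
W = rng.normal(size=(hidden, hidden))        # e.g. a Query projection

def quantize_block(block, n_bits=4):
    # symmetric round-to-nearest with one scale per head block
    scale = np.abs(block).max() / (2 ** (n_bits - 1) - 1)
    return np.round(block / scale) * scale

heads = np.split(W, n_heads, axis=0)         # one row block per head
W_q = np.vstack([quantize_block(h) for h in heads])
print(W_q.shape)  # (32, 32)
```

Because no cross-head terms appear, the per-head loop can be replaced by a batched or multi-process update, which is the source of the reported speedup.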
Breakthrough Assessment
7/10
Significant methodological improvement by incorporating inter-layer dependencies into a backprop-free framework. Effectively bridges the gap between fast layer-wise methods (GPTQ) and slow block-wise methods (BRECQ).
⚙️ Technical Details
Problem Definition
Setting: Post-Training Quantization (PTQ) of Large Language Models without access to full training data or backpropagation.
Inputs: Pre-trained LLM weights W, small calibration dataset X
Outputs: Quantized discrete weights W_q minimizing task loss degradation
Modeling
Base Model: LLMs (experiments reported on LLaMA-7B, LLaMA-30B, and OPT-125M)
Code is publicly available at https://github.com/SamsungLabs/BoA. The paper provides full derivations for the Hessians, including adjustments for RoPE. Specific accuracy results and hyperparameters were not included in the provided text snippet.
📊 Experiments & Results
Evaluation Setup
Post-training quantization of pre-trained LLMs using calibration data.
Benchmarks:
LLaMA (7B, 30B) (Language Modeling)
OPT (125M) (Language Modeling)
Metrics:
Processing Time (Quantization Latency)
Perplexity (implied by typical PTQ papers, though numbers missing in text)
Accuracy (implied)
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| LLaMA-30B | Processing time reduction (×) | 1.0 | 40.0 | 39.0 |
Main Takeaways
BoA significantly reduces quantization processing time (over 40x on LLaMA-30B) by exploiting head-wise independence to quantize multiple rows simultaneously.
Unlike GPTQ, the non-diagonal structure of the attention-aware Hessian allows quantization error in one row to be compensated by updating weights in other rows.
The relaxed Hessian formulation effectively bypasses the memory bottleneck of the Softmax Jacobian (which would otherwise require >400GB memory), making attention-aware quantization feasible on standard hardware.
Quantitative accuracy results (Perplexity/Accuracy) are claimed to be SOTA, particularly for low-bit settings, but specific numbers were not present in the provided text snippet.
📖 Glossary
PTQ: Post-Training Quantization—reducing the precision of a model's weights after training is complete, usually using a small calibration dataset.
Hessian: A square matrix of second-order partial derivatives of a function; in quantization, it represents the curvature of the loss landscape and indicates how sensitive the loss is to changes in weights.
GPTQ: Generative Pre-trained Transformer Quantization—a popular PTQ method that quantizes weights layer-by-layer using second-order information (Hessian) assuming layer independence.
Kronecker product: An operation on two matrices that results in a block matrix; BoA uses its algebraic properties (e.g., inverse of Kronecker product is Kronecker product of inverses) to speed up computation.
RoPE: Rotary Positional Embedding—a method for encoding position information in Transformers by rotating the Query and Key vectors; BoA derives specific Hessians to account for this rotation.
Relaxed Hessian: An approximation of the true Hessian matrix introduced by BoA that avoids computing the computationally expensive Jacobian of the softmax function while still bounding the error.
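For the layer-wise setting the Hessian entry describes, there is a standard closed form: the Hessian of the reconstruction error with respect to any single output column of W is 2·XᵀX. A minimal sketch of this standard GPTQ-style derivation (not taken from the provided text):

```python
import numpy as np

# Standard derivation (not from the provided text): for layer-wise error
# ||X(W - W_q)||_F^2, the Hessian w.r.t. any single output column of W is
# H = 2 X^T X -- the second-order information GPTQ-style methods use.
rng = np.random.default_rng(0)
n, d = 256, 16
X = rng.normal(size=(n, d))   # calibration activations, one row per token
H = 2 * X.T @ X               # the same d x d Hessian serves every column
print(H.shape)  # (16, 16)
```

BoA's attention-aware Hessian replaces this small per-column matrix with one that couples Query, Key, and Value weights, which is what makes the Kronecker-product tricks above necessary.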