
A Frustratingly Easy Post-Training Quantization Scheme for LLMs

Yongkweon Jeon, Chungman Lee, Kyungphil Park, Ho-Young Kim
Samsung Research
Conference on Empirical Methods in Natural Language Processing (2023)

📝 Paper Summary

Tags: Model Compression · Post-Training Quantization (PTQ)
Z-FOLD improves low-bit LLM quantization by introducing extra scaling parameters that are mathematically fused into adjacent layers, enhancing accuracy without adding inference overhead.
Core Problem
Post-training quantization of Large Language Models (LLMs) to very low bit-widths (e.g., 2-bit) causes severe accuracy degradation (large loss perturbation), because the standard per-channel scaling factors are too coarse to capture the weight distribution.
Why it matters:
  • Hyper-scale models (100B+ parameters) face severe memory bottlenecks during inference, making 2-bit quantization highly desirable for deployment on commodity hardware
  • Existing methods like OPTQ and RTN suffer from 'collapse' (perplexity explosion) at 2-bit precision, rendering the models unusable
  • Prior solutions often require additional parameters or hardware changes, negating the efficiency gains of quantization
Concrete Example: When quantizing LLaMA-30B to 2-bit precision, the state-of-the-art method OPTQ collapses, resulting in a perplexity of 2065 (garbage output) compared to the FP16 baseline of 4.10. Z-FOLD maintains a perplexity of 9.65.
Key Novelty
Z-FOLD (Rank-1 Decomposition + Folding)
  • Decomposes the quantization step-size matrix into the outer product of two vectors (alpha and zeta), giving each weight a finer-grained effective step size without storing a full matrix of scales
  • Exploits the linear structure of Transformer sub-layers to 'fold' (fuse) the extra vector (zeta) into the weights of the preceding layer (e.g., LayerNorm or a Linear projection)
  • Ensures the final inference model structure is identical to the original, incurring zero additional latency or memory cost at runtime
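The rank-1 step-size decomposition and the folding identity behind these bullets can be sketched as follows. This is a simplified symmetric-quantization toy in NumPy, not the paper's exact calibration procedure: the alternating refinement loop is an illustrative stand-in, and `gamma` is a hypothetical per-channel scale of a preceding layer (e.g., a LayerNorm weight).

```python
import numpy as np

rng = np.random.default_rng(0)
out_dim, in_dim, bits = 8, 16, 2
qmax = 2 ** (bits - 1) - 1            # symmetric 2-bit grid: {-1, 0, 1}
W = rng.normal(size=(out_dim, in_dim))

# Rank-1 step sizes: S[i, j] = alpha[i] * zeta[j].
# alpha: the usual per-output-channel scale; zeta: an extra per-input-channel scale.
zeta = np.ones(in_dim)
for _ in range(20):                   # illustrative alternating refinement
    alpha = np.abs(W / zeta).max(axis=1) / qmax
    Q = np.clip(np.round(W / np.outer(alpha, zeta)), -qmax, qmax)
    # least-squares refit of zeta, column by column, given alpha and Q
    num = (W * alpha[:, None] * Q).sum(axis=0)
    den = ((alpha[:, None] * Q) ** 2).sum(axis=0)
    zeta = np.where(den > 1e-12, np.maximum(num / den, 1e-6), zeta)

# final quantization with the refined scales
alpha = np.abs(W / zeta).max(axis=1) / qmax
Q = np.clip(np.round(W / np.outer(alpha, zeta)), -qmax, qmax)
W_hat = np.outer(alpha, zeta) * Q     # dequantized weight

# Folding: for y = W_hat @ (gamma * x), where gamma is a per-channel scale of
# the preceding layer, zeta is absorbed into gamma exactly, so inference needs
# only the integer Q and the standard per-row alpha -- no extra parameters.
gamma = rng.normal(size=in_dim)
x = rng.normal(size=in_dim)
y_unfolded = W_hat @ (gamma * x)
y_folded = (alpha[:, None] * Q) @ ((gamma * zeta) * x)
assert np.allclose(y_unfolded, y_folded)
```

The last two lines are the crux: because zeta acts along the input dimension, `gamma * zeta` can be precomputed into the preceding layer's weights, leaving the runtime graph byte-for-byte identical to a standard per-channel-quantized model.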
Evaluation Highlights
  • Prevents model collapse on LLaMA-30B at 2-bit precision: achieves 9.65 perplexity vs. OPTQ's 2065 (a usability rescue rather than just improvement)
  • Outperforms OPTQ on OPT-6.7B (2-bit) by reducing perplexity from 348.2 to 19.36 on WikiText-2
  • Achieves state-of-the-art results among post-training quantization methods on zero-shot tasks (e.g., LAMBADA) where baselines fail completely at 2-bit
Breakthrough Assessment
8/10
Significantly extends the viability of 2-bit quantization for LLMs where previous SOTA (OPTQ) failed completely. The 'folding' mechanism is a clever architectural exploitation that adds expressivity with zero inference cost.