ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats

📝 Paper Summary

Post-Training Quantization (PTQ)
Large Language Model Compression

ZeroQuant-FP demonstrates that floating-point quantization (FP8 for activations, FP4 for weights) significantly outperforms integer equivalents for large language models, especially when handling outliers.

Core Problem

Standard integer quantization (INT8/INT4) degrades LLM performance because uniform quantization handles activation outliers poorly, skewing the representation of the main data distribution.

Why it matters:

LLMs are computationally intensive, requiring quantization for efficient deployment on resource-limited hardware
Existing integer methods often cause unacceptable accuracy drops (e.g., perplexity degradation >1.0) in larger models due to activation outliers
New hardware like NVIDIA H100 supports FP8, creating an opportunity for more precise floating-point quantization formats

Concrete Example: In the OPT-1.3B model, activation values in the 'fc2' module are heavily skewed by the ReLU operator, clustering around zero with a few large outliers. INT8 uniform quantization must stretch its evenly spaced levels to cover the outlier range, which leaves too few levels for the small clustered values. FP8 spaces its levels more finely near zero while its exponent bits still reach the outliers, so it represents both.
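The effect described in this example can be reproduced on a toy vector. The sketch below is illustrative only: the activation values are made up, INT8 is modeled as symmetric uniform quantization, and FP8 is modeled as the E4M3 variant with bias 7 and a maximum of 448 (an assumption; the summary does not say which E4M3 variant the paper simulates). All function names are hypothetical.

```python
def e4m3_grid():
    """Non-negative magnitudes of an FP8 E4M3 format (bias 7, max 448).

    Assumption: the variant without infinities, where only the all-ones
    exponent/mantissa pattern is reserved for NaN.
    """
    vals = set()
    for e in range(16):          # 4 exponent bits
        for m in range(8):       # 3 mantissa bits
            if e == 15 and m == 7:
                continue         # reserved for NaN
            if e == 0:
                vals.add(m * 2.0 ** -9)                 # subnormals
            else:
                vals.add((1 + m / 8) * 2.0 ** (e - 7))  # normals
    return sorted(vals)


GRID = e4m3_grid()


def quant_fp8(xs):
    """Scale so the largest magnitude maps to 448, then round to the grid.

    Assumes xs contains at least one non-zero value.
    """
    scale = max(abs(x) for x in xs) / GRID[-1]
    out = []
    for x in xs:
        mag = min(GRID, key=lambda g: abs(g - abs(x) / scale))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def quant_int8(xs):
    """Symmetric uniform quantization to integer levels in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) * scale for x in xs]


def mean_abs_err(ref, approx):
    return sum(abs(a - b) for a, b in zip(ref, approx)) / len(ref)


# Hypothetical post-ReLU activations: small values plus one large outlier.
acts = [0.01, 0.02, 0.03, 0.05, 0.08, 0.1, 20.0]
small = acts[:-1]
err_int8 = mean_abs_err(small, quant_int8(acts)[:-1])
err_fp8 = mean_abs_err(small, quant_fp8(acts)[:-1])
print(f"INT8 error on small values: {err_int8:.4f}")
print(f"FP8  error on small values: {err_fp8:.4f}")
```

On this toy input the INT8 step size is set by the outlier, so the smallest activations round to zero, while the FP8 grid keeps them at a small relative error.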

Key Novelty

ZeroQuant-FP (W4A8 Floating-Point Quantization)

Replaces integer quantization with floating-point formats (FP8 for activations, FP4 for weights) to better handle non-uniform distributions and outliers common in LLMs
Proposes two scaling constraints (power-of-2 scaling) to efficiently cast FP4 weights to FP8 for computation with only minor accuracy loss
Integrates Low Rank Compensation (LoRC) to correct quantization errors in the weight matrix, particularly beneficial for smaller models

Architecture

Histograms and density plots of activation values across different layers (2nd, 12th, 24th) and modules (Attention, MLP) of OPT-1.3b.

Evaluation Highlights

FP8 activation outperforms INT8 activation on LLaMA-7b W4A8, reducing perplexity from 11.48 to 11.08 (lower is better)
FP4 weights surpass INT4 weights; for LLaMA-7b W4A8, FP4 improves perplexity by 0.95 points (15.14 vs 16.09) compared to INT4
W4A8 quantization with FP8 activation and FP4 weights achieves near-lossless performance compared to FP16 baselines for large models like LLaMA-30b (5.92 vs 5.79 PPL)

Breakthrough Assessment

8/10

Strong empirical evidence that FP formats outperform INT formats for LLM quantization, well timed given FP8 support in H100-class hardware. The FP4 weight findings are particularly promising for extreme compression.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) of Large Language Models

Inputs: Pre-trained LLM weights (FP16) and calibration data

Outputs: Quantized model with FP4 weights and FP8 activations

Pipeline Flow

Calibration (compute activation statistics)
Activation Quantization (FP8 token-wise)
Weight Quantization (FP4 fine-grained group-wise)
LoRC Error Correction (optional)

System Modules

Activation Quantizer (Quantization)

Quantize activations to FP8 format (E4M3 preferred)

Model or implementation: FP8 Quantizer

Weight Quantizer (Quantization)

Quantize weights to FP4 format (E2M1 preferred)

Model or implementation: FP4 Quantizer
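As a rough illustration of fine-grained group-wise FP4 quantization, the sketch below rounds each group of weights to the nearest FP4 value using one scale per group. It assumes an E2M1 grid with magnitudes up to 6.0 (bias 1), uses plain round-to-nearest rather than the GPTQ-based procedure the paper adapts, and uses a toy group size of 4 instead of 256. The weights and function names are hypothetical.

```python
import math

# Assumed FP4 E2M1 magnitudes (1 sign, 2 exponent, 1 mantissa bit, bias 1).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(ws):
    """Round one group to FP4 using a single scale shared by the group."""
    amax = max(abs(w) for w in ws)
    if amax == 0.0:
        return [0.0] * len(ws)
    scale = amax / FP4_E2M1[-1]      # largest magnitude maps to 6.0
    out = []
    for w in ws:
        mag = min(FP4_E2M1, key=lambda g: abs(g - abs(w) / scale))
        out.append(math.copysign(mag * scale, w))
    return out


def quantize_fgq(weights, group_size):
    """Quantize consecutive groups independently (fine-grained quantization)."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size]))
    return out


def sq_err(ref, approx):
    return sum((a - b) ** 2 for a, b in zip(ref, approx))


# Hypothetical weight row: a small-magnitude half and a large-magnitude half.
w = [0.1, -0.2, 0.05, 0.6, 3.0, -1.5, 0.8, 6.0]
per_row = quantize_fgq(w, group_size=8)    # one scale for the whole row
per_group = quantize_fgq(w, group_size=4)  # one scale per group of 4
print(sq_err(w, per_row), sq_err(w, per_group))
```

With a single scale for the row, the large weights dictate the step size and the small weights collapse toward zero. Giving each group its own scale lets the small-magnitude group use the full FP4 range.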

Scale Constraint (Optimization)

Restrict weight scaling factors to powers of 2 for efficient hardware shifting

Model or implementation: Bit-shift alignment
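The reason a power-of-2 scale helps can be checked numerically: multiplying a floating-point value by a power of 2 changes only its exponent, so a scaled FP4 value stays exactly representable in FP8. The sketch below is a minimal illustration, assuming E2M1 and E4M3 grids as defined in the comments and a simple round-up-to-the-next-power-of-2 rule. It is not the paper's exact formulation of its two constraints, and all names are hypothetical.

```python
import math

# Assumed positive, non-zero FP4 E2M1 magnitudes (bias 1).
FP4_E2M1 = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def e4m3_values():
    """Non-negative magnitudes of an assumed FP8 E4M3 format (bias 7, max 448)."""
    vals = set()
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue  # reserved for NaN in this variant
            vals.add(m * 2.0 ** -9 if e == 0 else (1 + m / 8) * 2.0 ** (e - 7))
    return vals


def pow2_scale(scale):
    """Round a positive scale up to the nearest power of two."""
    return 2.0 ** math.ceil(math.log2(scale))


E4M3 = e4m3_values()

raw_scale = 0.3                   # hypothetical unconstrained group scale
snapped = pow2_scale(raw_scale)   # constrained scale

# With a power-of-2 scale, every scaled FP4 value lands on the FP8 grid.
exact_with_pow2 = all(v * snapped in E4M3 for v in FP4_E2M1)
# With the raw scale, the products generally need rounding.
exact_with_raw = all(v * raw_scale in E4M3 for v in FP4_E2M1)

# Scaling by a power of two leaves the mantissa untouched.
mant_before, exp_before = math.frexp(1.5)
mant_after, exp_after = math.frexp(1.5 * snapped)
print(snapped, exact_with_pow2, exact_with_raw)
```

Rounding the scale up rather than to the nearest power of 2 is a choice made here so that the largest weight in a group is never clipped. The cost is a coarser step, which is the kind of error LoRC is meant to offset.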

LoRC (Optimization)

Compensate for weight quantization errors using low-rank decomposition

Model or implementation: SVD-based error approximation
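The SVD-based compensation can be written out as follows. This is a sketch of the generic low-rank approximation idea; the symbols are illustrative rather than taken from the paper, with W the original FP16 weight matrix, Ŵ its quantized version, and m the chosen rank (the lorc_dimension hyperparameter).

```latex
\begin{aligned}
E &= W - \hat{W} && \text{(weight quantization error)} \\
E &= U \Sigma V^{\top} && \text{(singular value decomposition)} \\
\hat{E} &= U_{m} \Sigma_{m} V_{m}^{\top} && \text{(keep the top } m \text{ singular values)} \\
W &\approx \hat{W} + \hat{E} && \text{(compensated weight)}
\end{aligned}
```

For a weight matrix of shape d_out × d_in, storing the two rank-m factors adds m(d_out + d_in) parameters, which stays small when m is 8 to 56 and the matrix dimensions are in the thousands.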

Novel Architectural Elements

Power-of-2 scaling constraints for weight quantization to enable bit-shifting casting from FP4 to FP8 without expensive dequantization
Application of LoRC specifically to floating-point quantization (FP-LoRC) to mitigate errors from scale constraints

Modeling

Base Model: LLaMA (7B, 13B, 30B) and OPT (1.3B, 6.7B, 13B, 30B)

Training Method: GPTQ, adapted for FP formats (no retraining; post-training quantization only)

Training Data:

128 random sentences from C4 dataset for calibration
2048 tokens per sentence

Key Hyperparameters:

weight_group_size: 256 (320 for LLaMA-3b)
lorc_dimension: 8 (LLaMA), 16-56 (OPT)
activation_format: FP8 (E4M3)
weight_format: FP4 (E2M1)

Compute: Single V100-32GB GPU for quantization process

Comparison to Prior Work

vs. ZeroQuant-V2/SmoothQuant: Uses Floating-Point (FP8/FP4) formats instead of Integer (INT8/INT4) to better handle distributions without explicit smoothing
vs. GPTQ: Extends the optimization framework to support FP formats and combines with activation quantization (W4A8) rather than weight-only
vs. AWQ [cited in paper]: Focuses on format efficiency (FP vs INT) rather than activation-aware weight protection, though compatible in principle
vs. SpQR [not cited in paper]: Addresses outliers via format precision rather than sparse outlier isolation

Limitations

Hardware dependency: FP8/FP4 benefits are most realizable on specific newer hardware (e.g., NVIDIA H100)
Small model sensitivity: Smaller models (e.g., OPT-1.3b) still show degradation compared to larger ones, even with LoRC
Scale constraints: Enforcing power-of-2 scaling for efficiency causes minor quality drops, requiring LoRC to recover
No QAT comparison: Study is strictly Post-Training Quantization (PTQ) and does not explore Quantization-Aware Training

Reproducibility

Code: https://github.com/microsoft/DeepSpeed

Code will be released as part of DeepSpeed (https://github.com/microsoft/DeepSpeed). Uses public LLaMA and OPT checkpoints from Hugging Face. C4 dataset used for calibration. Detailed hyperparameters (group sizes, LoRC ranks) provided.

📊 Experiments & Results

Evaluation Setup

Perplexity evaluation on language modeling tasks

Benchmarks:

WikiText-2 (Language Modeling)
PTB (Language Modeling)
C4 (Language Modeling)

Metrics:

Perplexity (PPL)
Statistical methodology: Not explicitly reported in the paper
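For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from per-token log-probabilities and averages it across datasets. The inputs are made-up numbers, the function names are hypothetical, and treating the "Mean (WIKI/PTB/C4)" figures as an arithmetic mean of the three per-dataset perplexities is an assumption.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    n = len(token_logprobs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / n)


def mean_perplexity(ppls):
    """Arithmetic mean over per-dataset perplexities (assumed aggregation)."""
    return sum(ppls) / len(ppls)


# A model that assigns probability 1/4 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))

# Hypothetical per-dataset perplexities, not values from the paper.
print(mean_perplexity([10.0, 20.0, 30.0]))
```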

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FP8 Activation generally outperforms INT8 Activation, especially in W4A8 settings where precision is critical.
WikiText-2	Perplexity	6.44	6.32	-0.12
WikiText-2	Perplexity	5.32	5.26	-0.06
FP4 Weights show significant improvement over INT4 Weights when Activations are fixed to FP8/INT8.
Mean (WIKI/PTB/C4)	Perplexity	16.09	15.14	-0.95
Mean (WIKI/PTB/C4)	Perplexity	11.31	11.08	-0.23
LoRC integration effectively recovers performance lost due to quantization errors, particularly in smaller models.
Mean (WIKI/PTB/C4)	Perplexity	15.14	13.95	-1.19
Mean (WIKI/PTB/C4)	Perplexity	15.14	14.49	-0.65

Experiment Figures

Visual comparison of quantization error between INT8 Asymmetric, FP8 E5M2, and FP8 E4M3 on a vector with an outlier.

Main Takeaways

FP8 activation consistently outperforms INT8 activation, with the gap widening for larger models (>6.7B parameters).
FP4 is a superior format for weight quantization compared to INT4, offering better fidelity for the heavy-tailed weight distributions.
Constraining weight scales to powers of 2 (for hardware efficiency) incurs minimal loss, which can be fully recovered or even improved upon using LoRC.
Activation outliers (especially in 'fc2' layers due to ReLU) are the primary cause of INT8 failure; FP8's dynamic range handles these naturally.

📚 Prerequisite Knowledge

Prerequisites

Understanding of quantization concepts (scale, zero-point, precision)
Familiarity with floating-point formats (exponent vs mantissa bits)
Knowledge of LLM architecture (Transformer, Attention, MLP)

Key Terms

PTQ: Post-Training Quantization—reducing model precision after training without full re-training

FP8: 8-bit Floating Point format, specifically E4M3 (4 exponent, 3 mantissa) or E5M2 (5 exponent, 2 mantissa) in this paper

FP4: 4-bit Floating Point format, specifically E2M1 (2 exponent, 1 mantissa) in this paper

W4A8: Quantization scheme using 4-bit weights and 8-bit activations

LoRC: Low Rank Compensation, an error-correction method that approximates the weight quantization error with a low-rank matrix decomposition and adds that approximation back to the quantized weights

PPL: Perplexity—a metric for measuring how well a probability model predicts a sample; lower values indicate better performance

FGQ: Fine-Grained Quantization—applying quantization parameters at a granular level (e.g., per group of weights) rather than per tensor

outliers: Extreme values in activation distributions that skew uniform quantization ranges, causing loss of precision for the majority of data