Efficient Post-training Quantization with FP8 Formats

📝 Paper Summary

Model Compression, Efficient Inference

This paper demonstrates that FP8 formats outperform INT8 for post-training quantization by providing better dynamic range for outliers in NLP models and higher precision for computer vision tasks.

Core Problem

INT8 quantization has a limited dynamic range. Large activation outliers in modern deep learning models (especially LLMs) therefore cause significant accuracy loss, and recovering it often requires complex calibration or a fallback to higher precision.

Why it matters:

Large Language Models (LLMs) contain massive outliers (e.g., in LayerNorm) that break standard integer quantization schemes.
Current INT8 methods often fail to maintain accuracy without complex, hardware-unfriendly workarounds like mixed-precision or outlier suppression.
About a third of the evaluated workloads cannot be quantized to INT8 within a 1% accuracy loss (INT8 coverage is 65.87%), limiting deployment efficiency on resource-constrained devices.

Concrete Example: In INT8 quantization, a single large outlier value stretches the quantization grid, reducing the precision available for the vast majority of small values near zero. The paper shows that for Stable Diffusion, INT8 produces image artifacts (e.g., loss of detail on an astronaut's suit), whereas FP8 formats generate smooth, high-quality images comparable to the original.
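The grid-stretching effect can be reproduced with a small sketch. This is not the paper's code: the E4M3 rounding helper is a simplified emulation that assumes a bias of 7 and a largest finite value of 448, and the data values are made up for illustration.

```python
import math

def int8_fake_quant(x, scale):
    """Symmetric INT8: round to the nearest of 255 evenly spaced levels."""
    q = max(-127, min(127, round(x / scale)))
    return q * scale

def e4m3_fake_quant(x, scale, man_bits=3, bias=7, max_value=448.0):
    """Round x/scale to the nearest value on a simplified E4M3 grid, then rescale."""
    y = x / scale
    if y == 0.0:
        return 0.0
    mag = min(abs(y), max_value)        # saturate instead of overflowing
    _, e = math.frexp(mag)              # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, 1 - bias)          # clamp to the subnormal exponent
    step = 2.0 ** (exp - man_bits)      # grid spacing inside this binade
    q = min(round(mag / step) * step, max_value)
    return math.copysign(q, y) * scale

# Hypothetical tensor: many small values plus one large outlier.
small = [0.01 * i for i in range(1, 21)]   # 0.01 .. 0.20
values = small + [60.0]
amax = max(abs(v) for v in values)

# Max scaling: the outlier alone determines both scales.
int8_scale = amax / 127.0
fp8_scale = amax / 448.0

int8_out = [int8_fake_quant(v, int8_scale) for v in small]
fp8_out = [e4m3_fake_quant(v, fp8_scale) for v in small]

int8_err = sum(abs(a - b) for a, b in zip(small, int8_out)) / len(small)
fp8_err = sum(abs(a - b) for a, b in zip(small, fp8_out)) / len(small)
print(f"INT8 mean abs error on small values: {int8_err:.4f}")
print(f"E4M3 mean abs error on small values: {fp8_err:.4f}")
```

With these made-up values, every small value rounds to zero under INT8 because the outlier forces a step of about 0.47, while the E4M3-style grid keeps each small value within roughly 6% of its original magnitude.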

Key Novelty

Unified FP8 Post-Training Quantization Workflow

Evaluates three FP8 variants (E5M2, E4M3, E3M4) to balance dynamic range versus precision across 75+ diverse models.
Introduces a mixed-precision strategy assigning different FP8 formats to weights (precision-bound) and activations (range-bound) based on their specific distributions.
Extends quantization to sensitive operations usually left in float32, such as LayerNorm and BatchNorm, which FP8 can handle due to its non-uniform grid.
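The range-versus-precision trade-off between the three formats can be made concrete by computing their representable limits. The sketch below rests on assumptions this summary does not state: exponent biases of 15, 7, and 3, an IEEE-style E5M2 that reserves the top exponent for Inf/NaN, and E4M3/E3M4 encodings that reserve only a single NaN code. The helper name `fp8_limits` is hypothetical.

```python
def fp8_limits(exp_bits, man_bits, bias, reserves_top_exponent):
    """Return (largest finite value, smallest positive subnormal)."""
    top_field = 2 ** exp_bits - 1
    if reserves_top_exponent:
        # IEEE-style: the all-ones exponent field encodes only Inf/NaN.
        max_exp = top_field - 1 - bias
        max_frac = 2.0 - 2.0 ** -man_bits
    else:
        # Extended: only the all-ones mantissa at the top exponent is NaN.
        max_exp = top_field - bias
        max_frac = 2.0 - 2.0 ** (1 - man_bits)
    max_value = max_frac * 2.0 ** max_exp
    min_subnormal = 2.0 ** (1 - bias - man_bits)
    return max_value, min_subnormal

# (exponent bits, mantissa bits, bias, IEEE-style top exponent): assumptions
FORMATS = {
    "E5M2": (5, 2, 15, True),
    "E4M3": (4, 3, 7, False),
    "E3M4": (3, 4, 3, False),
}

for name, spec in FORMATS.items():
    max_value, min_sub = fp8_limits(*spec)
    print(f"{name}: max={max_value:g}  min_subnormal={min_sub:.3g}  "
          f"max/min ratio={max_value / min_sub:.3g}")
```

Under these assumptions, each exponent bit traded for a mantissa bit halves the relative rounding step but shrinks the max/min ratio by several orders of magnitude, which is the trade-off the format comparison explores.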

Architecture

The Post-Training FP8 Quantization Workflow.

Evaluation Highlights

FP8 achieves 92.64% workload coverage (models meeting 1% accuracy loss threshold) compared to only 65.87% for INT8 across 75 models.
E4M3 is identified as the optimal format for NLP models (96.32% coverage), while E3M4 is marginally ahead of the other formats on Computer Vision (78.95% coverage).
FP8 allows LayerNorm and BatchNorm layers to be quantized without accuracy loss, unlike INT8, which typically requires keeping them in FP32.

Breakthrough Assessment

7/10

Strong empirical evidence across a massive set of models (75+) establishing FP8 as a superior standard over INT8 for future hardware. While the methods are standard PTQ techniques applied to new formats, the scale and analysis of the E4M3/E3M4 trade-offs are significant.

⚙️ Technical Details

Problem Definition

Setting: Post-training quantization of weights and activations from FP32 to 8-bit Floating Point formats.

Inputs: Pre-trained FP32 Deep Neural Networks (CNNs, Transformers, LLMs, Diffusion models).

Outputs: Quantized FP8 models (weights and activations) maintaining <1% relative accuracy loss.

Pipeline Flow

FP32 Model Input
Standard Quantization Scheme (Conv, Linear, Embedding)
Extended Quantization Scheme (LayerNorm, BatchNorm, etc.)
Mixed Format Selection (Weights vs. Activations)
Quantized Model Output

System Modules

Standard Quantization (Quantization Core)

Quantize common compute-heavy operators (Convolution, Linear, Embedding) using per-channel scaling for weights and per-tensor scaling for activations.

Model or implementation: Applies E5M2, E4M3, or E3M4
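To illustrate why weights get one scale per channel, the sketch below compares per-channel and per-tensor max scaling on a made-up weight matrix. It is a simplified emulation, not the paper's implementation: the E4M3 parameters (bias 7, maximum 448) and all helper names are assumptions.

```python
import math

E4M3_MAX = 448.0  # assumed largest finite E4M3 value

def e4m3_round(y, man_bits=3, bias=7):
    """Round y to the nearest value on a simplified E4M3 grid (saturating)."""
    if y == 0.0:
        return 0.0
    mag = min(abs(y), E4M3_MAX)
    _, e = math.frexp(mag)              # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, 1 - bias)          # clamp to the subnormal exponent
    step = 2.0 ** (exp - man_bits)
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), y)

def max_scale(values):
    """Max scaling: map the largest magnitude onto the format maximum."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def fake_quant(values, scale):
    """Quantize then dequantize, so the error can be measured in FP32 units."""
    return [e4m3_round(v / scale) * scale for v in values]

def mean_abs_err(ref, out):
    return sum(abs(a - b) for a, b in zip(ref, out)) / len(ref)

# Hypothetical weight matrix: rows are output channels.
weights = [
    [2.0e-5, -1.5e-5, 7.0e-6, 3.1e-5],  # channel 0: tiny weights
    [1.9, -2.4, 0.8, 3.2],              # channel 1: large weights
]

# Per-channel: each row gets its own scale.
per_channel = [fake_quant(row, max_scale(row)) for row in weights]

# Per-tensor: one scale shared by the whole matrix.
tensor_scale = max_scale([v for row in weights for v in row])
per_tensor = [fake_quant(row, tensor_scale) for row in weights]

for c, row in enumerate(weights):
    print(f"channel {c}: per-channel err={mean_abs_err(row, per_channel[c]):.3g}  "
          f"per-tensor err={mean_abs_err(row, per_tensor[c]):.3g}")
```

In this example the shared scale pushes channel 0 into the subnormal range, so its error is larger than with a per-channel scale, while channel 1 is quantized identically either way.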

Extended Quantization (Quantization Core)

Quantize memory-bound and sensitive operators (LayerNorm, BatchNorm, Add, Mul) usually skipped in INT8.

Model or implementation: FP8 Formats

Format Selection Strategy

Assign specific FP8 formats based on distribution (e.g., E4M3 for range-bound activations, E3M4 for precision-bound weights).

Model or implementation: Heuristic Selection
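One way to picture distribution-based format selection is an error-driven search: quantize a sample of the tensor with each format and keep the one with the lowest mean squared error. This is a swapped-in illustration, not the heuristic the paper uses. The format parameters, the helper names, and both sample tensors are assumptions.

```python
import math

# (exponent bits, mantissa bits, bias, largest finite value): assumed encodings
FORMATS = {
    "E5M2": (5, 2, 15, 57344.0),
    "E4M3": (4, 3, 7, 448.0),
    "E3M4": (3, 4, 3, 30.0),
}

def fp8_round(y, man_bits, bias, max_value):
    """Round y to the nearest value on a simplified FP8 grid (saturating)."""
    if y == 0.0:
        return 0.0
    mag = min(abs(y), max_value)
    _, e = math.frexp(mag)              # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, 1 - bias)          # clamp to the subnormal exponent
    step = 2.0 ** (exp - man_bits)
    return math.copysign(min(round(mag / step) * step, max_value), y)

def quant_mse(values, fmt):
    """Mean squared error after max-scaled quantize/dequantize in one format."""
    _, man_bits, bias, max_value = FORMATS[fmt]
    amax = max(abs(v) for v in values)
    scale = amax / max_value if amax > 0 else 1.0
    errs = [(v - fp8_round(v / scale, man_bits, bias, max_value) * scale) ** 2
            for v in values]
    return sum(errs) / len(errs)

def pick_format(values):
    """Return (best format name, per-format MSE dict)."""
    errors = {fmt: quant_mse(values, fmt) for fmt in FORMATS}
    return min(errors, key=errors.get), errors

# Hypothetical samples: a narrow weight-like tensor and an activation-like
# tensor with a single large outlier.
weights_like = [0.93, -0.71, 0.52, -0.34, 1.0]
acts_like = [0.1, -0.2, 0.3, 0.45, 100.0]

print("weights ->", pick_format(weights_like)[0])
print("activations ->", pick_format(acts_like)[0])
```

For these two samples the search picks E3M4 for the weight-like tensor and E4M3 for the tensor with the outlier, matching the precision-bound versus range-bound intuition described above.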

Novel Architectural Elements

Mixed FP8 format assignment: Using different FP8 variants (e.g., E4M3 vs. E3M4) within the same model, or even within a single operation (weights vs. activations), based on distribution analysis.
Full quantization of sensitive layers: Applying quantization to LayerNorm and BatchNorm, which are typically kept in FP32 in standard INT8 pipelines.

Modeling

Base Model: Evaluated on 75 distinct architectures including LLaMA, BLOOM, BERT, ResNet, ViT, Stable Diffusion, YOLOv3.

Training Method: Post-training quantization with calibration (no gradient updates/backprop)

Adaptation: BatchNorm statistics tuning (re-estimating mean/variance) for Computer Vision models using augmented calibration data.

Training Data:

Calibration sets (typically small, e.g., 3K samples)
Data augmentation used for BatchNorm calibration
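Re-estimating BatchNorm statistics amounts to recomputing the per-channel mean and variance from activations observed on the calibration data. The sketch below is a minimal illustration under simplifying assumptions: each sample carries one value per channel (real feature maps would also be reduced over spatial positions), and the function name and data layout are hypothetical.

```python
def recalibrate_bn_stats(batches):
    """Re-estimate per-channel mean and biased variance over all batches.

    batches: iterable of batches; each batch is a list of samples and each
    sample is a list holding one activation value per channel.
    """
    count = 0
    total = None
    total_sq = None
    for batch in batches:
        for sample in batch:
            if total is None:
                total = [0.0] * len(sample)
                total_sq = [0.0] * len(sample)
            for c, v in enumerate(sample):
                total[c] += v
                total_sq[c] += v * v
            count += 1
    if count == 0:
        raise ValueError("no calibration samples provided")
    mean = [t / count for t in total]
    # Var[x] = E[x^2] - E[x]^2, computed per channel
    var = [sq / count - m * m for sq, m in zip(total_sq, mean)]
    return mean, var

# Hypothetical activations, as if collected from the quantized model on
# (augmented) calibration data: 2 batches, 2 samples each, 2 channels.
calib_batches = [
    [[1.0, 10.0], [3.0, 14.0]],
    [[5.0, 12.0], [7.0, 12.0]],
]
mean, var = recalibrate_bn_stats(calib_batches)
print("mean:", mean)
print("var:", var)
```

The resulting values would replace the stored running mean and variance of the corresponding BatchNorm layer; no gradient updates are involved.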

Key Hyperparameters:

calibration_sample_size: 3000 (recommended)
scaling_method: Max scaling (for E4M3/E3M4)
weight_scaling: Per-channel
activation_scaling: Per-tensor

Compute: Software emulation framework (FP8 Emulation Toolkit) running on FP32 hardware.

Comparison to Prior Work

vs. INT8: FP8 provides non-uniform grid points, better handling the 'long-tailed' distributions of neural network data without complex clipping.
vs. SmoothQuant: FP8 formats (specifically E4M3) natively handle outliers via exponent bits, reducing the need for pre-quantization smoothing transformations.
vs. Integer-only LayerNorm: FP8 can quantize LayerNorm/BatchNorm layers, which usually break INT8 accuracy, enabling more comprehensive model compression.

Limitations

Results are based on software emulation, not native hardware performance (latency/energy) measurements.
First convolution and last fully-connected layers in CNNs still require higher precision (or careful handling) to maintain accuracy.
Dynamic quantization showed benefits for NLP but not consistently across all model types.
Requires hardware support for multiple FP8 formats (E5M2, E4M3, E3M4) to fully utilize the proposed mixed-precision benefits.

Reproducibility

The paper uses the FP8 Emulation Toolkit and Neural Compressor, but the paper text does not state code availability. The evaluated models are open-source (Hugging Face, TorchVision); the quantization scripts are not explicitly linked.

📊 Experiments & Results

Evaluation Setup

Post-training quantization on >200 tasks across 75 models.

Benchmarks:

ImageNet ILSVRC 2012 (Image Classification)
COCO2014 (Object Detection)
LAMBADA (OpenAI) (Language Modeling)
MRPC / CoLA / STS-B (GLUE, Text Classification)
LibriSpeech (Speech Recognition)

Metrics:

Pass Rate (% of models with <1% relative accuracy loss)
Top-1 Accuracy
F1 Score
FID (Fréchet Inception Distance)
mAP (mean Average Precision)
Statistical methodology: Not explicitly reported in the paper
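The pass-rate metric can be sketched as follows. The scores are hypothetical and not taken from the paper, the function names are illustrative, and the sketch assumes higher-is-better metrics (a lower-is-better metric such as FID would need the sign of the loss flipped).

```python
def relative_loss(fp32_score, quant_score):
    """Relative accuracy loss of the quantized model versus FP32."""
    return (fp32_score - quant_score) / fp32_score

def pass_rate(results, threshold=0.01):
    """Percentage of models whose relative loss stays below the threshold."""
    passed = sum(1 for fp32, quant in results
                 if relative_loss(fp32, quant) < threshold)
    return 100.0 * passed / len(results)

# Hypothetical (fp32_score, quantized_score) pairs, one per model.
results = [
    (76.1, 75.9),   # loss ~0.3%: pass
    (88.0, 83.0),   # loss ~5.7%: fail
    (80.0, 79.5),   # loss ~0.6%: pass
    (71.3, 70.1),   # loss ~1.7%: fail
]
print(f"pass rate: {pass_rate(results):.2f}%")
```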

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Overall workload coverage: FP8 formats significantly outperform INT8 (the baseline) in keeping model accuracy within 1% of FP32.
75 unique network architectures	Pass Rate (Workload Coverage)	65.87%	92.64%	+26.77 pp
Per-domain coverage: E4M3 is best for NLP, while E3M4 leads in Computer Vision. The Baseline values in these two rows appear to be the runner-up FP8 format rather than INT8.
NLP Models (38 networks)	Pass Rate (Coverage)	92.11%	96.32%	+4.21 pp
CV Models (34 networks)	Pass Rate (Coverage)	73.68%	78.95%	+5.27 pp
Case study on BERT-Large (SQuAD): FP8 matches FP32 accuracy where INT8 fails significantly.
BERT-Large (SQuAD)	F1 Score	86.11	90.87	+4.76

Experiment Figures

Comparison of value distribution and quantization error for FP8 formats vs INT8 on Gaussian data.

Main Takeaways

FP8 formats consistently outperform INT8 in workload coverage (92.64% vs 65.87%), effectively making quantization viable for a much broader range of models without retraining.
There is a clear split in optimal formats: NLP models prefer E4M3 (balancing range for outliers), while Computer Vision models prefer E3M4 (higher precision for weights).
Mixed precision strategies (e.g., E4M3 for activations, E3M4 for weights) further optimize accuracy, particularly for Transformer-based linear layers.
FP8 enables the quantization of sensitive layers like LayerNorm and BatchNorm, which are typically bottlenecks in INT8 pipelines, simplifying the deployment graph.

📚 Prerequisite Knowledge

Prerequisites

Understanding of floating-point representation (sign, exponent, mantissa)
Basics of neural network quantization (calibration, scaling factors)
Familiarity with INT8 vs. FP32 data formats

Key Terms

FP8: 8-bit Floating Point format. Variations include E5M2 (5 exponent bits), E4M3 (4 exponent bits), and E3M4 (3 exponent bits).

E5M2: An FP8 format with 5 exponent bits and 2 mantissa bits, offering high dynamic range but lower precision; it has the same exponent width as IEEE FP16.

E4M3: An FP8 format with 4 exponent bits and 3 mantissa bits, offering a balance of range and precision.

E3M4: An FP8 format with 3 exponent bits and 4 mantissa bits, offering higher precision but limited dynamic range.

Post-training Quantization (PTQ): Compressing a model after training without retraining it, usually using a small calibration dataset.

Mantissa: The part of a floating-point number that represents significant digits (precision).

Dynamic Range: The ratio between the largest and smallest non-zero magnitudes a format can represent.

LayerNorm: Layer Normalization, a technique to stabilize training, known to produce large outliers in LLMs.

Calibration: The process of estimating the range of activation values (e.g., min/max) to determine scaling factors for quantization.

Per-channel scaling: Assigning a separate scaling factor to each channel of a weight tensor to minimize error.

FID: Fréchet Inception Distance, a metric used to assess the quality of images generated by generative models (lower is better).