PTQD: Accurate Post-Training Quantization for Diffusion Models

📝 Paper Summary

Efficient Diffusion Models, Model Quantization

PTQD enables accurate low-bit diffusion models without retraining by disentangling quantization noise into correlated and uncorrelated components for separate correction and using step-aware mixed precision to preserve signal-to-noise ratio.

Core Problem

Directly quantizing diffusion models creates noise that biases the estimated mean and clashes with the variance schedule, while accumulating errors lead to severe SNR degradation in later denoising steps.

Why it matters:

Diffusion models are computationally expensive and slow, which makes quantization attractive, but their iterative sampling makes them far more sensitive to quantization noise than single-pass models such as CNNs
Existing Post-Training Quantization (PTQ) methods treat diffusion models as generic networks, ignoring the specific signal-to-noise ratio requirements of the denoising process
Retraining (Quantization-Aware Training) is resource-intensive; effective PTQ is needed for deployment on edge devices

Concrete Example: When a latent diffusion model is quantized to 4-bit weights and 8-bit activations (W4A8) with a standard method such as Q-Diffusion, generated faces on CelebA-HQ show severe artifacts (Figure 1). In late denoising steps, the signal-to-noise ratio of a W4A4 model drops to near 1, meaning the quantization noise is as strong as the signal itself.
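As a rough illustration of what an SNR near 1 means, the sketch below computes the ratio of full-precision output power to quantization-noise power. The function name `quantization_snr` and the toy arrays are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def quantization_snr(eps_fp, eps_q):
    """Ratio of signal power to quantization-noise power (linear scale, not dB)."""
    eps_fp = np.asarray(eps_fp, dtype=np.float64)
    noise = np.asarray(eps_q, dtype=np.float64) - eps_fp
    return float(np.sum(eps_fp ** 2) / np.sum(noise ** 2))

# Toy case: the noise carries exactly as much power as the signal.
eps_fp = np.array([1.0, -1.0, 1.0, -1.0])
eps_q = eps_fp + np.array([1.0, 1.0, -1.0, -1.0])
print(quantization_snr(eps_fp, eps_q))  # 1.0
```

In practice the two inputs would be the full-precision and quantized network outputs for the same x_t and t, collected per timestep.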

Key Novelty

Unified Quantization Noise Correction & Step-Aware Mixed Precision

Disentangles quantization noise into a correlated part (linearly related to the signal) and an uncorrelated residual; corrects the former by rescaling and the latter by bias subtraction
Calibrates the diffusion variance schedule to 'absorb' the extra variance from uncorrelated quantization noise, effectively hiding the error within the generative process's inherent stochasticity
Allocates higher activation bitwidths only to later denoising steps where Signal-to-Noise Ratio (SNR) is critical, using lower bitwidths elsewhere to maximize speed
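The decomposition described above can be written out as follows. This is a sketch in the summary's own terms; the symbols \Delta, \Delta', \mu_q and \sigma_q^2 are notation chosen here for illustration and may differ from the paper's.

```latex
\begin{align}
\hat{\epsilon}_\theta(x_t, t) &= \epsilon_\theta(x_t, t) + \Delta,
  \qquad \Delta = k\,\epsilon_\theta(x_t, t) + \Delta' \\
\frac{\hat{\epsilon}_\theta(x_t, t)}{1 + k} &= \epsilon_\theta(x_t, t) + \frac{\Delta'}{1 + k} \\
\frac{\hat{\epsilon}_\theta(x_t, t) - \mu_q}{1 + k} &= \epsilon_\theta(x_t, t) + \frac{\Delta' - \mu_q}{1 + k},
  \qquad \operatorname{Var}\!\left[\frac{\Delta' - \mu_q}{1 + k}\right] = \frac{\sigma_q^2}{(1 + k)^2}
\end{align}
```

Here \hat{\epsilon}_\theta is the quantized network output, k\,\epsilon_\theta is the correlated part, and \Delta' is the uncorrelated residual with mean \mu_q and variance \sigma_q^2. Dividing by 1 + k removes the correlated part, subtracting \mu_q removes the bias, and the remaining zero-mean term is what the variance schedule is calibrated to absorb.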

Architecture

The workflow for correcting quantization noise during the diffusion sampling process

Evaluation Highlights

Achieves 6.44 FID on ImageNet 256x256 with mixed precision (W4A8/W4A4), outperforming state-of-the-art Q-Diffusion (9.97 FID) by a large margin
Matches full-precision quality on ImageNet at W4A8, with FID rising by only 0.06 (5.05 to 5.11), while reducing bit operations by 19.96x
Prevents catastrophic failure on LSUN-Churches (mixed precision), achieving 17.99 FID versus Q-Diffusion's 218.59 FID

Breakthrough Assessment

8/10

Offers a mathematically grounded solution to diffusion quantization by leveraging the generative process's own variance parameters. The performance retention at 4-bit weights is remarkable compared to prior baselines.

⚙️ Technical Details

Problem Definition

Setting: Post-training quantization of a pre-trained noise prediction network epsilon_theta(x_t, t) within a diffusion model (DDPM/LDM)

Inputs: Noisy latent x_t and timestep t

Outputs: Estimated noise epsilon_theta (corrected for quantization bias/variance)

Pipeline Flow

Step-Aware Bitwidth Selection (Inference Step t) -> Quantized Noise Prediction -> Correlated Noise Correction -> Bias Correction -> Variance Schedule Calibration -> Denoising Step
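The two correction stages in this pipeline can be sketched on synthetic data. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, k is estimated here by ordinary least squares, and a single scalar k and bias are used, whereas a real calibration would presumably collect these statistics per timestep.

```python
import numpy as np

def estimate_noise_stats(eps_fp, eps_q):
    """Fit delta = k * eps_fp + residual; return k and the residual's mean and variance."""
    x = np.ravel(eps_fp)
    y = np.ravel(eps_q) - x                      # total quantization noise
    k = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    residual = y - k * x                         # uncorrelated part
    return float(k), float(residual.mean()), float(residual.var())

def correct_noise(eps_q, k, mu_q):
    """Subtract the residual bias, then rescale by 1 / (1 + k)."""
    return (eps_q - mu_q) / (1.0 + k)

# Synthetic stand-ins: true k = 0.2, residual mean 0.05, residual std 0.1.
rng = np.random.default_rng(0)
eps_fp = rng.standard_normal(100_000)
eps_q = 1.2 * eps_fp + 0.05 + 0.1 * rng.standard_normal(100_000)

k, mu_q, var_q = estimate_noise_stats(eps_fp, eps_q)
eps_corrected = correct_noise(eps_q, k, mu_q)
print(k, mu_q, var_q)                            # close to 0.2, 0.05, 0.01
print(np.mean(eps_corrected - eps_fp))           # close to 0
```

After correction the remaining error is zero-mean with variance var_q / (1 + k)^2, which is the quantity handed to the variance schedule calibration step.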

System Modules

Step-Aware Mixed Precision Selector

Select optimal activation bitwidth for current timestep t based on SNR requirements

Model or implementation: Lookup Table
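A lookup table of this kind might look like the sketch below. The function name, the step indexing (0 is the first denoising step) and the split point are hypothetical; the paper derives its split from SNR analysis, which is not reproduced here.

```python
def build_bitwidth_table(num_steps, num_high_steps, low_bits=4, high_bits=8):
    """Map each sampling step to an activation bitwidth.

    The last `num_high_steps` steps (the later denoising steps) get `high_bits`;
    all earlier steps get `low_bits`.
    """
    if not 0 <= num_high_steps <= num_steps:
        raise ValueError("num_high_steps must lie between 0 and num_steps")
    switch = num_steps - num_high_steps
    return [low_bits if step < switch else high_bits for step in range(num_steps)]

# Hypothetical split: 20 sampling steps, the final 10 at 8-bit activations.
table = build_bitwidth_table(num_steps=20, num_high_steps=10)
print(table[0], table[19])  # 4 8
```

At inference the sampler would index the table with the current step and load the matching activation quantization parameters before calling the network.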

Quantized Noise Prediction Network

Predict noise using quantized weights and activations

Model or implementation: Quantized U-Net (LDM-4/LDM-8)

Noise Correction Block

Correct mean deviation and correlated variance in predicted noise

Model or implementation: Analytical Scaling & Subtraction

Variance Calibrator

Adjust diffusion variance schedule to absorb uncorrelated quantization noise

Model or implementation: Analytical Formula
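One way to write such a formula, assuming the standard DDPM update x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t) + sigma_t * z, is sketched below. The function and argument names are illustrative, and the exact form used in the paper (its Eq. 12) should be checked against the source.

```python
import math

def calibrate_sigma(sigma_t, beta_t, alpha_t, alpha_bar_t, k, var_q):
    """Shrink sigma_t so that residual quantization noise plus sampling noise
    together have the originally scheduled variance sigma_t ** 2."""
    # Squared coefficient that multiplies the predicted noise in the DDPM update.
    coef = beta_t ** 2 / (alpha_t * (1.0 - alpha_bar_t))
    # Variance injected by the residual noise after division by (1 + k).
    extra = coef * var_q / (1.0 + k) ** 2
    remaining = sigma_t ** 2 - extra
    # If the schedule has too little variance to absorb the noise, clamp to zero.
    return math.sqrt(remaining) if remaining > 0.0 else 0.0

# Illustrative values only.
sigma = calibrate_sigma(sigma_t=math.sqrt(0.02), beta_t=0.02, alpha_t=0.98,
                        alpha_bar_t=0.5, k=0.2, var_q=0.01)
print(sigma < math.sqrt(0.02))  # True
```

With sigma_t = 0, as in deterministic DDIM sampling, the function can only return 0, so there is no scheduled variance left to absorb the residual noise.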

Novel Architectural Elements

Integration of quantization noise statistics directly into the diffusion sampling equations (Eq. 9, Eq. 12)
Step-aware bitwidth allocation strategy that maps denoising timesteps to activation precision levels based on SNR analysis

Modeling

Base Model: Latent Diffusion Models (LDM-4, LDM-8)

Training Method: Post-Training Quantization (BRECQ for calibration)

Key Hyperparameters:

calibration_samples: 1024
weight_bits: 4
activation_bits: 4 or 8 (Mixed Precision)
quantization_scheme: Uniform Quantization

Compute: Inference speedup of 2.03x (W8A8) to 3.34x (W4A4) on an RTX 3090, relative to FP32

Comparison to Prior Work

vs. Q-Diffusion: PTQD explicitly disentangles and corrects quantization noise components (correlated/uncorrelated) and adjusts the diffusion variance schedule, whereas Q-Diffusion only calibrates weights/activations
vs. Q-Diffusion: PTQD uses dynamic bitwidths per timestep (Step-aware Mixed Precision) to handle SNR collapse, while Q-Diffusion uses static mixed precision or fixed bitwidths
vs. PTQ4DM: PTQD targets aggressive 4-bit quantization, whereas PTQ4DM focuses on 8-bit

Limitations

Variance Schedule Calibration cannot absorb the uncorrelated quantization noise when the sampling variance is zero (deterministic sampling such as DDIM with eta=0)
Mixed precision scheme introduces slight overhead to switch activation quantization parameters during inference
Requires generating 1024 samples to pre-compute the noise statistics, a one-time but non-negligible cost
Performance gains in latency depend on hardware support for mixed-precision operations

Reproducibility

Code: https://github.com/ziplab/PTQD

Code is publicly available. Method requires a calibration phase to generate 1024 samples and collect statistics before deployment. Uses standard BRECQ and AdaRound implementations for the underlying quantization optimization.

📊 Experiments & Results

Evaluation Setup

Image synthesis on standard benchmarks using pre-trained Latent Diffusion Models

Benchmarks:

ImageNet: class-conditional image generation (256x256)
LSUN-Bedrooms: unconditional image generation (256x256)
LSUN-Churches: unconditional image generation (256x256)

Metrics:

FID (Fréchet Inception Distance)
sFID (Spatial FID)
IS (Inception Score)
BOPs (Bit Operations)
Statistical methodology: Normality tests (D'Agostino and Pearson) used to verify Gaussian assumption of quantization noise
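A check of this kind can be run with SciPy, whose `scipy.stats.normaltest` implements the D'Agostino and Pearson test. The sketch below is only an illustration on synthetic data; the helper name `looks_gaussian` and the 0.05 threshold are assumptions, not details from the paper.

```python
import numpy as np
from scipy import stats

def looks_gaussian(samples, alpha=0.05):
    """Return (not_rejected, p_value) for the D'Agostino-Pearson normality test."""
    _, p_value = stats.normaltest(np.ravel(samples))
    return bool(p_value >= alpha), float(p_value)

rng = np.random.default_rng(0)
gaussian_like = rng.standard_normal(5000)    # stand-in for residual quantization noise
skewed = rng.exponential(size=5000)          # clearly non-Gaussian control

print(looks_gaussian(gaussian_like))
print(looks_gaussian(skewed))                # normality is rejected for the control
```

A large p-value only means normality is not rejected at the chosen level; it does not prove that the noise is Gaussian.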

Key Results

Main comparison on ImageNet (W4A8, baseline is the full-precision model) shows PTQD maintains near-full-precision quality even at aggressive 4-bit weight settings, while baselines degrade.
Benchmark	Metric	Baseline	This Paper	Δ
ImageNet 256x256	FID	5.05	5.11	+0.06
ImageNet 256x256	BOPs (T)	102.21	5.12	-97.09

Mixed Precision (MP) experiments demonstrate PTQD's robustness in low-bit regimes where baselines fail.
Benchmark	Metric	Baseline	This Paper	Δ
ImageNet 256x256	FID	9.97	6.44	-3.53
LSUN-Bedrooms	FID	5.75	5.49	-0.26
LSUN-Churches	FID	218.59	17.99	-200.60
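The BOPs figures can be sanity-checked with a few lines of arithmetic, using the definition BOPs = MACs × weight_bits × activation_bits. The helper `bops` is illustrative. The gap between the uniform-layer ratio and the reported model-level ratio presumably reflects operations that are not run at W4A8, which is an assumption rather than something stated in this summary.

```python
def bops(macs, weight_bits, activation_bits):
    """Bit operations for a layer: MACs x weight bitwidth x activation bitwidth."""
    return macs * weight_bits * activation_bits

# A single layer quantized uniformly from 32-bit to W4A8.
uniform_reduction = bops(1, 32, 32) / bops(1, 4, 8)
print(uniform_reduction)                 # 32.0

# Model-level values from the ImageNet rows of the table.
reported_reduction = 102.21 / 5.12
print(round(reported_reduction, 2))      # 19.96
print(round(102.21 - 5.12, 2))           # 97.09
```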

Experiment Figures

Visual comparison of samples generated by Full Precision, Q-Diffusion (W4A8), and PTQD (W4A8) on CelebA-HQ

Scatter plots showing the correlation between quantization noise and the original signal at different timesteps

Signal-to-Noise Ratio (SNR) of the quantized noise prediction network across timesteps for different bitwidths

Main Takeaways

Correlated quantization noise is a major source of error; correcting it analytically (via correlation coefficient) significantly improves sample quality
Step-aware mixed precision is critical for 4-bit diffusion; simply applying static quantization leads to SNR collapse in later denoising steps
The proposed Variance Schedule Calibration effectively 'hides' quantization noise within the diffusion process's inherent stochasticity, provided the original schedule allows it
PTQD makes 4-bit weight quantization viable for high-fidelity diffusion models, offering ~20x bit-operation reduction with negligible visual degradation

📚 Prerequisite Knowledge

Prerequisites

Understanding of Diffusion Models (forward/reverse process)
Basics of Neural Network Quantization (uniform quantization, bitwidth)
Statistical concepts (mean, variance, correlation)

Key Terms

PTQ: Post-Training Quantization—compressing a model to lower precision (e.g., 8-bit) using a small calibration set without full retraining

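SNR: Signal-to-Noise Ratio (the ratio of useful signal power to noise power); for a quantized noise prediction network it is the full-precision output power relative to quantization noise power, and it drops in later denoising steps, making them fragile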

FID: Fréchet Inception Distance—a metric for evaluating generated image quality by comparing feature distributions of real and generated images

sFID: Spatial FID—a variant of FID that is more sensitive to spatial structure and coherence

BOPs: Bit Operations—a measure of computational cost calculated as MACs × weight_bits × activation_bits

Mixed Precision: Using different levels of precision (e.g., 4-bit vs 8-bit) for different parts of the model or different steps in the process

LDM: Latent Diffusion Model—a diffusion model that operates in a compressed latent space (e.g., Stable Diffusion)

Correlation Coefficient (k): A statistic measuring the linear relationship between the quantization noise and the original full-precision signal