Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

📝 Paper Summary

Model Compression, Generative AI

PCR improves the compression of text-to-image models by accounting for error accumulation during the multi-step generation process and selectively keeping sensitive steps at higher precision.

Core Problem

Existing quantization methods for diffusion models ignore how errors accumulate across multiple denoising steps and fail to account for the specific sensitivity of text-to-image models to different timesteps.

Why it matters:

Large diffusion models like Stable Diffusion XL are computationally expensive, making deployment on consumer hardware difficult without compression
Current evaluation metrics (standard FID on COCO) are inaccurate for large-scale text-to-image models due to data distribution gaps, potentially blocking progress in the field
Previous quantization approaches result in significant degradation of image fidelity or text-image alignment because they treat all timesteps equally

Concrete Example: When quantizing Stable Diffusion XL to 8-bit using previous methods like Q-diffusion, the model loses the ability to match textual semantics (e.g., generating a generic scene instead of the specific prompt description), whereas the proposed method maintains alignment.

Key Novelty

Progressive Calibration and Activation Relaxing (PCR)

Progressive Calibration: Instead of calibrating every step on inputs from the full-precision model, PCR calibrates timestep t on inputs generated after all preceding denoising steps (t+1 to T) have already been quantized, so the quantization parameters are fitted to the accumulated error they will actually encounter at inference
Activation Relaxing: Identifies that models have specific 'sensitive' steps (early steps for fidelity, later steps for text alignment) and keeps those few steps at higher precision (e.g., 10-bit) while quantizing the rest heavily
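
The progressive calibration idea above can be illustrated with a toy sketch in pure Python. This is not the authors' implementation: `step_fn` is a hypothetical stand-in for one full-precision denoising step, activations are plain scalars, only the step input is quantized, and the max-abs scale rule is an assumption chosen for brevity.

```python
from typing import Callable, Dict, List


def fake_quantize(x: float, scale: float, bits: int = 8) -> float:
    """Symmetric uniform quantize-dequantize of a single value."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(x / scale)))
    return q * scale


def progressive_calibrate(
    step_fn: Callable[[float, int], float],  # hypothetical denoiser step: (x_t, t) -> x_{t-1}
    init_samples: List[float],               # calibration inputs at the first step (t = T)
    num_steps: int,
    bits: int = 8,
) -> Dict[int, float]:
    """Return one activation scale per timestep, calibrated from T down to 1."""
    qmax = 2 ** (bits - 1) - 1
    scales: Dict[int, float] = {}
    xs = list(init_samples)
    for t in range(num_steps, 0, -1):
        # xs was produced with steps t+1..T already quantized,
        # so the scale for step t is fitted to error-carrying inputs.
        max_abs = max(abs(x) for x in xs) or 1.0
        scales[t] = max_abs / qmax
        # Advance with the quantized input so the next (earlier-indexed)
        # timestep is calibrated on the accumulated error as well.
        xs = [step_fn(fake_quantize(x, scales[t], bits), t) for x in xs]
    return scales


# Toy usage: a "denoiser" that halves its input at every step.
scales = progressive_calibrate(lambda x, t: 0.5 * x, [1.0, -2.0, 0.5], num_steps=3)
```

A non-progressive baseline would instead collect `xs` from an unquantized trajectory, which is the difference the summary attributes to PCR.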

Architecture

Overview of the PCR method, illustrating the Progressive Calibration (step-by-step quantization awareness) and Activation Relaxing (mixed precision for sensitive steps).

Evaluation Highlights

First successful quantization of Stable Diffusion XL (3.5B parameters) while maintaining performance, achieving 6.84 FID on QDiffBench compared to 6.78 for the full-precision model
Outperforms the state-of-the-art Q-diffusion method on Stable Diffusion, achieving 8.64 FID (vs. Q-diffusion's 10.96) under W8A8 quantization settings
The proposed activation relaxing strategy improves CLIP Score on Stable Diffusion XL from 0.310 (W8A8) to 0.319 (PCR), matching the full-precision model's 0.319

Breakthrough Assessment

8/10

A strong contribution: the paper is the first to quantize SDXL effectively, and it identifies critical flaws in previous evaluation benchmarks. The progressive calibration idea is theoretically grounded and practically effective.

⚙️ Technical Details

Problem Definition

Setting: Post-training quantization of weights and activations in multi-step diffusion processes

Inputs: Pretrained text-to-image diffusion model (e.g., Stable Diffusion), calibration dataset

Outputs: Quantized model with reduced bit-width weights and activations

Pipeline Flow

Progressive Calibration: Step-by-step quantization from T to 1
Sensitivity Analysis: Identify sensitive steps (fidelity vs. text-match)
Activation Relaxing: Assign higher bits to sensitive steps during inference

System Modules

Progressive Calibrator

Determines quantization parameters for step t using input generated by the model quantized at steps t+1...T

Model or implementation: Same as base model

Relaxation Selector

Allocates higher bit-widths to specific timesteps based on sensitivity analysis

Model or implementation: Heuristic / Analysis
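
As a rough illustration of what the Relaxation Selector outputs, the sketch below builds a per-timestep activation bit-width schedule. It assumes the relaxed steps form one contiguous block at the start or end of sampling, which simplifies the sensitivity analysis described in the summary; the function name, the `relax` argument, and the 50-step sampler are hypothetical.

```python
from typing import List


def relaxed_bit_schedule(
    num_steps: int,
    tau: float,
    base_bits: int = 8,
    relaxed_bits: int = 10,
    relax: str = "early",
) -> List[int]:
    """Return activation bit-widths indexed by execution order (0 = first denoising step)."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    if relax not in ("early", "late"):
        raise ValueError("relax must be 'early' or 'late'")
    k = round(tau * num_steps)           # number of steps kept at higher precision
    bits = [base_bits] * num_steps
    start = 0 if relax == "early" else num_steps - k
    for i in range(start, start + k):
        bits[i] = relaxed_bits
    return bits


# Hypothetical 50-step sampler with 10% of steps relaxed to 10-bit activations.
schedule = relaxed_bit_schedule(50, 0.10, relax="early")
```

At inference, the activation quantizer would look up `schedule[i]` for the i-th step. Per the takeaways in this summary, an "early" block would suit a fidelity-sensitive model and a "late" block a text-alignment-sensitive one.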

Novel Architectural Elements

Progressive calibration loop that updates the calibration data dynamically based on previous quantization errors
Time-wise mixed-precision schedule (Activation Relaxing) applied specifically to diffusion timesteps

Modeling

Base Model: Stable Diffusion (v1.4, v1.5) and Stable Diffusion XL

Training Method: Post-training Quantization (PTQ)

Objective Functions:

Purpose: Minimize quantization error at each timestep t given accumulated error from previous steps.

Formally: minimize ||delta_t||, where delta_t is the difference between the quantized network's output and the full-precision network's output, both evaluated on the same quantized input at timestep t.
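
One way to write this objective in symbols is shown below. The notation is an assumption introduced here, not taken from the paper: \hat{x}_t is the input produced by the already-quantized steps, \epsilon_\theta is the full-precision network, \hat{\epsilon}_\theta is its quantized counterpart, and s_t denotes the quantization parameters for step t. The norm is left unspecified, as in the summary.

```latex
\delta_t = \hat{\epsilon}_\theta(\hat{x}_t, t) - \epsilon_\theta(\hat{x}_t, t),
\qquad
s_t^{*} = \arg\min_{s_t} \lVert \delta_t \rVert
```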

Adaptation: Quantization of weights and activations (W8A8, W4A8)

Training Data:

Uses 128 samples for calibration
Proposed QDiffBench uses 1000 samples for evaluation

Key Hyperparameters:

calibration_samples: 128
relaxation_proportion_tau: < 0.20 (typically 0.05 or 0.10)
relaxed_bitwidth: 16-bit or 10-bit (vs 8-bit base)

Compute: Not reported in the paper

Comparison to Prior Work

vs. PTQ4DM/Q-diffusion: PCR considers accumulated quantization error across timesteps (progressive calibration) instead of treating steps independently.
vs. Q-diffusion: PCR introduces activation relaxing to handle sensitivity discrepancy between image fidelity and text alignment.
vs. Standard Evaluation: PCR proposes QDiffBench to fix the distribution-gap issues in the COCO-based evaluation used by prior work.

Limitations

Relaxing activation bits slightly increases computational cost compared to uniform quantization (though the authors claim the overhead is negligible)
Requires a sequential calibration process, which may be slower than parallel calibration strategies
Evaluation limited to Stable Diffusion architectures; applicability to other diffusion types (e.g., video) not tested
Benchmark relies on FID which has known limitations in assessing aesthetic quality
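
To put the first limitation in perspective, the arithmetic below estimates the average activation bit-width across timesteps for a given relaxation proportion. Average bit-width is only a crude proxy assumed here for illustration; it is not a latency or energy measurement from the paper.

```python
def average_activation_bits(tau: float, base_bits: int = 8, relaxed_bits: int = 10) -> float:
    """Mean activation bit-width when a fraction tau of timesteps is relaxed."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    return (1.0 - tau) * base_bits + tau * relaxed_bits


# With 10% of steps at 10-bit and the rest at 8-bit, the mean is roughly 8.2 bits.
avg_bits = average_activation_bits(0.10)
```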

Reproducibility

Code has not yet been released. Algorithm 1 and the reported hyperparameters (calibration size, relaxation rates) give enough detail for an expert to reimplement the method. The method relies on standard models (Stable Diffusion) and datasets (COCO, QDiffBench prompts).

📊 Experiments & Results

Evaluation Setup

Text-to-Image Generation Quantization

Benchmarks:

QDiffBench (Text-to-Image Generation Evaluation) [New]
COCO (standard) (Text-to-Image Generation Evaluation)

Metrics:

FID (Fréchet Inception Distance)
CLIP Score
Image-Complexity (variation of content)
Statistical methodology: Not explicitly reported in the paper
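
For reference, the FID metric listed above has the standard closed form below, where \mu_r, \Sigma_r and \mu_g, \Sigma_g are the mean and covariance of Inception features for the reference and generated image sets. Which images serve as the reference set in QDiffBench is not detailed in this summary.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```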

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Results on Stable Diffusion XL (SDXL) showing PCR's ability to maintain performance where baselines fail.
QDiffBench (SDXL)	FID	7.94	6.84	-1.10
QDiffBench (SDXL)	CLIP Score	0.310	0.319	+0.009
Results on Stable Diffusion (SD v1.4) demonstrating improvements over Q-diffusion.
QDiffBench (SD v1.4)	FID	10.96	8.64	-2.32
QDiffBench (SD v1.4)	CLIP Score	0.279	0.283	+0.004
Ablation studies validating the components of PCR (Progressive Calibration and Relaxing).
QDiffBench (SDXL)	FID	8.84	6.84	-2.00

Experiment Figures

Visual comparison of activation distributions at timestep 1 between full-precision pipeline and quantized pipeline.

Sensitivity analysis: CLIP Score and FID changes when specific timesteps are kept at full precision.

Main Takeaways

PCR significantly outperforms Q-diffusion on both Stable Diffusion and SDXL, especially in FID scores.
Activation relaxing is critical for SDXL to maintain text-image alignment (CLIP Score), which degrades heavily with standard quantization.
The proposed QDiffBench reveals that previous COCO-based metrics underestimate the degradation caused by quantization due to distribution gaps.
SD is sensitive to image fidelity degradation (early steps), while SDXL is sensitive to text-matching degradation (later steps), requiring different relaxation strategies.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Models (Forward/Reverse process)
Post-training Quantization (PTQ) concepts (calibration, scaling factors)
FID (Fréchet Inception Distance) and CLIP Score metrics

Key Terms

PTQ: Post-training Quantization—compressing a model after training without a full retraining process, usually using a small calibration dataset

QDiffBench: A new benchmark proposed in this paper that evaluates quantized diffusion models using same-domain data for FID and tests generalization on unseen prompts

FID: Fréchet Inception Distance—a metric used to assess the quality of images generated by a generative model by comparing the distribution of generated images to real images

CLIP Score: A metric evaluating how well an image matches its text caption, using the CLIP (Contrastive Language-Image Pre-training) model embeddings
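
At its core, a CLIP-based score is a cosine similarity between an image embedding and a text embedding. The sketch below shows only that similarity on plain Python lists; the embeddings would come from CLIP's image and text encoders, which are not shown. Some CLIP Score definitions additionally clamp or rescale the value, and this summary does not say which variant the paper uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for an image embedding and a text embedding.
score = cosine_similarity([0.2, 0.9, 0.1], [0.1, 0.8, 0.3])
```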

Stable Diffusion XL: A large-scale (3.5B parameter) latent text-to-image diffusion model

W8A8: Quantization setting where Weights are 8-bit and Activations are 8-bit

W4A8: Quantization setting where Weights are 4-bit and Activations are 8-bit
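
To make the WxAy notation concrete, the sketch below quantizes a weight vector and an activation vector to signed integers, accumulates their dot product in integer arithmetic, and rescales the result. It assumes symmetric per-tensor quantization with a max-abs scale, which is a common choice but not necessarily the scheme used in the paper; all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize(values: Sequence[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed integers; returns (ints, scale)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_dot(
    weights: Sequence[float],
    activations: Sequence[float],
    w_bits: int = 8,
    a_bits: int = 8,
) -> float:
    """Dot product with w_bits-bit weights and a_bits-bit activations (W8A8 by default)."""
    qw, sw = quantize(weights, w_bits)
    qa, sa = quantize(activations, a_bits)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa                       # rescale back to real values


w = [0.5, -1.0, 0.25]
a = [2.0, 1.0, -4.0]
w8a8 = quantized_dot(w, a)            # close to the exact value of -1.0
w4a8 = quantized_dot(w, a, w_bits=4)  # coarser weights, typically a larger error
```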

PCR: Progressive Calibration and Activation Relaxing, the authors' proposed method

Activation Relaxing: A mixed-precision strategy where a small subset of sensitive timesteps use higher bit-width (e.g., 10-bit) while others use lower bit-width

COCO: Common Objects in Context—a large-scale object detection, segmentation, and captioning dataset often used for benchmarking

Denoising steps: The iterative steps a diffusion model takes to convert noise into a clear image

Distribution gap: The difference in statistical characteristics between two datasets (e.g., COCO images vs. images generated by Stable Diffusion)

Calibration data: A small set of data used during PTQ to determine quantization parameters (like scaling factors)