Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

📝 Paper Summary

Post-Training Quantization (PTQ) Model Compression

MRECG identifies that loss oscillation in post-training quantization is caused by capacity mismatches between adjacent modules and resolves it by jointly optimizing those specific modules.

Core Problem

Existing PTQ methods suffer from 'oscillation' where reconstruction loss spikes during layer-by-layer or block-by-block optimization, causing irreversible accuracy degradation.

Why it matters:

Oscillation indicates that error accumulation in specific layers is breaking through the model's capacity to recover, leading to severe performance drops in low-bit settings (e.g., 2-bit)
Current methods like BRECQ and AdaRound apply one fixed reconstruction granularity (block-wise or layer-wise) across the whole network, ignoring the structural causes of error spikes
The problem is particularly acute in compact models like MobileNetV2 where depthwise convolutions create significant capacity bottlenecks

Concrete Example: In MobileNetV2, a depthwise convolution layer often has much lower capacity than its adjacent layers. When modules are quantized sequentially, this capacity drop causes the reconstruction loss to spike (oscillate) rather than decrease smoothly. On MobileNetV2x0.5 with 2/4bit quantization, BRECQ ends up 6.61 points of Top-1 accuracy below the proposed method.

Key Novelty

Mixed REConstruction Granularity (MRECG)

Theoretically proves that loss oscillation is caused by the difference in 'Module Capacity' (ModCap) between adjacent layers; small capacity in a later module amplifies error accumulation
Proposes a metric to quantify module capacity using parameter counts, bit-width, and stride scaling (or reconstruction loss for data-dependent scenarios)
Dynamically merges adjacent modules with the largest capacity differences into a single optimization block (joint optimization) to smooth out oscillations

Architecture

The workflow of MRECG, showing capacity estimation, ranking of capacity differences, and the selective joint optimization of modules.

Evaluation Highlights

+6.61% Top-1 accuracy on MobileNetV2x0.5 (2/4bit) compared to BRECQ, achieving 41.16%
+6.19% Top-1 accuracy on MobileNetV2x1.0 (2/4bit) compared to BRECQ, achieving 58.49%
+1.49% Top-1 accuracy on ResNet-50 (2/4bit) compared to BRECQ, achieving 70.04%

Breakthrough Assessment

8/10

Identifies an overlooked failure mode in PTQ (loss oscillation), explains it theoretically, and provides a principled solution that gains up to 6.61 Top-1 points over BRECQ in difficult low-bit settings.

⚙️ Technical Details

Problem Definition

Setting: Post-Training Quantization (PTQ) where a pre-trained model is quantized using a small calibration dataset without full retraining.

Inputs: Pre-trained float32 model weights W and a small calibration dataset X.

Outputs: Quantized model weights and activations maximizing accuracy (minimizing reconstruction loss).

Pipeline Flow

Capacity Estimation (Data-Free or Data-Dependent)
Ranking & Selection of Top-k Capacity Differences
Joint Optimization of Selected Modules

System Modules

Capacity Estimator

Calculates the capacity of each module to identify bottlenecks

Model or implementation: Mathematical formula (Eq. 3 in paper)
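
A minimal sketch of a data-free capacity estimate, assuming capacity is the sum over a module's layers of parameter count times weight bit-width times a stride factor. Eq. 3 is not reproduced in this summary, so the `Layer` type, the `module_capacity` helper, and the rule of applying alpha = 1.6 only to strided layers are illustrative assumptions rather than the paper's exact definition.

```python
from dataclasses import dataclass
from typing import List

ALPHA_STRIDE = 1.6  # stride scaling factor listed under Key Hyperparameters


@dataclass
class Layer:
    """Hypothetical description of one conv layer (not the paper's data structure)."""
    in_channels: int
    out_channels: int
    kernel_size: int
    stride: int = 1
    groups: int = 1
    weight_bits: int = 4

    def num_params(self) -> int:
        # Weight count of a (grouped) 2D convolution, bias ignored.
        return (self.in_channels // self.groups) * self.out_channels * self.kernel_size ** 2


def module_capacity(layers: List[Layer], alpha: float = ALPHA_STRIDE) -> float:
    """Assumed data-free capacity: sum of params * bit-width * stride factor."""
    total = 0.0
    for layer in layers:
        # Assumption: the stride factor applies only to layers with stride > 1.
        stride_factor = alpha if layer.stride > 1 else 1.0
        total += layer.num_params() * layer.weight_bits * stride_factor
    return total


# A depthwise 3x3 conv holds far fewer weights than the pointwise conv after it.
depthwise = Layer(in_channels=96, out_channels=96, kernel_size=3, groups=96, weight_bits=2)
pointwise = Layer(in_channels=96, out_channels=24, kernel_size=1, weight_bits=2)
cap_dw = module_capacity([depthwise])  # 864 params * 2 bits
cap_pw = module_capacity([pointwise])  # 2304 params * 2 bits
```

Under these assumptions the depthwise layer scores well below its pointwise neighbour, which is the kind of bottleneck the estimator is meant to expose.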

Granularity Selector

Decides which adjacent modules should be merged for joint optimization

Model or implementation: Ranking algorithm
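
The ranking step can be sketched as follows, assuming the selector takes one capacity value per module, ranks adjacent pairs by squared capacity difference, and merges the top k pairs. The function names and the choice to fuse chained pairs into one larger unit are assumptions made for illustration, and the capacity values are synthetic.

```python
from typing import List


def select_merge_pairs(capacities: List[float], k: int) -> List[int]:
    """Return indices i of the k adjacent pairs (i, i+1) with the largest squared capacity gap."""
    gaps = [(capacities[i] - capacities[i + 1]) ** 2 for i in range(len(capacities) - 1)]
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return sorted(ranked[:k])


def build_units(num_modules: int, merge_pairs: List[int]) -> List[List[int]]:
    """Group module indices into reconstruction units; chained pairs fuse into one unit."""
    merged = set(merge_pairs)
    units: List[List[int]] = []
    current = [0]
    for i in range(1, num_modules):
        if (i - 1) in merged:
            current.append(i)      # module i is optimized jointly with module i-1
        else:
            units.append(current)  # close the current unit and start a new one
            current = [i]
    units.append(current)
    return units


capacities = [10.0, 2.0, 3.0, 9.0, 2.0]            # synthetic values
pairs = select_merge_pairs(capacities, k=2)         # [0, 3]
units = build_units(len(capacities), pairs)         # [[0, 1], [2], [3, 4]]
```

Each resulting unit would then be handed to the quantizer as a single reconstruction scope.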

Quantizer (BRECQ/QDROP base)

Performs the actual quantization and weight reconstruction on the defined granularity

Model or implementation: BRECQ or QDROP optimization loop

Novel Architectural Elements

Adaptive reconstruction granularity: Unlike fixed layer-wise or block-wise methods, the pipeline dynamically changes optimization boundaries based on capacity analysis.

Modeling

Base Model: ResNet-18, ResNet-50, MobileNetV2 (various width multipliers)

Training Method: Reconstruction via block-wise optimization (AdaRound/BRECQ style) with adaptive granularity

Objective Functions:

Purpose: Minimize capacity mismatch between adjacent modules.

Formally: $\arg\min_{m} \sum_{l} (CM_l - CM_{l+1})^2 \, m_l + \lambda \, (m^\top \mathbf{1} - k)^2$

Purpose: Minimize reconstruction error of the selected block.

Formally: $\mathbb{E}\left[ \lVert f(W, X) - f(\hat{W}, \hat{X}) \rVert_F^2 \right]$
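
The reconstruction objective above can be illustrated with a small pure-Python sketch that replaces the expectation with a mean over calibration samples. The helper names and the nested-list matrix representation are assumptions for illustration; a real pipeline would compute this on tensors inside the BRECQ or QDROP loop.

```python
from typing import List

Matrix = List[List[float]]


def squared_frobenius(a: Matrix, b: Matrix) -> float:
    """Sum of squared element-wise differences, i.e. ||a - b||_F^2."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def reconstruction_loss(fp_outputs: List[Matrix], q_outputs: List[Matrix]) -> float:
    """Empirical mean over samples of ||f(W, X) - f(W_hat, X_hat)||_F^2."""
    if len(fp_outputs) != len(q_outputs) or not fp_outputs:
        raise ValueError("need the same non-zero number of samples on both sides")
    per_sample = [squared_frobenius(a, b) for a, b in zip(fp_outputs, q_outputs)]
    return sum(per_sample) / len(per_sample)


# Two synthetic samples: full-precision outputs vs. quantized outputs of one unit.
fp = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
q = [[[1.0, 2.5], [3.0, 3.0]], [[0.5, 0.0], [0.0, 0.0]]]
loss = reconstruction_loss(fp, q)  # (1.25 + 0.25) / 2 = 0.75
```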

Key Hyperparameters:

batch_size: 256 (ResNet-18/MobileNetV2), 128 (ResNet-50)
alpha_i: 1.6 (stride scaling factor)
number_of_batches: 16
calibration_set_size: 1024 (inferred rather than stated; 16 batches of 64 images would give 1024, which does not match the batch sizes listed above)

Compute: Nvidia Tesla A100 or Intel Xeon Platinum 8336C CPU. Inference time not explicitly reported, but PTQ typically takes minutes.

Comparison to Prior Work

vs. BRECQ: BRECQ uses fixed block definitions; MRECG dynamically merges blocks based on capacity analysis
vs. QDROP: MRECG is compatible with QDROP and adds the granularity optimization on top of the drop strategy
vs. AdaRound: MRECG optimizes larger, variable-sized scopes to handle error accumulation better than layer-wise approaches

Limitations

Data-dependent capacity metric requires a preliminary PTQ pass, adding computational overhead
Merged modules increase the memory/compute cost of the reconstruction step compared to single-layer optimization
Diminishing returns observed when expanding calibration batch size beyond a certain point

Reproducibility

Code: https://github.com/bytedance/MRECG

Code is publicly available on GitHub. Hyperparameters for reconstruction follow standard BRECQ/QDROP settings. Alpha parameter for stride scaling explicitly set to 1.6.

📊 Experiments & Results

Evaluation Setup

ImageNet classification accuracy after quantization

Benchmarks:

ImageNet (ILSVRC-2012) (Image Classification)

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MobileNetV2 results show massive gains in low-bit (2/4bit) settings, validating the theory that capacity bottlenecks (common in MBV2) cause severe oscillation that MRECG fixes.
MobileNetV2x0.5 (ImageNet)	Top-1 Accuracy (2/4bit)	34.55	41.16	+6.61
MobileNetV2x1.0 (ImageNet)	Top-1 Accuracy (2/4bit)	52.30	58.49	+6.19
MobileNetV2x1.0 (ImageNet)	Top-1 Accuracy (3/3bit)	52.03	57.14	+5.11
ResNet results show consistent but smaller gains, likely because ResNet architectures have fewer severe capacity bottlenecks than MobileNetV2.
ResNet-18 (ImageNet)	Top-1 Accuracy (2/4bit)	63.71	65.61	+1.90
ResNet-50 (ImageNet)	Top-1 Accuracy (2/4bit)	68.55	70.04	+1.49
Ablation study demonstrates that both the MRECG strategy and expanding batch size contribute to the final performance.
MobileNetV2x0.5 (ImageNet)	Top-1 Accuracy (4/4bit)	50.83	51.91	+1.08
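
The Δ column is the difference in Top-1 accuracy points between this paper and the baseline. As a quick arithmetic check, the sketch below recomputes it from the Baseline and This Paper columns; the rows are copied from the table above, and the `recompute_deltas` helper is a hypothetical name used only here.

```python
from typing import List, Tuple

Row = Tuple[str, str, float, float, float]  # model, bit setting, baseline, this paper, reported delta

rows: List[Row] = [
    ("MobileNetV2x0.5", "2/4bit", 34.55, 41.16, 6.61),
    ("MobileNetV2x1.0", "2/4bit", 52.30, 58.49, 6.19),
    ("MobileNetV2x1.0", "3/3bit", 52.03, 57.14, 5.11),
    ("ResNet-18", "2/4bit", 63.71, 65.61, 1.90),
    ("ResNet-50", "2/4bit", 68.55, 70.04, 1.49),
    ("MobileNetV2x0.5", "4/4bit", 50.83, 51.91, 1.08),
]


def recompute_deltas(table: List[Row]) -> List[float]:
    """Return this-paper minus baseline for every row, rounded to two decimals."""
    return [round(ours - base, 2) for _, _, base, ours, _ in table]


deltas = recompute_deltas(rows)
# True when every recomputed delta agrees with the reported one.
all_match = all(abs(d - row[4]) < 1e-6 for d, row in zip(deltas, rows))
```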

Experiment Figures

Comparison of reconstruction loss distribution between BRECQ and MRECG on MobileNetV2.

Pareto optimality plot of MRECG compared to random granularity schemes.

Main Takeaways

Loss oscillation is a critical issue in PTQ, especially for architectures with varying module capacities like MobileNetV2
Jointly optimizing capacity-mismatched modules effectively smooths the loss landscape and significantly recovers accuracy in low-bit (e.g., 2-bit) regimes
Expanding calibration batch size helps reduce approximation error, but with diminishing marginal utility
The method is general and can be applied on top of existing PTQ frameworks like BRECQ and QDROP for additive gains

📚 Prerequisite Knowledge

Prerequisites

Understanding of Post-Training Quantization (PTQ) workflows (calibration, reconstruction)
Knowledge of Taylor expansion for loss approximation
Familiarity with standard CNN architectures (ResNet, MobileNet)

Key Terms

PTQ: Post-Training Quantization—compressing a model after training using limited data, without full fine-tuning

ModCap: Module Capacity—a metric defined by the paper to quantify a layer's ability to represent information, based on parameters, bit-width, and stride

Oscillation: The phenomenon where reconstruction loss fluctuates (increases and decreases) significantly across layers/blocks during sequential quantization, rather than decreasing monotonically

Topological Homogeneity: A condition defined by the authors where two modules share hyperparameters (like stride/groups) except kernel size and channels, allowing their capacities to be compared

Mixed Reconstruction Granularity: The strategy of varying the size of the unit being optimized (e.g., merging two blocks) based on capacity differences

BRECQ: Block Reconstruction Quantization—a baseline PTQ method that optimizes reconstruction error within local blocks

QDROP: A PTQ method that randomly drops quantization during reconstruction to flatten the loss landscape

Diminishing marginal utility: An economic concept applied here to calibration batch size; adding more calibration data helps, but the benefit shrinks as the batch size grows
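
The Oscillation entry above describes a loss that rises and falls across modules instead of changing monotonically. One simple way to quantify that, offered as an illustrative assumption rather than the paper's own measure, is to count direction reversals in the per-module loss sequence and record the largest single increase. The loss values below are synthetic.

```python
from typing import List, Tuple


def oscillation_stats(losses: List[float]) -> Tuple[int, float]:
    """Return (number of direction reversals, largest single increase) for a loss sequence."""
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    # A reversal is a sign change between consecutive differences.
    reversals = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    largest_rise = max((d for d in diffs if d > 0), default=0.0)
    return reversals, largest_rise


smooth = [1.0, 2.0, 3.0, 4.0]        # monotone sequence, no reversals
spiky = [1.0, 9.0, 2.0, 10.0, 3.0]   # alternating spikes
smooth_stats = oscillation_stats(smooth)  # (0, 1.0)
spiky_stats = oscillation_stats(spiky)    # (3, 8.0)
```

Comparing these statistics before and after merging modules gives a rough numeric view of whether the loss curve has been smoothed.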