PTQ: Post-Training Quantization—quantizing a model after it has been trained, without requiring a full re-training process.
QAT: Quantization-Aware Training—simulating quantization during training so the weights adapt to it, typically requiring access to the full training dataset.
Rounding-to-nearest (RTN): The standard approach of rounding a continuous value to the closest integer grid point.
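A minimal sketch of RTN uniform quantization: divide by the step size, round to the nearest integer, and multiply back (the function name and step size here are illustrative, not from the source).

```python
import numpy as np

def rtn_quantize(w, scale):
    """Round-to-nearest uniform quantization: scale down, round, de-quantize."""
    q = np.round(w / scale)   # snap to the nearest integer grid point
    return q * scale          # de-quantized value on the grid

w = np.array([0.24, -0.51, 1.3])
print(rtn_quantize(w, scale=0.5))  # each entry snaps to the nearest multiple of 0.5
```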
AdaRound: A state-of-the-art PTQ method that learns whether to round each weight up or down via an additive perturbation, which restricts each quantized weight to one of its two nearest grid points.
Element-wise division: Dividing each element of a matrix by the corresponding element of another matrix (or a broadcasted scalar).
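In NumPy terms, this is just the `/` operator, which also handles the broadcast-scalar case mentioned above:

```python
import numpy as np

W = np.array([[2.0, 4.0],
              [6.0, 8.0]])
S = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(W / S)    # element-wise (Hadamard) division: each W[i, j] / S[i, j]
print(W / 2.0)  # the scalar 2.0 is broadcast to every element of W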
Frobenius norm: The square root of the sum of the absolute squares of the elements of a matrix, used here as a distance metric for reconstruction error.
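The Frobenius norm written out directly, checked against NumPy's built-in (`np.linalg.norm` with `ord="fro"`):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 4.0]])

# Square root of the sum of squared element magnitudes.
fro = np.sqrt(np.sum(np.abs(A) ** 2))
print(fro)  # 5.0, since sqrt(9 + 16) = 5

assert np.isclose(fro, np.linalg.norm(A, ord="fro"))
```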
Block-wise reconstruction: Optimizing quantization parameters to minimize the error between the output of a quantized block of layers and the original full-precision block.
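A toy sketch of block-wise reconstruction, assuming a single linear "block" and synthetic calibration data: it searches for the quantization step size that minimizes the Frobenius-norm error between the quantized and full-precision block outputs. Real methods learn richer parameters (e.g. per-weight rounding) by gradient descent; the 1-D grid search here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))   # hypothetical calibration activations
W = rng.normal(size=(16, 16))   # full-precision weights of one block

def quantize(W, s):
    """RTN quantization with step size s."""
    return np.round(W / s) * s

def recon_error(s):
    """Frobenius-norm error between quantized and full-precision outputs."""
    return np.linalg.norm(X @ quantize(W, s) - X @ W)

# Block-wise reconstruction reduced to a 1-D search over the step size.
candidates = np.linspace(0.01, 0.5, 50)
best_s = min(candidates, key=recon_error)
print(best_s, recon_error(best_s))
```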