Scaling Laws for Post Training Quantized Large Language Models

📝 Paper Summary

Post-Training Quantization (PTQ), Neural Scaling Laws, Efficient Inference

This study establishes empirical scaling laws for post-training quantization and proposes a Random Forest regressor that accurately predicts the inference loss of quantized LLMs using features of the local loss landscape.

Core Problem

While pre-training follows predictable scaling laws, the quality of Large Language Models (LLMs) after post-training quantization (PTQ) is highly unpredictable and often requires expensive trial-and-error validation.

Why it matters:

PTQ introduces significant uncertainty, obscuring the return on investment for deploying compressed models
Finding the optimal quantization format and model size under fixed constraints is time- and compute-intensive because the search space is vast
Current approaches lack practical guidance on how model size, data format, and optimization algorithms interact to affect final model quality

Concrete Example: A developer attempting to compress opt-1.3b might find that GPTQ optimization greatly improves performance for the mxint3_128 format, but only marginally improves 6-bit, 4-bit, or 2-bit formats, a non-monotonic behavior that is difficult to predict without running the full process.

Key Novelty

Predictive Statistical Model for PTQ Quality

Identifies that larger models have systematically flatter local loss landscapes, which dictates their sensitivity to quantization noise
Constructs a statistical regressor (Random Forest) that takes model properties (size, pre-trained loss) and quantization specifics (SQNR, format) as input and predicts the final quantized loss without full evaluation
Establishes a 'Pareto frontier' quantifying the trade-off between model size and bit precision across multiple LLM families

Architecture

Visualization of scaling laws (Left) and the typical local loss landscape (Right).

Evaluation Highlights

Demonstrates that the Random Forest model can predict post-quantization loss for unseen model families (Pythia-1b, MPT-7b) using scaling laws derived from separate families (GPT-2, Llama, etc.)
Identifies a specific Signal-to-Quantization-Noise Ratio (SQNR) window (~20 dB) where the GPTQ algorithm is most effective due to the 'step-like' nature of radial loss profiles
Validates scaling laws across 5 diverse LLM families (GPT-2, OPT, BLOOM, Llama 2, Llama 3) and 36 distinct Microscaling (MX) data formats

Breakthrough Assessment

7/10

Provides a significant methodological advance by turning PTQ from a trial-and-error process into a predictable one governed by scaling laws, though specific performance improvements on benchmarks are not the primary focus.

⚙️ Technical Details

Problem Definition

Setting: Predicting the Negative Log-Likelihood (NLL) of a quantized LLM based on its pre-trained characteristics and quantization parameters

Inputs: Pre-trained weights w*, Quantizer Q, target Data Format params (precision P, block size K)

Outputs: Predicted NLL of the quantized model NLL(Q(w*))

Pipeline Flow

Feature Extraction (Compute NLL, SQNR, Gradient/Hessian properties of pre-trained model)
RTN Quantization (Initial perturbation)
GPTQ Optimization (Optional iterative refinement using inverse Hessian)
Performance Prediction (Random Forest predicts final NLL)
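The RTN step and the SQNR feature from the flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a simplified MX-like integer format in which each block of `block_size` weights shares one power-of-two scale, and the names `rtn_block`, `rtn_quantize`, and `sqnr_db` are hypothetical. SQNR is computed here as 10*log10(signal power / noise power).

```python
import math

def rtn_block(block, bits):
    """Round-to-nearest onto signed `bits`-bit integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # largest positive integer code
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)                      # all-zero block needs no scale
    # Smallest power-of-two scale with amax / scale <= qmax (assumed scale rule).
    scale = 2.0 ** math.ceil(math.log2(amax / qmax))
    # Note: Python's round() breaks exact ties toward the even integer.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return [c * scale for c in codes]           # dequantized values

def rtn_quantize(weights, bits, block_size):
    """Apply rtn_block to consecutive blocks of `block_size` weights."""
    out = []
    for start in range(0, len(weights), block_size):
        out.extend(rtn_block(weights[start:start + block_size], bits))
    return out

def sqnr_db(original, quantized):
    """SQNR in dB: 10 * log10(signal power / quantization-noise power)."""
    signal = sum(w * w for w in original)
    noise = sum((w - q) ** 2 for w, q in zip(original, quantized))
    if noise == 0.0:
        return math.inf                         # no quantization error at all
    return 10.0 * math.log10(signal / noise)

# Hypothetical weights: four small values followed by a block with an outlier.
weights = [0.1, -0.4, 0.25, 0.9, 8.0, -3.0, 1.5, 0.0]
for block_size in (4, 8):
    quantized = rtn_quantize(weights, bits=4, block_size=block_size)
    print(block_size, round(sqnr_db(weights, quantized), 2))
```

In this toy input the smaller block size gives a higher SQNR, because the outlier 8.0 no longer forces a coarse scale onto the small weights in the first block.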

System Modules

Feature Extractor

Measures intrinsic model properties and extrinsic quantization stats

Predictive Model

Predicts the final quantized loss using extracted features

Model or implementation: Random Forest Regressor (120 estimators, depth 8)

Modeling

Base Model: Random Forest Regressor (used to predict LLM performance)

Training Method: Supervised Regression on collected scaling data

Key Hyperparameters:

n_estimators: 120
max_depth: 8
gptq_dampening_factor_grid: {10^-3, ..., 10^4}
calibration_sequences: 128
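As a rough sketch of how a regressor with these two hyperparameters could be set up, the snippet below uses scikit-learn's `RandomForestRegressor`. The summary does not say which library the authors used, so scikit-learn is an assumption, and the feature columns and the synthetic target are hypothetical placeholders rather than the paper's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical feature table: one row per (model, data format) pair.
# Columns (illustrative only): log10(params), pre-trained NLL, RTN SQNR (dB), bits.
n_rows = 200
X = np.column_stack([
    rng.uniform(8.0, 10.0, n_rows),
    rng.uniform(2.0, 3.5, n_rows),
    rng.uniform(5.0, 40.0, n_rows),
    rng.integers(2, 9, n_rows),
])
# Synthetic target: pre-trained NLL plus a penalty that shrinks as SQNR grows.
y = X[:, 1] + 2.0 * np.exp(-X[:, 2] / 10.0) + rng.normal(0.0, 0.01, n_rows)

# Hyperparameters as reported in the summary: 120 estimators, depth 8.
model = RandomForestRegressor(n_estimators=120, max_depth=8, random_state=0)
model.fit(X[:160], y[:160])

predicted_nll = model.predict(X[160:])
print("mean abs error:", float(np.mean(np.abs(predicted_nll - y[160:]))))
```

The split here is a plain row split for brevity. The paper instead holds out entire model families (Pythia, MPT), which is a stricter test of generalization.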

Compute: Not reported in the paper

Comparison to Prior Work

vs. Kaplan et al.: This work extends scaling laws to the *post-training quantization* phase, incorporating extrinsic factors like data format and PTQ algorithms

Limitations

The predictive model relies on Hessian and loss-landscape features, which require extra computation (backward passes) to extract
Study focuses on specific families (GPT-2, OPT, etc.) and may need recalibration for architectures with vastly different loss landscapes
Validates mainly on WikiText-2 NLL; how downstream task performance scales is not explored in depth

Reproducibility

Code is not provided. The paper details the specific features and hyperparameters (120 estimators, depth 8) for the Random Forest model. Five LLM families were used to train the predictor, and two held-out families (Pythia, MPT) were used for testing.

📊 Experiments & Results

Evaluation Setup

Evaluation of quantized LLMs on next-token prediction

Benchmarks:

WikiText-2 (language modeling, next-token prediction)

Metrics:

Negative Log-Likelihood (NLL)
Signal-to-Quantization-Noise Ratio (SQNR)
Statistical methodology: Not explicitly reported in the paper
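For readers less familiar with the NLL metric, here is a minimal sketch of how it is typically computed for next-token prediction. It assumes NLL is averaged per token and measured in nats, which the summary does not state; the function name `mean_nll` and the probabilities are hypothetical.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood (nats) of the correct next tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.5, 0.25, 0.125]
nll = mean_nll(probs)
perplexity = math.exp(nll)   # perplexity is the exponential of the mean NLL
print(nll, perplexity)
```

Lower NLL means the model puts more probability on the tokens that actually occur, so a quantized model is judged by how little its NLL rises above the pre-trained value.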

Experiment Figures

Comparison of RTN vs GPTQ performance on OPT models and radial loss profiles.

Scatter plot of Predicted NLL vs True NLL for held-out models (Pythia-1b, MPT-7b).

Main Takeaways

Empirical Scaling: Quantization quality follows predictable scaling laws based on model size, data format, and local loss landscape flatness.
Loss Landscape: Larger models have systematically flatter local loss landscapes, making them generally more robust to quantization noise.
Predictability: A Random Forest regressor can accurately predict the post-GPTQ loss of unseen model families (Pythia-1b, MPT-7b) using features like RTN SQNR and radial slope.
GPTQ Effectiveness: The benefit of GPTQ is non-monotonic and highly dependent on whether the quantization noise falls within a specific 'step-like' region of the loss landscape (typically < 20 dB SQNR).
Pareto Frontier: The study establishes a Pareto frontier for quantization, helping determine the optimal trade-off between model size and bit precision.
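The idea of a radial loss profile and landscape flatness in the takeaways above can be illustrated with a toy example. The sketch below measures the mean loss increase at distance r from the optimum along random unit directions. It is an assumption-laden illustration: the summary does not give the paper's exact definition of the radial profile or radial slope, the function names are hypothetical, and the isotropic quadratic loss is a stand-in that does not reproduce the 'step-like' shape reported for real LLMs.

```python
import math
import random

def radial_profile(loss_fn, w_star, radii, n_directions=16, seed=0):
    """Mean loss increase at distance r from w_star over random unit directions."""
    rng = random.Random(seed)
    base = loss_fn(w_star)
    directions = []
    for _ in range(n_directions):
        d = [rng.gauss(0.0, 1.0) for _ in w_star]
        norm = math.sqrt(sum(x * x for x in d))
        directions.append([x / norm for x in d])   # normalize to unit length
    profile = []
    for r in radii:
        increases = [
            loss_fn([w + r * x for w, x in zip(w_star, d)]) - base
            for d in directions
        ]
        profile.append(sum(increases) / len(increases))
    return profile

def make_quadratic(curvature):
    """Toy loss 0.5 * curvature * ||w||^2 with its minimum at the origin."""
    return lambda w: 0.5 * curvature * sum(x * x for x in w)

w_star = [0.0] * 8
radii = [0.1, 0.2, 0.4]
sharp = radial_profile(make_quadratic(4.0), w_star, radii)
flat = radial_profile(make_quadratic(1.0), w_star, radii)
print("sharp:", sharp)
print("flat: ", flat)
```

For the same perturbation radius, the flatter landscape shows a smaller loss increase, which is the sense in which a flatter landscape tolerates more quantization noise.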

📚 Prerequisite Knowledge

Prerequisites

Post-Training Quantization (PTQ)
Signal-to-Noise Ratio (SNR)
Hessian Matrix and Loss Landscapes
Random Forest Regression

Key Terms

PTQ: Post-Training Quantization—compressing a trained model's weights to lower precision without re-training from scratch

SQNR: Signal-to-Quantization-Noise Ratio—a measure (in dB) of the relative magnitude of quantization error compared to the original weight magnitude

GPTQ: Generative Pre-trained Transformer Quantization—an algorithm that compresses LLM weights by minimizing layer-wise reconstruction error using second-order (Hessian) information

RTN: Round-to-Nearest—a simple quantization baseline that rounds weights to the nearest representable value in the target format

Microscaling (MX): A data format specification (e.g., OCP Microscaling) where blocks of elements share a common scale factor to allow efficient low-precision representation

NLL: Negative Log-Likelihood—a loss metric used to evaluate the quality of language model predictions (lower is better)

Pareto frontier: The set of optimal trade-offs where no improvement in one metric (e.g., model size) is possible without degrading another (e.g., loss)
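A small sketch can make the Pareto frontier definition concrete. Given candidate configurations described by (size, loss), it keeps only those that no other candidate beats on both metrics. The function name `pareto_frontier` and the data points are hypothetical and are not results from the paper.

```python
def pareto_frontier(points):
    """Keep points not dominated in (size, loss); lower is better for both."""
    frontier = []
    best_loss = float("inf")
    # Sorted by size first, so a point survives only if it lowers the best loss.
    for size, loss, label in sorted(points):
        if loss < best_loss:
            frontier.append((size, loss, label))
            best_loss = loss
    return frontier

# Hypothetical (memory footprint in GB, NLL, label) for quantized configurations.
configs = [
    (0.5, 3.40, "A"),
    (1.0, 3.00, "B"),
    (1.2, 3.20, "C"),   # dominated by B: larger and worse
    (2.0, 2.80, "D"),
    (2.5, 2.90, "E"),   # dominated by D: larger and worse
]
for size, loss, label in pareto_frontier(configs):
    print(label, size, loss)
```

Here A, B, and D form the frontier: each one is the lowest-loss option at or below its size, so moving between them trades size against loss.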