Evaluation Setup
Quantize intermediate training checkpoints and measure performance degradation relative to the unquantized model.
Benchmarks:
- Validation Loss (Language Modeling)
- 12 standard downstream-task benchmarks (ARC, HellaSwag, MMLU, etc.)
Metrics:
- Relative Cross-Entropy Loss: (CE_quant / CE_orig) - 1
- Relative Accuracy Drop: (Acc_orig - Acc_quant) / (1 - Acc_orig)
- Statistical methodology: Not explicitly reported in the paper
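The two relative metrics above are simple ratios; a minimal sketch of how they could be computed (function names are my own, not from the paper):

```python
def relative_ce_loss(ce_quant: float, ce_orig: float) -> float:
    """Relative increase in cross-entropy loss after quantization."""
    return ce_quant / ce_orig - 1.0


def relative_accuracy_drop(acc_orig: float, acc_quant: float) -> float:
    """Accuracy drop normalized by the original model's headroom to 100%,
    so a drop on a near-saturated benchmark counts for more."""
    return (acc_orig - acc_quant) / (1.0 - acc_orig)
```

Note the normalization in the accuracy metric: losing 5 points from 80% accuracy (headroom 20 points) is scored as a 25% relative drop, not 5%.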
Main Takeaways
- Quantization error trajectories diverge from validation loss curves specifically when the learning rate decays; during stable high-LR phases, quantization error remains flat even as tokens increase.
- Controlled experiments with WSD (Warmup-Stable-Decay) schedules show that models can be trained for longer (more tokens) without increasing quantization error, provided the learning rate is kept high.
- This refutes previous scaling laws (Kumar et al., 2024; Ouyang et al., 2024) which posited that data scale itself causes brittleness, suggesting those results were confounded by cosine decay schedules.
- Weight averaging (Model Soups) is highly effective for robustness: a soup of checkpoints often has lower quantization error than any individual ingredient.
- Post-pretraining stages affect robustness differently: Context Extension improves robustness, while Mid-Training amplifies error. Alignment (SFT/APO) generally reduces degradation.
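The WSD schedule referenced in the takeaways separates the stable high-LR phase (where quantization error stays flat) from the decay phase (where it grows). A minimal sketch of such a schedule, assuming linear warmup and linear decay and illustrative fractions not taken from the paper:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay (WSD) learning-rate schedule:
    linear warmup, constant plateau at peak_lr, then linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:                      # warmup phase
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # stable high-LR phase
        return peak_lr
    # decay phase: linear interpolation from peak_lr down to min_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress
```

Extending the stable plateau (training on more tokens before `stable_end`) is exactly the controlled manipulation the paper uses to show that token count alone does not increase quantization error.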
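The "model soup" in the weight-averaging takeaway is a uniform elementwise average of checkpoint parameters. A minimal stdlib-only sketch, with checkpoints represented as dicts of float lists rather than real tensors:

```python
def make_soup(state_dicts: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Uniform model soup: elementwise average of matching parameters
    across checkpoints. All checkpoints must share keys and shapes."""
    n = len(state_dicts)
    return {
        key: [sum(sd[key][i] for sd in state_dicts) / n
              for i in range(len(state_dicts[0][key]))]
        for key in state_dicts[0]
    }
```

The claim in the notes is that quantizing this averaged model often yields lower quantization error than quantizing any single checkpoint that went into the average.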