Evaluation Setup
Quantize intermediate training checkpoints and measure performance degradation relative to the unquantized model.
Benchmarks:
- Validation Loss (Language Modeling)
- 12 standard downstream-task benchmarks (ARC, HellaSwag, MMLU, etc.)
Metrics:
- Relative Cross-Entropy Loss: (CE_quant / CE_orig) - 1
- Relative Accuracy Drop: (Acc_orig - Acc_quant) / (1 - Acc_orig)
- Statistical methodology: Not explicitly reported in the paper
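The two relative metrics above are simple ratios; a minimal sketch of how they could be computed (function names are my own, not from the paper):

```python
def relative_ce_loss(ce_quant: float, ce_orig: float) -> float:
    """Relative increase in cross-entropy loss after quantization."""
    return ce_quant / ce_orig - 1.0


def relative_accuracy_drop(acc_orig: float, acc_quant: float) -> float:
    """Accuracy drop normalized by the original model's headroom to 100%,
    so a drop on a near-saturated benchmark counts for more."""
    return (acc_orig - acc_quant) / (1.0 - acc_orig)
```

Note the normalization in the accuracy metric: losing 5 points from 80% accuracy (headroom 20 points) is scored as a 25% relative drop, not 5%.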
Main Takeaways
- Quantization error trajectories diverge from validation loss curves specifically when the learning rate decays; during stable high-LR phases, quantization error remains flat even as tokens increase.
- Controlled experiments with WSD (Warmup-Stable-Decay) schedules show that models can be trained for longer (more tokens) without increasing quantization error, provided the learning rate is kept high.
- This refutes previous scaling laws (Kumar et al., 2024; Ouyang et al., 2024) which posited that data scale itself causes brittleness, suggesting those results were confounded by cosine decay schedules.
- Weight averaging (Model Soups) is highly effective for robustness: a soup of checkpoints often has lower quantization error than any individual ingredient.
- Post-pretraining stages affect robustness differently: Context Extension improves robustness, while Mid-Training amplifies error. Alignment (SFT/APO) generally reduces degradation.
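The WSD schedule referenced in the takeaways separates the stable high-LR phase (where quantization error stays flat) from the decay phase (where it grows). A minimal sketch of such a schedule, assuming linear warmup and linear decay and illustrative fractions not taken from the paper:

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.1,
           min_lr: float = 0.0) -> float:
    """Warmup-Stable-Decay (WSD) learning-rate schedule:
    linear warmup, constant plateau at peak_lr, then linear decay."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:                      # warmup phase
        return peak_lr * step / max(warmup_steps, 1)
    if step < stable_end:                        # stable high-LR phase
        return peak_lr
    # decay phase: linear interpolation from peak_lr down to min_lr
    progress = (step - stable_end) / max(decay_steps, 1)
    return peak_lr + (min_lr - peak_lr) * progress
```

Extending the stable plateau (training on more tokens before `stable_end`) is exactly the controlled manipulation the paper uses to show that token count alone does not increase quantization error.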
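The "model soup" in the weight-averaging takeaway is a uniform elementwise average of checkpoint parameters. A minimal stdlib-only sketch, with checkpoints represented as dicts of float lists rather than real tensors:

```python
def make_soup(state_dicts: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Uniform model soup: elementwise average of matching parameters
    across checkpoints. All checkpoints must share keys and shapes."""
    n = len(state_dicts)
    return {
        key: [sum(sd[key][i] for sd in state_dicts) / n
              for i in range(len(state_dicts[0][key]))]
        for key in state_dicts[0]
    }
```

The claim in the notes is that quantizing this averaged model often yields lower quantization error than quantizing any single checkpoint that went into the average.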