Evaluation Setup
Pretrain from scratch -> Evaluate FP16 performance (Perplexity/GLUE) -> Quantize to W8A8 -> Evaluate W8A8 Perplexity
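The W8A8 step above quantizes both weights and activations to 8-bit integers. A minimal sketch of why activation outliers matter for this step, assuming symmetric per-tensor int8 fake-quantization (quantize, then dequantize for error analysis); function and variable names here are illustrative, not from the paper:

```python
import numpy as np

np.random.seed(0)

def quantize_int8_symmetric(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 fake-quantization: round to the int8 grid,
    then dequantize so the rounding error can be measured in float."""
    scale = np.abs(x).max() / 127.0  # one scale shared by the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# A single large outlier inflates the shared scale, so the remaining
# (normal-sized) values are rounded on a much coarser grid.
normal = np.random.randn(1024).astype(np.float32)
with_outlier = np.concatenate([normal, np.array([80.0], dtype=np.float32)])

err_normal = np.abs(quantize_int8_symmetric(normal) - normal).mean()
err_outlier = np.abs(quantize_int8_symmetric(with_outlier) - with_outlier).mean()
print(err_normal, err_outlier)
```

This is the mechanism behind the kurtosis and infinity-norm metrics below: the larger the outliers, the coarser the effective quantization grid for everything else.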
Benchmarks:
- GLUE (natural language understanding, via finetuning)
- Wiki-40b validation (language modeling, via perplexity)
Metrics:
- MLM Accuracy
- Perplexity (PPL)
- GLUE Average Score
- Kurtosis (outlier metric)
- Infinity Norm (max activation value)
- Statistical methodology: Not explicitly reported in the paper
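The outlier-related metrics can be computed directly from a tensor of activations; a small sketch (standard definitions, function names are mine, not the paper's):

```python
import numpy as np

np.random.seed(0)

def perplexity(token_nlls: np.ndarray) -> float:
    """PPL = exp(mean negative log-likelihood per token)."""
    return float(np.exp(token_nlls.mean()))

def kurtosis(x: np.ndarray) -> float:
    """Fourth standardized moment: ~3 for a Gaussian, very large when a
    few outlier values dominate the distribution."""
    mu, sigma = x.mean(), x.std()
    return float(((x - mu) ** 4).mean() / sigma ** 4)

def infinity_norm(x: np.ndarray) -> float:
    """Largest absolute activation value."""
    return float(np.abs(x).max())

acts = np.random.randn(4096)
acts_outlier = np.append(acts, 60.0)  # one extreme activation
print(kurtosis(acts))          # near 3 for Gaussian data
print(kurtosis(acts_outlier))  # much larger with a single outlier
```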
Key Results
BERT-base: NCS recovers some FP16 performance compared to CS but still lags vanilla, while maintaining excellent quantization properties.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| GLUE (Avg) | Score | 81.7 | 73.8 | -7.9 |
| BERT-base W8A8 | Perplexity | 4612.6 | 4.95 | -4607.65 |

OPT-125M: NCS fixes the failure of CS on causal language models.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| OPT-125M W8A8 | Perplexity | 21.18 | 18.33 | -2.85 |
| OPT-125M | Kurtosis | 1778.0 | 1104.5 | -673.5 |
Main Takeaways
- Sequence length mismatch between pretraining and finetuning hurts performance of outlier-free models (CS); NCS mitigates this by normalizing invariantly to length.
- Outlier-free pretraining (NCS/CS) enables W8A8 quantization for BERT where vanilla models fail completely.
- For causal models (OPT), standard CS fails because token context lengths vary; NCS fixes this and achieves best-in-class W8A8 perplexity for OPT-125M.
- Scaling limitation: The method works for small models (<350M) but failed to generalize to OPT-350M in initial experiments.
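The sequence-length sensitivity of CS noted above can be illustrated with a sketch. Assuming CS follows the usual clipped-softmax form, clip((ζ-γ)·softmax(x)+γ, 0, 1), uniform attention over L tokens maps each entry to (ζ-γ)/L + γ, which crosses zero at a length-dependent threshold; the γ/ζ values below are illustrative, and the exact NCS normalization is not reproduced here:

```python
import numpy as np

def clipped_softmax(x: np.ndarray, gamma: float = -0.03, zeta: float = 1.03) -> np.ndarray:
    """Clipped softmax: stretch softmax outputs past [0, 1], then clip,
    so attention can reach exact zeros. gamma/zeta are illustrative values."""
    s = np.exp(x - x.max())
    s /= s.sum()
    return np.clip((zeta - gamma) * s + gamma, 0.0, 1.0)

# With uniform logits, each softmax entry is 1/L, so whether the stretched
# value (zeta - gamma)/L + gamma falls below zero depends on the length L:
# the same clip parameters behave differently at different sequence lengths.
for L in (16, 64, 512):
    probs = clipped_softmax(np.zeros(L))
    print(L, probs[0], bool((probs == 0).all()))
```

With γ = -0.03 and ζ = 1.03, uniform attention survives at L = 16 but is clipped to exactly zero at L ≥ 64, which is the kind of length-dependent behavior the takeaways attribute to CS and which a length-invariant normalization would avoid.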