Is It a Free Lunch for Removing Outliers during Pretraining?

📝 Paper Summary

LLM Quantization, Pretraining optimization, Outlier suppression

Outlier-free pretraining using clipped softmax degrades full-precision performance due to sequence length mismatch; a normalized variant (NCS) restores performance and enables effective pretraining for causal models like OPT.

Core Problem

Pretraining with 'clipped softmax' successfully removes outliers for quantization but degrades full-precision (FP16) performance and fails on causal LLMs.

Why it matters:

Outliers in activations/weights are the main bottleneck for quantizing LLMs to low bit-widths (e.g., 8-bit or 4-bit) without accuracy loss
Existing outlier-free pretraining methods (like Clipped Softmax) make models quantization-friendly but hurt their standard performance on downstream tasks
Standard Clipped Softmax is sensitive to sequence length, causing a mismatch between pretraining (fixed length) and inference/finetuning (variable length)
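To make the first point concrete, here is a minimal sketch of why a single outlier hurts low-bit quantization. It assumes symmetric per-tensor 8-bit quantization and made-up activation values; the paper's exact W8A8 scheme may differ, and the helper names `quantize_int8` and `mean_abs_error` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor 8-bit quantization; returns the dequantized values."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # integer code in [-127, 127]
        dequantized.append(q * scale)
    return dequantized


def mean_abs_error(reference, approx):
    return sum(abs(a - b) for a, b in zip(reference, approx)) / len(reference)


# Hypothetical activations: 100 values spread evenly over [-1, 1].
normal = [-1.0 + 2.0 * i / 99 for i in range(100)]
# Same activations plus one large outlier, which now sets the quantization scale.
with_outlier = normal + [60.0]

err_normal = mean_abs_error(normal, quantize_int8(normal))
# Error on the original 100 values only, after the outlier has stretched the scale.
err_outlier = mean_abs_error(normal, quantize_int8(with_outlier)[:100])

print(f"error without outlier: {err_normal:.5f}")
print(f"error with outlier:    {err_outlier:.5f}")
```

The outlier stretches the quantization step, so the ordinary values collapse onto a handful of integer codes and their reconstruction error grows sharply.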

Concrete Example: A BERT model pretrained with Clipped Softmax (CS) reaches an average GLUE score of only 68.1, versus 81.7 for vanilla BERT. The CS normalization depends on sequence length, so attention probabilities are distorted when lengths change during finetuning.

Key Novelty

Normalized Clipped Softmax (NCS)

Modifies the clipped softmax function to use a normalization term that is invariant to sequence length, unlike the original method where normalization fluctuated with length
Ensures the product of attention probability and value matrix remains consistent between pretraining and downstream tasks with varying sequence lengths
Adapts the normalization for causal attention (OPT) by accounting for the varying context length of tokens within the lower-triangular attention mask
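The length sensitivity described above can be illustrated with a small sketch. It uses the clipped softmax form from Bondarenko et al. (2023), clip((ζ − γ)·softmax(x) + γ, 0, 1), with ζ = 1 and an illustrative γ = −0.03 chosen for this example only. It does not reproduce the paper's NCS formula (Equation 3). It only shows that, with fixed clipping parameters, the sum of the clipped probabilities changes with sequence length.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(scores, zeta=1.0, gamma=-0.03):
    """clip((zeta - gamma) * softmax(x) + gamma, 0, 1); gamma here is illustrative."""
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(scores)]


def row_sum_uniform(length, gamma=-0.03):
    # Uniform scores: every token receives softmax probability 1 / length.
    return sum(clipped_softmax([0.0] * length, gamma=gamma))


# Sum of clipped attention probabilities for several sequence lengths.
sums = {T: row_sum_uniform(T) for T in (8, 16, 32, 64)}
for T, s in sums.items():
    print(f"T={T:3d}  row sum={s:.3f}")
```

With these assumed values the row sum shrinks as the sequence grows and reaches zero at T = 64, so the product of attention probabilities and values is scaled differently at each length. This is the kind of mismatch a length-invariant normalization is meant to remove.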

Architecture

Mathematical formulation of the Clipped Softmax (CS) and Normalized Clipped Softmax (NCS)

Evaluation Highlights

Recovered FP16 performance: NCS improves the average GLUE score from 68.1 (CS) to 73.8, recovering 5.7 of the 13.6 points lost relative to vanilla BERT (81.7)
Successful causal model quantization: On OPT-125M, NCS achieves W8A8 perplexity of 18.33, beating both Vanilla (21.18) and standard CS (37.20)
Reduced sensitivity: NCS maintains consistent pretraining performance across different maximum sequence lengths (64 to 256), whereas standard CS fluctuates

Breakthrough Assessment

4/10

Identifies a critical flaw in prior outlier-free pretraining (length sensitivity) and proposes a fix. While it improves quantization for causal models, it still lags behind vanilla models in full-precision finetuning.

⚙️ Technical Details

Problem Definition

Setting: Pretraining Transformer models (BERT, OPT) with modified attention mechanisms to prevent outlier formation in activations/weights

Inputs: Token sequences for MLM (BERT) or Next Token Prediction (OPT)

Outputs: Quantization-friendly pretrained checkpoints

Pipeline Flow

Input Embedding
Transformer Layers with NCS Attention
Output Head

System Modules

NCS Attention

Compute attention probabilities using length-invariant normalization to prevent outliers

Model or implementation: Modified Self-Attention

Novel Architectural Elements

Normalized Clipped Softmax (NCS): Enforces a constant sum for probability normalization regardless of sequence length T (Equation 3)
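For causal models, the effective length differs per token, because token t only sees positions 0..t under the lower-triangular mask. The sketch below is not the paper's Equation 3. It applies the plain clipped softmax form clip((ζ − γ)·p + γ, 0, 1) with illustrative ζ = 1 and γ = −0.03 to uniform attention over each token's visible context, to show how the row sums drift within a single sequence. The function name `causal_row_sums` is hypothetical.

```python
def causal_row_sums(seq_len, zeta=1.0, gamma=-0.03):
    """Row sums of clipped uniform attention under a causal mask (illustrative values)."""
    sums = []
    for t in range(seq_len):
        context = t + 1        # token t attends to positions 0..t
        p = 1.0 / context      # uniform softmax probability over the visible context
        clipped = min(1.0, max(0.0, (zeta - gamma) * p + gamma))
        sums.append(context * clipped)
    return sums


row_sums = causal_row_sums(64)
for t in (0, 7, 31, 63):
    print(f"token {t:2d} (context {t + 1:2d}): row sum = {row_sums[t]:.3f}")
```

Under these assumptions the first token keeps a row sum of 1 while the last one is clipped to 0, which suggests why a single sequence-level constant is a poor fit for causal attention and why a per-token adjustment is needed.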

Modeling

Base Model: BERT (Small/Base/Large) and OPT (125M/350M)

Training Method: Pretraining from scratch

Objective Functions:

Purpose: Train language models while suppressing outlier formation.

Formally: Standard MLM (BERT) or Causal LM loss (OPT) using NCS attention.

Training Data:

BookCorpus and English Wikipedia (clean Wiki-40b subset)
Packed sequences with specific max lengths (e.g., 128 for BERT-base)

Key Hyperparameters:

bert_learning_rate: 7e-4 (Small/Base), 1e-4 (Large)
opt_learning_rate: 4e-4
batch_size: 2048 (BERT), 192/256 (OPT)
NCS_zeta: 1 (from CS optimal)
NCS_beta: -2.175 (BERT), 0.9 (OPT)

Compute: Single A6000 48GB GPU

Comparison to Prior Work

vs. CS: NCS uses fixed normalization constant β to handle variable sequence lengths, improving FP16 transfer and causal modeling
vs. Vanilla: NCS explicitly removes outliers during training, enabling W8A8 without complex PTQ tricks

Limitations

Negative results on scaling: OPT-350M with NCS performed worse than vanilla in W8A8, suggesting issues scaling to larger causal models
Gap in FP16: While improved over CS, NCS still underperforms Vanilla BERT on GLUE (73.8 vs 81.7)
Strict quantization setting: Evaluated on stricter W8A8 (including embeddings/norm) than typical PTQ papers, making direct comparison difficult

Reproducibility

Code: https://github.com/Qualcomm-AI-research/outlier-free-transformers

Code for the baseline (CS) is linked. The paper provides exact hyperparameters for NCS (derived from CS). Pretraining is resource-intensive, but the experimental setup is standard (BERT/OPT on Wiki/BookCorpus).

📊 Experiments & Results

Evaluation Setup

Pretrain from scratch -> Evaluate FP16 performance (Perplexity/GLUE) -> Quantize to W8A8 -> Evaluate W8A8 Perplexity

Benchmarks:

GLUE: natural language understanding (finetuning)
Wiki-40b Validation: language modeling (perplexity)

Metrics:

MLM Accuracy
Perplexity (PPL)
GLUE Average Score
Kurtosis (outlier metric)
Infinity Norm (max activation value)
Statistical methodology: Not explicitly reported in the paper
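As a rough guide to how the outlier and language-modeling metrics above are computed, here is a minimal sketch. It assumes Pearson (non-excess) kurtosis and perplexity as the exponential of the mean per-token negative log-likelihood. The paper may use a different kurtosis convention or aggregation across layers, and the sample values are made up.

```python
import math


def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2)


def infinity_norm(values):
    """Largest absolute activation value."""
    return max(abs(v) for v in values)


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Hypothetical activations: well-behaved values, then the same values plus one outlier.
base = [-1.0, 1.0] * 50
spiky = base + [60.0]

print("kurtosis (base): ", kurtosis(base))
print("kurtosis (spiky):", kurtosis(spiky))
print("inf norm (spiky):", infinity_norm(spiky))
print("perplexity:      ", perplexity([math.log(4.0)] * 3))
```

A single large activation is enough to raise both kurtosis and the infinity norm substantially, which is why the two are used together as outlier indicators.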

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
BERT-base results showing NCS recovers some FP16 performance compared to CS but still lags vanilla, while maintaining excellent quantization properties.
GLUE (Avg)	Score	81.7	73.8	-7.9
BERT-base W8A8	Perplexity	4612.6	4.95	-4607.65
OPT-125M results demonstrating that NCS fixes the failure of CS on causal language models.
OPT-125M W8A8	Perplexity	21.18	18.33	-2.85
OPT-125M	Kurtosis	1778.0	1104.5	-673.5

Experiment Figures

Left: GLUE scores for BERT-base. Right: MLM accuracy vs Sequence Length.

Sensitivity of BERT-small pretraining to max sequence length (64, 128, 256).

Main Takeaways

Sequence length mismatch between pretraining and finetuning hurts performance of outlier-free models (CS); NCS mitigates this by normalizing invariantly to length.
Outlier-free pretraining (NCS/CS) enables W8A8 quantization for BERT where vanilla models fail completely.
For causal models (OPT), standard CS fails because token context lengths vary; NCS fixes this and achieves best-in-class W8A8 perplexity for OPT-125M.
Scaling limitation: The method works for small models (<350M) but failed to generalize to OPT-350M in initial experiments.

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanism (Softmax)
Post-training quantization (PTQ) basics
Outliers in LLM activations

Key Terms

CS: Clipped Softmax—a modified softmax proposed by Bondarenko et al. (2023) to limit probability magnitude and prevent outliers

NCS: Normalized Clipped Softmax—the proposed method that makes the normalization constant independent of sequence length

no-op: A phenomenon where attention heads allocate probability to specific tokens (like [SEP]) to avoid updating the representation, often leading to outliers

W8A8: Quantization setting with 8-bit weights and 8-bit activations

kurtosis: A measure of the 'tailedness' of a distribution; high kurtosis in activations indicates severe outliers

MLM: Masked Language Modeling—pretraining objective for BERT

GLUE: General Language Understanding Evaluation—benchmark for NLU tasks