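FP8: An 8-bit floating-point format with one sign bit and the remaining seven bits split between exponent and mantissa. Variants include E5M2 (5 exponent bits, 2 mantissa bits), E4M3 (4 exponent bits, 3 mantissa bits), and E3M4 (3 exponent bits, 4 mantissa bits).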
E5M2: An FP8 format with 5 exponent bits and 2 mantissa bits, offering high dynamic range but lower precision; its exponent width matches that of FP16, so the two cover a similar range.
E4M3: An FP8 format with 4 exponent bits and 3 mantissa bits, offering a balance of range and precision.
E3M4: An FP8 format with 3 exponent bits and 4 mantissa bits, offering higher precision but limited dynamic range.
Post-training Quantization (PTQ): Compressing a model after training without retraining it, usually using a small calibration dataset.
Mantissa: The fraction bits of a floating-point number, which hold its significant digits; more mantissa bits mean finer precision.
Dynamic Range: The ratio between the largest and smallest non-zero magnitudes a format can represent.
LayerNorm: Layer Normalization, a technique that normalizes activations across the feature dimension to stabilize training; known to produce large outliers in LLMs.
Calibration: The process of estimating the range of activation values (e.g., min/max) to determine scaling factors for quantization.
Per-channel scaling: Assigning a separate scaling factor to each channel of a weight tensor, rather than one factor for the whole tensor, to reduce quantization error.
FID: Fréchet Inception Distance, a metric used to assess the quality of images generated by generative models (lower is better).
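
The range/precision trade-off among the three FP8 formats, and the idea of per-channel scaling, can be made concrete with a short Python sketch. It assumes an IEEE-style encoding (exponent bias of 2^(E-1) - 1, all-ones exponent reserved for infinity/NaN, subnormals supported); deployed FP8 variants may reclaim some of those encodings and reach larger maxima. The names FP8Format and per_channel_scales are hypothetical helpers introduced for illustration, and the abs-max rule used for the scale is only one simple calibration choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FP8Format:
    """An 8-bit float layout: 1 sign bit + exp_bits + man_bits (assumed IEEE-style)."""
    name: str
    exp_bits: int
    man_bits: int

    @property
    def bias(self) -> int:
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_value(self) -> float:
        # Largest finite value, assuming the all-ones exponent is reserved for inf/NaN.
        top_exp = (2 ** self.exp_bits - 2) - self.bias
        return (2 - 2.0 ** -self.man_bits) * 2.0 ** top_exp

    @property
    def min_subnormal(self) -> float:
        # Smallest positive value: lowest mantissa bit set, exponent field zero.
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def dynamic_range(self) -> float:
        # Ratio of the largest to the smallest non-zero magnitude.
        return self.max_value / self.min_subnormal


FORMATS = [
    FP8Format("E5M2", 5, 2),
    FP8Format("E4M3", 4, 3),
    FP8Format("E3M4", 3, 4),
]


def per_channel_scales(weights, fmt):
    """Return one scale per channel (row) that maps its abs-max onto fmt.max_value."""
    scales = []
    for row in weights:
        absmax = max(abs(w) for w in row)
        # An all-zero channel needs no scaling; fall back to 1.0.
        scales.append(fmt.max_value / absmax if absmax > 0 else 1.0)
    return scales


if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt.name}: max={fmt.max_value}, "
              f"min_subnormal={fmt.min_subnormal}, "
              f"range={fmt.dynamic_range:.3g}")
    toy_weights = [[0.5, -2.0, 1.0], [0.125, -0.25, 0.0625]]
    print(per_channel_scales(toy_weights, FORMATS[1]))
```

Under these assumptions the largest finite values are 57344 for E5M2, 240 for E4M3, and 15.5 for E3M4, so each exponent bit traded for a mantissa bit shrinks the range sharply. The two toy channels differ in magnitude by a factor of eight, which is why they receive different scales; a single per-tensor scale would leave the smaller channel using only a fraction of the representable range.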