PD: Pre-training Distillation—applying knowledge distillation during the large-scale pre-training phase of an LLM using teacher logits
Logits: The raw, unnormalized prediction scores generated by the final layer of a neural network before the softmax function
Top-p-k truncation: A two-stage method for compressing a token distribution: first keep the smallest set of tokens whose cumulative probability exceeds p (top-p), then keep only the k highest-probability tokens among those (top-k)
WSD: Warmup-Stable-Decay—a learning rate or loss weight schedule that warms up, stays constant, and then decays
Offline logits: Logits generated by a pre-trained teacher model and stored on disk before student training begins
Online logits: Logits generated on the fly by a teacher model running alongside the student during training
SFT: Supervised Fine-Tuning—training on high-quality instruction-response pairs, used here to evaluate the pre-trained base models
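The top-p-k truncation entry above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source: the function name `top_p_k` and its default thresholds are assumptions. It also shows the softmax step that turns raw logits into the probabilities being truncated.

```python
import math

def top_p_k(logits, p=0.95, k=8):
    """Hypothetical sketch of top-p-k truncation for compressing logits.

    Keep the smallest set of tokens whose cumulative probability
    exceeds p, then keep at most the k most probable of those.
    Returns a list of (token_id, probability) pairs.
    """
    # Softmax: convert raw logits into a probability distribution
    # (subtracting the max for numerical stability).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = sorted(
        ((i, e / z) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Stage 1 (top-p): smallest prefix whose cumulative probability exceeds p.
    kept, cum = [], 0.0
    for tok, pr in probs:
        kept.append((tok, pr))
        cum += pr
        if cum > p:
            break
    # Stage 2 (top-k): cap the surviving set at k entries.
    return kept[:k]
```

For example, `top_p_k([2.0, 1.0, 0.1, -1.0], p=0.9, k=2)` first retains the three most probable tokens (the smallest set exceeding 0.9 cumulative mass), then the top-k stage trims that set to two.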
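The WSD schedule defined above can also be written out directly. A minimal sketch, assuming linear warmup and linear decay; the function name `wsd_lr` and the fraction defaults are illustrative, not taken from the source.

```python
def wsd_lr(step, total_steps, peak_lr=1e-3,
           warmup_frac=0.1, decay_frac=0.2, final_lr=0.0):
    """Hypothetical Warmup-Stable-Decay (WSD) schedule.

    Linearly warms up to peak_lr over the first warmup_frac of
    training, holds it constant, then linearly decays to final_lr
    over the last decay_frac of training.
    """
    warmup_end = warmup_frac * total_steps
    decay_start = (1.0 - decay_frac) * total_steps
    if step < warmup_end:
        # Warmup: linear ramp from 0 to peak_lr.
        return peak_lr * step / warmup_end
    if step < decay_start:
        # Stable: constant plateau at peak_lr.
        return peak_lr
    # Decay: linear ramp from peak_lr down to final_lr.
    t = (step - decay_start) / (total_steps - decay_start)
    return peak_lr + t * (final_lr - peak_lr)
```

The same shape applies whether the scheduled quantity is a learning rate or a distillation loss weight; only the peak and final values change.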