Pre-training Distillation for Large Language Models: A Design Space Exploration

📝 Paper Summary

Large Language Model Pre-training, Knowledge Distillation, Model Compression

Pre-training distillation (transferring knowledge from a teacher LLM to a student LLM during pre-training via logits) consistently improves performance compared to standard language modeling, with optimal gains achieved by specific logits truncation and loss scheduling strategies.

Core Problem

Standard pre-training of smaller LLMs relies solely on hard labels (next-token prediction), missing the rich semantic information available in the probability distributions of larger, more capable teacher models.

Why it matters:

Training smaller, efficient LLMs is crucial for deployment, but they often lack the reasoning capabilities of larger models.
Post-training distillation is common, but applying distillation during the expensive and critical pre-training phase is underexplored due to massive data and computational costs.
Storing full logits for pre-training corpora (trillions of tokens) is storage-prohibitive (petabytes), requiring efficient compression techniques.

Concrete Example: Storing full float32 logits for a 150k vocabulary over 100B tokens would require ~58.6 PB of disk space. Without efficient truncation (such as Top-p-k), pre-training distillation is impractical because of storage constraints alone.
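The storage figure can be checked with a short calculation. This is a sketch under one stated assumption: "150k" and "100B" are read with binary prefixes (150 × 1024 vocabulary entries, 100 × 1024^3 tokens), which is the reading that reproduces ~58.6. With decimal prefixes the same product comes to 60 PB. The 15 TB figure is the compressed size quoted elsewhere in this summary.

```python
# Back-of-the-envelope storage estimate for full teacher logits.
# Assumption: "150k" and "100B" use binary prefixes (Ki, Gi).
BYTES_PER_FLOAT32 = 4
vocab_size = 150 * 1024          # ~150k vocabulary entries
num_tokens = 100 * 1024**3       # ~100B tokens

full_bytes = vocab_size * num_tokens * BYTES_PER_FLOAT32
full_pib = full_bytes / 1024**5  # bytes -> PiB

truncated_tib = 15               # compressed size quoted in this summary
reduction = full_bytes / (truncated_tib * 1024**4)

print(f"full logits: {full_pib:.1f} PiB")                    # full logits: 58.6 PiB
print(f"reduction vs {truncated_tib} TiB: {reduction:.0f}x")  # reduction vs 15 TiB: 4000x
```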

Key Novelty

Systematic Design Space Exploration for Pre-training Distillation (PD)

Proposes a 'Pre-training Distillation' (PD) framework where a student LLM learns from a teacher's logits on massive unlabeled corpora, not just instruction data.
Identifies efficient logits storage techniques (Top-p + Top-k truncation) that reduce storage by ~4000x without hurting performance.
Discovers that a dynamic mixture of distillation loss and standard language modeling loss (Warmup-Stable-Decay schedule) outperforms static mixing.

Architecture

Comparison of validation loss curves and downstream performance between Baseline (LLM-LM) and Pre-training Distillation (PD/LLM-KD).

Evaluation Highlights

+1.6% relative improvement in average score across 8 benchmarks (37.7 → 38.3; benchmarks include MMLU and GSM8k) for a 1.9B student distilled from GLM-4-9B on 100B tokens, compared to standard pre-training.
Efficient logits truncation (Top-p=0.95 followed by Top-k=100) reduces storage by 4,000x (58.6 PB → 15 TB) while maintaining distillation benefits.
+8.0% relative improvement in average score (37.7 → 40.7) using a Warmup-Stable-Decay (WSD) scheduler for the distillation loss weight, compared to baseline LM pre-training.

Breakthrough Assessment

7/10

Provides the first comprehensive empirical study on pre-training distillation for LLMs, offering practical recipes for scaling (storage reduction, loss scheduling). While the fundamental concept of KD is old, the application to LLM pre-training scale is significant.

⚙️ Technical Details

Problem Definition

Setting: Pre-training a student LLM on a large corpus where each token has both a ground truth next-token label and a teacher model's logit distribution.

Inputs: Input text sequence x, Teacher LLM logits P_T(x), Ground truth token labels

Outputs: Trained Student LLM parameters theta_S

Pipeline Flow

Teacher Inference (GLM-4-9B generates logits for corpus)
Logits Processing (Top-p + Top-k truncation & normalization)
Student Training (1.9B model trains on text + processed logits)

System Modules

Teacher Model

Generate soft targets (probability distributions) for the pre-training corpus

Model or implementation: GLM-4-9B (or 32B in scaling experiments)

Logits Processor

Compress logits to manageable size for storage and streaming

Model or implementation: Deterministic algorithm

Student Model

Learn from both hard labels (text) and soft labels (teacher logits)

Model or implementation: 1.9B - 6.8B parameter dense transformers

Novel Architectural Elements

Two-stage Top-p-k logits truncation specifically optimized for massive LLM vocabulary storage reduction
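The two-stage truncation can be sketched as follows. This is a minimal NumPy illustration for a single token position, not the paper's implementation: the function name `top_p_k_truncate`, the decision to treat "cumulative probability reaches p" as the cut-off, and the renormalization of the kept mass to sum to 1 are assumptions made for the example.

```python
import numpy as np

def top_p_k_truncate(logits, top_p=0.95, top_k=100, temperature=1.0):
    """Hypothetical helper: keep a sparse teacher distribution for one position.

    Stage 1 (top-p): keep the smallest prefix of tokens, sorted by
    probability, whose cumulative probability reaches top_p.
    Stage 2 (top-k): keep at most top_k of those tokens.
    The kept probabilities are renormalized to sum to 1 (an assumption).
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                      # stabilize the softmax
    probs = np.exp(z)
    probs /= probs.sum()

    order = np.argsort(-probs, kind="stable")   # token ids, most likely first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)

    # First index where cumulative probability >= top_p, converted to a count.
    n_top_p = int(np.searchsorted(cumulative, top_p, side="left")) + 1
    n_keep = min(n_top_p, top_k, len(order))

    kept_ids = order[:n_keep]
    kept_probs = sorted_probs[:n_keep]
    return kept_ids, kept_probs / kept_probs.sum()

# Toy 4-token vocabulary with probabilities 0.5, 0.3, 0.15, 0.05.
ids, probs = top_p_k_truncate(np.log([0.5, 0.3, 0.15, 0.05]), top_p=0.9, top_k=2)
print(ids, probs)
```

Only the kept token ids and their probabilities need to be stored per position, which is where the storage saving comes from.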

Modeling

Base Model: Custom Transformer models (1.9B, 3.8B, etc.) initialized from scratch

Training Method: Pre-training with Knowledge Distillation (KD)

Objective Functions:

Purpose: Standard language modeling.
Formally: L_lm = -log P_S(x_t | x_<t)

Purpose: Align student distribution with teacher.
Formally: L_kd = KL(P_T || P_S) or -sum(P_T * log P_S)

Purpose: Combined objective.
Formally: L = (1-alpha) * L_lm + alpha * L_kd
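The combined objective can be illustrated for a single token position with a small NumPy sketch. The function name `pd_loss` and the sparse `(teacher_ids, teacher_probs)` input format are assumptions for the example. It uses the cross-entropy form of L_kd; since KL(P_T || P_S) differs from it only by the teacher's entropy, which does not depend on the student, both forms give the same student gradients.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def pd_loss(student_logits, target_id, teacher_ids, teacher_probs, alpha):
    """Hypothetical per-position loss: L = (1 - alpha) * L_lm + alpha * L_kd.

    student_logits : 1-D array over the vocabulary
    target_id      : ground-truth next-token id (hard label)
    teacher_ids    : token ids kept after truncation (assumed sparse format)
    teacher_probs  : renormalized teacher probabilities for those ids
    """
    log_p_s = log_softmax(np.asarray(student_logits, dtype=np.float64))
    l_lm = -log_p_s[target_id]                                   # -log P_S(x_t | x_<t)
    l_kd = -np.sum(np.asarray(teacher_probs) * log_p_s[teacher_ids])  # -sum(P_T * log P_S)
    return (1.0 - alpha) * l_lm + alpha * l_kd, l_lm, l_kd

# A uniform student over 4 tokens gives log(4) for both terms.
loss, l_lm, l_kd = pd_loss(np.zeros(4), 2, np.array([0, 1]),
                           np.array([0.625, 0.375]), alpha=0.9)
print(loss, l_lm, l_kd)
```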

Training Data:

100B to 500B tokens of pre-training data
Subsampled from larger internal corpora

Key Hyperparameters:

learning_rate: 6e-4 (max), 6e-5 (min)
batch_size: 2,048
max_sequence_length: 4,096
warmup_rate: 1%
teacher_logit_temperature: 1.0 (standard), checked 0.05-10.0
alpha: Variable (0.9 best for static, WSD schedule best for dynamic)
top_p: 0.95
top_k: 100
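For the dynamic alpha setting, a Warmup-Stable-Decay schedule for the distillation weight might look like the sketch below. This summary does not give the schedule's shape, so the linear ramps, the `warmup_frac` and `decay_frac` values, and the `peak`/`floor` values are all illustrative assumptions rather than the paper's settings.

```python
def wsd_alpha(step, total_steps, peak=0.9, floor=0.0,
              warmup_frac=0.01, decay_frac=0.1):
    """Hypothetical WSD schedule for the distillation weight alpha.

    Linear warmup from floor to peak, a constant plateau at peak,
    then a linear decay back to floor. All defaults are assumptions.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:                       # warmup phase
        return floor + (peak - floor) * step / warmup_steps
    if step < decay_start:                        # stable phase
        return peak
    remaining = max(0.0, (total_steps - step) / decay_steps)
    return floor + (peak - floor) * remaining     # decay phase

# Sample the schedule at a few points of a 1,000-step run.
for s in (0, 5, 500, 950, 1000):
    print(s, round(wsd_alpha(s, 1000), 3))
```

The returned value would be plugged in as alpha in L = (1-alpha) * L_lm + alpha * L_kd at each training step.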

Compute: Teacher logits generation requires substantial inference; storage requires ~15TB disk space for 100B tokens (compressed)

Comparison to Prior Work

vs. Post-training KD (MiniLLM/GKD): Focuses on the pre-training phase with massive unlabeled corpora rather than limited instruction data
vs. AFM [not cited in paper]: AFM uses top-k=1 (hard labels from teacher); this paper finds top-k=50/100 significantly better
vs. Standard Pre-training: Adds teacher logits as supervision signal

Limitations

Storage overhead for logits is still significant (TB scale) even with compression
Offline logits approach requires double computation (teacher inference + student training)
Online logits from a non-converged teacher (training simultaneously) perform poorly compared to converged teacher logits
Experiments limited to 100B-500B tokens, not full trillion-scale training

Reproducibility

Code is not provided (no code URL). The teacher is an internal GLM-4-9B model, and the pre-training data is a proprietary internal subset.

📊 Experiments & Results

Evaluation Setup

Pre-train 1.9B/3.8B models on 100B/500B tokens, then SFT on 10B tokens, then evaluate zero-shot.

Benchmarks:

MMLU (English Understanding & Reasoning)
C-Eval (Chinese Understanding)
GSM8k (Math Reasoning)
HellaSwag (Commonsense Reasoning)

Metrics:

Accuracy (Zero-shot)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison shows PD consistently outperforms standard LM pre-training across diverse benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 datasets)	Accuracy	37.7	38.3	+0.6
GSM8k	Accuracy	8.6	10.8	+2.2
C-Eval	Accuracy	25.9	26.7	+0.8

Loss scheduling ablations reveal that Warmup-Stable-Decay (WSD) for the distillation weight alpha yields the largest gains.

Benchmark	Metric	Baseline	This Paper	Δ
Average (8 datasets)	Accuracy	37.7	40.7	+3.0

Scaling analysis shows larger students benefit more, but larger teachers don't always help.

Benchmark	Metric	Baseline	This Paper	Δ
Relative Improvement	% Improvement	0.1	1.7	+1.6

Experiment Figures

Scaling law analysis: Relative improvement of PD over LM baseline across different student model sizes (330M to 6.8B) and teacher sizes (9B vs 32B).

Accuracy curves over training tokens (0-500B) for 1.9B and 3.8B models.

Main Takeaways

Top-p-k truncation (p=0.95, k=100) effectively compresses logits by 4000x without degrading distillation performance compared to sharper truncations.
MSE loss performs significantly worse than KL Divergence or NLL for LLM pre-training distillation.
Dynamic loss scheduling (WSD) is crucial: maintaining high KD weight while learning rate is high, then decaying, yields the best results.
Online distillation (using logits from a simultaneously training teacher) is less effective than offline (converged teacher) but still provides gains over baseline LM training.
Capacity gap matters: A 32B teacher did not outperform a 9B teacher when distilling into a 1.9B student, suggesting extremely large teachers may not be optimal for small students.

📚 Prerequisite Knowledge

Prerequisites

Language Modeling (Next-token prediction)
Knowledge Distillation (KD)
Kullback–Leibler (KL) Divergence

Key Terms

PD: Pre-training Distillation—applying knowledge distillation during the large-scale pre-training phase of an LLM using teacher logits

Logits: The raw, unnormalized prediction scores generated by the final layer of a neural network before the softmax function

Top-p-k truncation: A two-stage compression method: first keeping the smallest set of tokens whose cumulative probability exceeds p, then keeping only the top k of those

WSD: Warmup-Stable-Decay—a learning rate or loss weight schedule that warms up, stays constant, and then decays

Offline logits: Logits generated by a pre-trained teacher model and stored on disk before student training begins

Online logits: Logits generated on-the-fly by a teacher model that is being trained or run simultaneously with the student

SFT: Supervised Fine-Tuning—training on high-quality instruction-response pairs, used here to evaluate the pre-trained base models