Memorization Dynamics in Knowledge Distillation for Language Models

📝 Paper Summary

LLM Knowledge Distillation; Privacy and Memorization in LLMs

Knowledge distillation acts as a strong regularizer that significantly reduces training data memorization compared to standard fine-tuning, while preferentially retaining 'easy-to-memorize' examples with low entropy.

Core Problem

While memorization is well-studied in standard pre-training and fine-tuning, its dynamics during knowledge distillation—where a student mimics a teacher's distribution—are poorly understood.

Why it matters:

Large teacher models inevitably memorize training data, raising concerns that distilled students might inherit sensitive information or privacy vulnerabilities
Distillation is often cited as a privacy defense, but the extent of data leakage remains unquantified in modern LLM settings
Training data extraction attacks can recover proprietary or private data, so understanding defense mechanisms is crucial for safe deployment

Concrete Example: A baseline 1.4B model fine-tuned on FineWeb memorizes 1,698 specific training examples. When the same 1.4B architecture is instead trained via distillation from a 12B teacher on the same data, it memorizes only ~700 examples, avoiding more than half of the baseline's memorized examples.

Key Novelty

Distillation as a Memorization Filter

Demonstrates that minimizing KL divergence (soft targets) acts as a regularizer, preventing the student from over-fitting to specific training examples compared to cross-entropy (hard targets)
Identifies 'easy-to-memorize' examples (low zlib entropy, low perplexity) that are consistently memorized across models of the same family, while harder examples are filtered out by distillation
Proposes a pre-distillation classifier that uses teacher/baseline statistics to predict and remove high-risk memorization candidates before training begins

Architecture

Experimental framework comparing three training setups: Teacher (Cross-Entropy), Baseline (Cross-Entropy), and Student (KL Divergence Distillation), followed by a Memorization Evaluation phase.

Evaluation Highlights

Distilled Pythia-1.4B student reduces memorization by ~2.4x compared to a standard fine-tuned baseline on FineWeb data (from ~1700 to ~700 examples)
Student inherits only 0.9% (18 out of 1,955) of the examples exclusively memorized by the Teacher, effectively stripping unique teacher-side risks
A logistic regression classifier predicts student memorization with 0.9997 AUC prior to training, enabling the removal of 99.8% of memorized examples
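
The headline ratios follow directly from the reported counts. The short check below is a sketch that assumes the rounded student count of 700 (the summary gives only "~700"); the variable names are illustrative, not the paper's.

```python
# Counts quoted in the summary. The student figure is approximate ("~700").
baseline_memorized = 1698   # examples memorized by the fine-tuned baseline
student_memorized = 700     # assumption: rounded value quoted in the summary
teacher_only = 1955         # examples memorized exclusively by the teacher
inherited = 18              # of those, also memorized by the student

reduction_factor = baseline_memorized / student_memorized
fraction_avoided = 1 - student_memorized / baseline_memorized
inheritance_rate = inherited / teacher_only

print(f"reduction factor: {reduction_factor:.2f}x")           # 2.43x
print(f"memorization avoided: {fraction_avoided:.1%}")        # 58.8%
print(f"teacher-only inheritance: {inheritance_rate:.2%}")    # 0.92%
```

With these inputs the reduction is about 2.4x, which corresponds to avoiding a little under 60% of the baseline's memorized examples.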

Breakthrough Assessment

7/10

Provides the first systematic quantification of memorization in LLM distillation. While methodologically straightforward, the finding that distillation is a robust privacy defense (reducing memorization by >50%) is highly significant.

⚙️ Technical Details

Problem Definition

Setting: LLM Knowledge Distillation (Fine-tuning setup)

Inputs: Training dataset D, Teacher model M_teacher, Student architecture

Outputs: Trained Student model M_student

Pipeline Flow

Teacher Fine-tuning (on dataset D)
Student Distillation (on dataset D using Teacher logits)
Memorization Audit (Extract suffixes given prefixes)
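
The audit step can be sketched as a greedy-decoding check: feed the model a prefix and test whether it reproduces the true suffix exactly. This is a minimal illustration, not the paper's code. `next_token` is a hypothetical stand-in for "argmax over the model's next-token logits", and the prefix and suffix lengths are left as explicit arguments.

```python
from typing import Callable, List, Sequence

def greedy_continue(next_token: Callable[[Sequence[int]], int],
                    prefix: Sequence[int], num_tokens: int) -> List[int]:
    """Extend `prefix` by `num_tokens` tokens, one greedy step at a time."""
    tokens = list(prefix)
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
    return tokens[len(prefix):]

def is_discoverably_memorized(next_token: Callable[[Sequence[int]], int],
                              example_tokens: Sequence[int],
                              prefix_len: int, suffix_len: int) -> bool:
    """True if greedy decoding from the prefix reproduces the exact suffix."""
    if len(example_tokens) < prefix_len + suffix_len:
        raise ValueError("example is shorter than prefix_len + suffix_len")
    prefix = example_tokens[:prefix_len]
    true_suffix = list(example_tokens[prefix_len:prefix_len + suffix_len])
    return greedy_continue(next_token, prefix, suffix_len) == true_suffix

# Toy "model" (hypothetical): always predicts the previous token plus one.
def counting_model(tokens: Sequence[int]) -> int:
    return tokens[-1] + 1

seen_example = list(range(20))                 # the toy model reproduces this
other_example = list(range(10)) + [99] * 10    # suffix the toy model cannot produce

print(is_discoverably_memorized(counting_model, seen_example, 10, 10))   # True
print(is_discoverably_memorized(counting_model, other_example, 10, 10))  # False
```

In a real audit `next_token` would wrap a forward pass of the teacher, baseline, or student, and the check would be run over every training example.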

System Modules

Teacher Model (Training)

Provide soft targets (logits) to guide the student

Model or implementation: Pythia-12B (primary), also OLMo-2, Qwen-3

Student Model (Training)

Learn to mimic the teacher's distribution with a much smaller parameter count

Model or implementation: Pythia-1.4B (primary), also OLMo-2-1B, Qwen-3-1.5B

Modeling

Base Model: Pythia-1.4B (Student), Pythia-12B (Teacher)

Training Method: Knowledge Distillation via Forward KL Divergence

Objective Functions:

Purpose: Mimic teacher distribution.

Formally: Forward KL Divergence Loss: $L_{\mathrm{KD}} = T^2 \sum_i P_{\mathrm{teacher}}(i) \log \frac{P_{\mathrm{teacher}}(i)}{P_{\mathrm{student}}(i)}$

Purpose: Standard training baseline.

Formally: Cross-Entropy Loss
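
The two objectives can be compared on a single token position. The sketch below is illustrative only: it assumes both teacher and student logits are softened with the same temperature T before the KL term is computed (the usual convention when a T^2 factor appears), and the function names are not from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(P_teacher || P_student) over one vocabulary distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return temperature ** 2 * kl

def cross_entropy_loss(logits, target_index):
    """Hard-target loss: negative log-probability of the ground-truth token."""
    return -math.log(softmax(logits)[target_index])

teacher = [2.0, 1.0, 0.1]
student = [0.5, 0.5, 0.5]
print(forward_kl_distillation_loss(teacher, teacher))  # 0.0 when distributions match
print(forward_kl_distillation_loss(teacher, student) > 0)  # True
```

The distillation loss is zero only when the student matches the whole teacher distribution, whereas cross-entropy rewards putting all mass on the single observed token, which is the contrast the regularization argument rests on.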

Adaptation: Full fine-tuning

Training Data:

1M examples from FineWeb (July 2025 Common Crawl dump)
Sequence length 256 tokens
Split: Prefix k=50, Suffix L=100 for evaluation

Key Hyperparameters:

temperature: 2.0
learning_rate: 5e-5
scheduler: cosine decay
batch_size: Not reported in the paper
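
For reference, a cosine decay schedule with the reported peak learning rate might look like the sketch below. The summary does not state whether warmup was used, what the final learning rate was, or how many steps were taken, so `total_steps`, `min_lr`, and the absence of warmup are all assumptions.

```python
import math

def cosine_decay_lr(step, total_steps, peak_lr=5e-5, min_lr=0.0):
    """Learning rate at `step`, decaying from peak_lr to min_lr along a half cosine.

    Assumptions: no warmup phase and min_lr = 0 unless overridden.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total_steps = 1000  # hypothetical; the step count is not given in the summary
for step in (0, 500, 1000):
    print(step, cosine_decay_lr(step, total_steps))
```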

Compute: Not reported in the paper

Comparison to Prior Work

vs. Jagielski et al.: Finds higher temperature reduces memorization (opposite to Jagielski's finding for membership inference) [not cited in paper]
vs. Dankers and Raunak: Finds minimal inheritance of teacher-specific memorization (0.9%) in LLMs, contrasting with high inheritance in MT

Limitations

Study focuses on 'discoverable' memorization (greedy decoding), which is a lower bound compared to other extraction attacks
Limited to 1M training examples, which is small compared to full LLM pre-training scales
Does not explore the trade-off between privacy and utility for very high temperatures (>2.0)

Reproducibility

The datasets (FineWeb, Wikitext) and models (Pythia, OLMo, Qwen) are public. Code is not provided. The distillation temperature and learning rate are specified.

📊 Experiments & Results

Evaluation Setup

Extraction attack on 1M training examples (FineWeb, Wikitext, Nemotron)

Benchmarks:

FineWeb (Data Extraction / Memorization)
Wikitext (Data Extraction / Memorization)
Nemotron-CC-v2 (Data Extraction / Memorization)

Metrics:

Memorization Rate (%)
Zlib Entropy
Perplexity
Recall/AUC (for memorization classifier)
Statistical methodology: Reported Mean/Std for classifier performance over 100 trials
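
The metrics above are simple to compute once per-example outcomes are available. The following sketch uses standard textbook definitions (AUC as the probability that a random positive outranks a random negative); it is not taken from the paper, and the toy scores are made up.

```python
def memorization_rate_percent(num_memorized, num_examples):
    """Share of training examples that were memorized, in percent."""
    return 100.0 * num_memorized / num_examples

def roc_auc(scores, labels):
    """Pairwise AUC: P(score of a positive > score of a negative), ties count 0.5.

    Quadratic in the number of examples, which is fine for illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def recall(predictions, labels):
    """Fraction of truly memorized examples (label 1) that were flagged."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    false_neg = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return true_pos / (true_pos + false_neg)

print(round(memorization_rate_percent(1698, 1_000_000), 2))   # 0.17
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))            # 0.75
print(recall([1, 0, 1, 0], [1, 1, 1, 0]))
```

The first line reproduces the FineWeb baseline figure: 1,698 memorized examples out of 1M is about 0.17%.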

Key Results

Main results: distilled student models memorize significantly less than baselines trained with standard cross-entropy fine-tuning, across multiple datasets.

Benchmark	Metric	Baseline	This Paper	Δ
FineWeb (Pythia)	Memorization Rate (%)	0.17	0.07	-0.10
Wikitext (Pythia)	Memorization Rate (%)	3.37	1.58	-1.79
Nemotron-CC-v2 (Pythia)	Memorization Rate (%)	0.0091	0.0012	-0.0079

Classification results: memorized examples are highly predictable from features available before distillation begins.

Benchmark	Metric	Baseline	This Paper	Δ
FineWeb	AUC-ROC	0.50	0.9997	+0.4997
FineWeb	Recall	Not reported in the paper	1.0000	Not reported in the paper

Experiment Figures

Venn diagrams or set overlaps of memorized examples between Teacher, Student, and Baseline.

Scatter plot of zlib entropy vs. baseline perplexity for 'easy-to-memorize' examples vs. random training examples.

Main Takeaways

Distillation acts as a filter: Student models inherit general capabilities but reject >99% of teacher-specific memorization.
Memorization is deterministic: 'Easy' examples (low entropy) are memorized across seeds and scales within the same model family.
Different architectures (Pythia vs OLMo vs Qwen) do NOT memorize the same examples, even though they all prefer low-entropy data.
Hard distillation (sequence-level) is riskier than soft distillation (logit-level): the student inherits 2.7x more of the difficult, teacher-specific examples.
Pre-filtering training data based on entropy and teacher perplexity can eliminate 99.8% of student memorization.
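
The pre-filtering idea in the last takeaway can be illustrated with a tiny logistic regression trained by gradient descent. Everything concrete here is an assumption for illustration: the two features (standardized zlib entropy and standardized teacher log-perplexity), the toy values, the 0.5 threshold, and the function names. The paper's actual feature set and training procedure are not described in this summary.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_risk(weights, bias, x):
    """Predicted probability that an example will be memorized."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit_logistic_regression(features, labels, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss."""
    weights, bias, n = [0.0] * len(features[0]), 0.0, len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(features, labels):
            err = predict_risk(weights, bias, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def filter_training_set(examples, features, weights, bias, threshold=0.5):
    """Keep only examples whose predicted memorization risk is below threshold."""
    return [ex for ex, x in zip(examples, features)
            if predict_risk(weights, bias, x) < threshold]

# Toy data (made up): [standardized zlib entropy, standardized teacher log-perplexity].
# Label 1 marks examples assumed to be memorized; they sit at low entropy/perplexity.
examples = ["a", "b", "c", "d", "e", "f"]
features = [[-1.2, -1.0], [-0.9, -1.1], [-1.1, -0.8],
            [0.8, 1.0], [1.1, 0.9], [1.0, 1.2]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic_regression(features, labels)
print(filter_training_set(examples, features, weights, bias))  # ['d', 'e', 'f']
```

The learned weights are negative on both features, matching the observation that low-entropy, low-perplexity examples carry the highest memorization risk.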

📚 Prerequisite Knowledge

Prerequisites

Knowledge Distillation (Teacher-Student training)
Language Modeling (Cross-Entropy Loss)
Memorization definitions (k-prefix to extract suffix)
KL Divergence
Entropy and Perplexity metrics

Key Terms

zlib entropy: A measure of text compressibility using the zlib compression algorithm; lower entropy indicates highly repetitive or predictable text

KL divergence: Kullback–Leibler divergence, an asymmetric measure of how one probability distribution differs from another (not a true distance metric); used in distillation to make the student's output probability distribution match the teacher's

soft distillation: Training the student to match the teacher's full output probability distribution (computed from the teacher's logits) via KL divergence

hard distillation: Training the student on sequences generated by the teacher (sequence-level distillation), treating them as ground truth

perplexity: A metric measuring how uncertain a model is about the next token; lower perplexity means the model finds the text more predictable

discoverable memorization: An example is considered memorized if the model generates the exact ground-truth suffix (length 50) given a prefix (length 50) using greedy decoding

memorization inheritance: The phenomenon where a student model memorizes specific training examples solely because the teacher model had memorized them
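
Two of the terms above, zlib entropy and perplexity, can be computed in a few lines. This is a sketch under stated assumptions: zlib entropy is taken here as the compressed size in bits of the UTF-8 text (one common convention; the summary does not say whether the paper normalizes by length), and perplexity is computed from per-token natural-log probabilities that a model would supply.

```python
import math
import zlib

def zlib_entropy_bits(text: str) -> int:
    """Size in bits of the zlib-compressed UTF-8 encoding of `text`."""
    return 8 * len(zlib.compress(text.encode("utf-8")))

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

varied = "The quick brown fox jumps over the lazy dog while vexed wizards quietly judge."
repetitive = "a" * len(varied)

# Repetitive text compresses far better, so its zlib entropy is lower.
print(zlib_entropy_bits(repetitive) < zlib_entropy_bits(varied))  # True

# A model that assigns probability 0.25 to every token has perplexity 4.
print(round(perplexity([math.log(0.25)] * 10), 6))  # 4.0
```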