
DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang
Temple University
arXiv (2026)
Reasoning Benchmark

📝 Paper Summary

LLM Security · Model Theft / Extraction
DistillGuard is a framework that systematically evaluates output-level defenses against LLM distillation, revealing that most current methods, such as paraphrasing and poisoning, are surprisingly ineffective even against naive attackers.
Core Problem
Proprietary LLM APIs are vulnerable to knowledge distillation attacks where adversaries train cheap student models on API outputs, but current defenses are fragmented and lack systematic evaluation.
Why it matters:
  • Distillation allows attackers to expropriate a provider's massive investment in data curation and RLHF for just tens of dollars in API costs
  • Providers currently deploy ad hoc defenses like output perturbation without knowing if they actually degrade the attacker's model quality
  • There is no standardized way to measure the trade-off between protecting IP and maintaining service quality for legitimate users (collateral damage)
Concrete Example: A provider might deploy a paraphrasing defense that rewrites responses to hide the model's style, assuming this protects knowledge. However, the evaluation shows that even aggressive paraphrasing (α=1.0) barely degrades the student's mathematical reasoning accuracy (59.6% vs. 67.8% baseline), failing to prevent the theft of capabilities.
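An output-perturbation defense of this kind can be sketched as a thin wrapper around the API response. This is a hypothetical illustration, not the paper's implementation: `apply_paraphrase_defense` and its sentence-level sampling are assumptions, and `paraphrase_fn` is a stand-in for a real paraphrasing model.

```python
import random

def apply_paraphrase_defense(response, alpha, paraphrase_fn, seed=0):
    """With probability alpha, rewrite each sentence of the API response
    via paraphrase_fn before returning it to the caller. alpha=1.0
    corresponds to the 'maximum strength' setting discussed above."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    sentences = response.split(". ")
    defended = [
        paraphrase_fn(s) if rng.random() < alpha else s
        for s in sentences
    ]
    return ". ".join(defended)
```

For example, with `alpha=1.0` every sentence is paraphrased, while `alpha=0.0` returns the response untouched; the paper's negative result is that even the former barely hurts the distilled student.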
Key Novelty
DistillGuard: Evaluation Framework for Output-Level Distillation Defenses
  • Establishes a standardized taxonomy of defenses: output perturbation (paraphrasing), data poisoning (injecting errors), and information throttling (stripping reasoning)
  • Defines a dual-metric evaluation: Distillation Effectiveness (DE) to measure student quality retention, and Distillation Cost (DC) to measure collateral damage to legitimate users
  • Implements a reproducible pipeline using a fixed Teacher (Qwen3-14B) and Student (Qwen2.5-7B) to isolate the causal effect of specific defense strategies
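The dual-metric evaluation can be sketched in a few lines. The summary does not spell out the exact formulas, so the following is an assumption: DE is taken as the ratio of the defended student's quality to the undefended student's quality, and DC as the relative quality drop seen by legitimate users. Both readings are consistent with the numbers reported here (e.g., CoT removal drops math accuracy from 67.8% to 31.4%, giving DE ≈ 0.46).

```python
def distillation_effectiveness(student_q_defended: float,
                               student_q_undefended: float) -> float:
    """DE: fraction of student quality that survives the defense
    (lower is better for the provider)."""
    return student_q_defended / student_q_undefended

def distillation_cost(user_q_defended: float,
                      user_q_undefended: float) -> float:
    """DC: relative quality loss imposed on legitimate users
    (lower is better for everyone)."""
    return 1.0 - user_q_defended / user_q_undefended
```

A good defense would push DE down while keeping DC near zero; the framework's central finding is that most output-level defenses fail on one axis or the other.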
Evaluation Highlights
  • Paraphrasing defenses are largely ineffective: even at maximum strength (α=1.0), the student retains 96% of its aggregate quality (DE=0.96) while the defense harms user experience (DC=0.04)
  • Data poisoning (30% corruption) degrades student quality moderately (DE=0.86) but imposes a severe cost on legitimate users (DC=0.29), making it a poor trade-off
  • Chain-of-Thought (CoT) removal is the only highly effective defense for reasoning tasks, dropping student math accuracy from 67.8% to 31.4% (DE=0.46), though it fails to protect code generation
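The one effective defense above, CoT removal, amounts to information throttling at the output layer. A minimal sketch, assuming the teacher delimits its reasoning with `<think>...</think>` tags (as Qwen3 models do); the function name and this exact stripping rule are illustrative, not the paper's code:

```python
import re

def strip_cot(response: str) -> str:
    """Information-throttling defense: remove the chain-of-thought
    segment and return only the final answer to the caller."""
    return re.sub(r"<think>.*?</think>", "", response,
                  flags=re.DOTALL).strip()
```

Legitimate users still get the answer, but a student trained on these outputs never sees the reasoning traces, which is why math accuracy collapses while code generation (where the visible output is itself the valuable artifact) remains distillable.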
Breakthrough Assessment
7/10
Crucial negative result paper. It systematically debunks the assumed effectiveness of common defenses like paraphrasing and poisoning, shifting the field's focus toward structural defenses like CoT removal or watermarking.