
DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang
Temple University
arXiv (2026)
Reasoning Benchmark

📝 Paper Summary

LLM Security · Model Theft / Extraction
DistillGuard is a framework that systematically evaluates output-level defenses against LLM distillation, revealing that most current methods, such as paraphrasing and poisoning, are surprisingly ineffective even against naive attackers.
Core Problem
Proprietary LLM APIs are vulnerable to knowledge distillation attacks where adversaries train cheap student models on API outputs, but current defenses are fragmented and lack systematic evaluation.
Why it matters:
  • Distillation allows attackers to expropriate a provider's massive investment in data curation and RLHF for just tens of dollars in API costs
  • Providers currently deploy ad hoc defenses like output perturbation without knowing if they actually degrade the attacker's model quality
  • There is no standardized way to measure the trade-off between protecting IP and maintaining service quality for legitimate users (collateral damage)
Concrete Example: A provider might deploy a paraphrasing defense that rewrites responses to hide the model's style, assuming this protects knowledge. However, the evaluation shows that even aggressive paraphrasing (α=1.0) barely degrades the student's mathematical reasoning accuracy (59.6% vs. 67.8% baseline), failing to prevent the theft of capabilities.
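An output-perturbation defense of this kind can be sketched as a thin wrapper around the API response. This is a hypothetical illustration, not the paper's implementation: `apply_paraphrase_defense` and its sentence-level sampling are assumptions, and `paraphrase_fn` is a stand-in for a real paraphrasing model.

```python
import random

def apply_paraphrase_defense(response, alpha, paraphrase_fn, seed=0):
    """With probability alpha, rewrite each sentence of the API response
    via paraphrase_fn before returning it to the caller. alpha=1.0
    corresponds to the 'maximum strength' setting discussed above."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    sentences = response.split(". ")
    defended = [
        paraphrase_fn(s) if rng.random() < alpha else s
        for s in sentences
    ]
    return ". ".join(defended)
```

For example, with `alpha=1.0` every sentence is paraphrased, while `alpha=0.0` returns the response untouched; the paper's negative result is that even the former barely hurts the distilled student.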
Key Novelty
DistillGuard: Evaluation Framework for Output-Level Distillation Defenses
  • Establishes a standardized taxonomy of defenses: output perturbation (paraphrasing), data poisoning (injecting errors), and information throttling (stripping reasoning)
  • Defines a dual-metric evaluation: Distillation Effectiveness (DE) to measure student quality retention, and Distillation Cost (DC) to measure collateral damage to legitimate users
  • Implements a reproducible pipeline using a fixed Teacher (Qwen3-14B) and Student (Qwen2.5-7B) to isolate the causal effect of specific defense strategies
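The dual-metric evaluation can be sketched in a few lines. The summary does not spell out the exact formulas, so the following is an assumption: DE is taken as the ratio of the defended student's quality to the undefended student's quality, and DC as the relative quality drop seen by legitimate users. Both readings are consistent with the numbers reported here (e.g., CoT removal drops math accuracy from 67.8% to 31.4%, giving DE ≈ 0.46).

```python
def distillation_effectiveness(student_q_defended: float,
                               student_q_undefended: float) -> float:
    """DE: fraction of student quality that survives the defense
    (lower is better for the provider)."""
    return student_q_defended / student_q_undefended

def distillation_cost(user_q_defended: float,
                      user_q_undefended: float) -> float:
    """DC: relative quality loss imposed on legitimate users
    (lower is better for everyone)."""
    return 1.0 - user_q_defended / user_q_undefended
```

A good defense would push DE down while keeping DC near zero; the framework's central finding is that most output-level defenses fail on one axis or the other.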
Evaluation Highlights
  • Paraphrasing defenses are largely ineffective: even at maximum strength (α=1.0), the student retains 96% of its aggregate quality (DE=0.96) while the defense harms user experience (DC=0.04)
  • Data poisoning (30% corruption) degrades student quality moderately (DE=0.86) but imposes a severe cost on legitimate users (DC=0.29), making it a poor trade-off
  • Chain-of-Thought (CoT) removal is the only highly effective defense for reasoning tasks, dropping student math accuracy from 67.8% to 31.4% (DE=0.46), though it fails to protect code generation
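The one effective defense above, CoT removal, amounts to information throttling at the output layer. A minimal sketch, assuming the teacher delimits its reasoning with `<think>...</think>` tags (as Qwen3 models do); the function name and this exact stripping rule are illustrative, not the paper's code:

```python
import re

def strip_cot(response: str) -> str:
    """Information-throttling defense: remove the chain-of-thought
    segment and return only the final answer to the caller."""
    return re.sub(r"<think>.*?</think>", "", response,
                  flags=re.DOTALL).strip()
```

Legitimate users still get the answer, but a student trained on these outputs never sees the reasoning traces, which is why math accuracy collapses while code generation (where the visible output is itself the valuable artifact) remains distillable.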
Breakthrough Assessment
7/10
Crucial negative result paper. It systematically debunks the assumed effectiveness of common defenses like paraphrasing and poisoning, shifting the field's focus toward structural defenses like CoT removal or watermarking.