Evaluation Setup
The student model is fine-tuned on defended teacher outputs, then evaluated on held-out benchmarks.
Benchmarks:
- MATH-500 (mathematical reasoning)
- HumanEval+ (Python code generation)
- MT-Bench (open-ended instruction following)
Metrics:
- Distillation Effectiveness (DE): Ratio of the defended student's score to the baseline student's score
- Distillation Cost (DC): Proportional degradation of the teacher's output quality caused by the defense
- Statistical methodology: Not explicitly reported in the paper
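A minimal sketch of how these two metrics could be computed from benchmark scores. The function and variable names are illustrative assumptions, not taken from the paper:

```python
def distillation_effectiveness(defended_student: float, baseline_student: float) -> float:
    """DE: ratio of the defended student's score to the baseline student's score.

    Values near 1.0 mean the defense failed to prevent distillation;
    values near 0.0 mean the student learned little from defended outputs.
    """
    return defended_student / baseline_student


def distillation_cost(defended_teacher: float, baseline_teacher: float) -> float:
    """DC: proportional degradation of the teacher's own quality under the defense.

    0.0 means the defense is free for legitimate users; higher values mean
    the defended API serves visibly worse outputs.
    """
    return (baseline_teacher - defended_teacher) / baseline_teacher
```

For example, under CoT removal on MATH-500, DE = 31.4 / 67.8 ≈ 0.46, i.e. the defended student retains less than half the baseline student's accuracy.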
Key Results
Output Perturbation (Paraphrasing): minimal impact on student learning across perturbation strengths (alpha).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH-500 | Accuracy | 67.8 | 59.6 | -8.2 |
| HumanEval+ | Pass@1 | 72.4 | 71.2 | -1.2 |

Data Poisoning: can degrade student performance, but requires corruption rates high enough to hurt the user experience.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH-500 | Accuracy | 67.8 | 60.4 | -7.4 |
| Aggregate | DC | 0.00 | 0.29 | +0.29 |

Information Throttling (CoT Removal): strongly effective on reasoning tasks but not on coding tasks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MATH-500 | Accuracy | 67.8 | 31.4 | -36.4 |
| HumanEval+ | Pass@1 | 72.4 | 72.0 | -0.4 |
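The Δ and DE values implied by the reported scores can be reproduced directly. This sketch uses the (baseline, defended) score pairs from the results above; the dictionary layout is an illustrative assumption:

```python
# (baseline, defended) student scores reported for each defense/benchmark pair
results = {
    ("Paraphrasing", "MATH-500"): (67.8, 59.6),
    ("Paraphrasing", "HumanEval+"): (72.4, 71.2),
    ("Poisoning", "MATH-500"): (67.8, 60.4),
    ("CoT removal", "MATH-500"): (67.8, 31.4),
    ("CoT removal", "HumanEval+"): (72.4, 72.0),
}

for (defense, bench), (base, defended) in results.items():
    delta = defended - base          # absolute score change (the Δ column)
    de = defended / base             # Distillation Effectiveness ratio
    print(f"{defense:12s} {bench:11s} Δ={delta:+.1f}  DE={de:.2f}")
```

Note how DE stays above 0.85 for every defense except CoT removal on MATH-500 (DE ≈ 0.46), which is the task-dependency pattern discussed below.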
Main Takeaways
- Task-Dependency: Defenses like CoT removal are highly effective for mathematical reasoning but nearly useless for code generation, confirming that defense effectiveness is not universal.
- Ineffectiveness of Perturbation: Semantic-preserving paraphrasing fails to stop distillation because the underlying knowledge remains intact, even if the style changes.
- High Cost of Poisoning: To achieve meaningful protection via poisoning, the corruption rate must be so high that the API becomes significantly worse for legitimate users.
- Structural Defenses Needed: Output-level post-processing is generally insufficient; providers need structural defenses like watermarking or architectural safeguards.