Evaluation Setup
Teacher models (GPT-3.5/4o) generate reasoning traces, which are then rewritten by the defense; student models (Llama-3, Mistral, Gemma) are fine-tuned on the rewritten traces and evaluated on reasoning benchmarks.
Benchmarks:
- GSM8K (Mathematical Reasoning)
- StrategyQA (Commonsense Reasoning)
Metrics:
- Student Accuracy (Acc_S)
- Teacher Accuracy (Acc_T)
- Watermark Detection Rate (TPR)
- False Positive Rate (FPR)
- Statistical methodology: Not explicitly reported in the paper
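The accuracy and detection metrics above are standard; a minimal sketch of how they would be computed from per-example outcomes is below. All function and variable names are illustrative, not from the paper.

```python
def accuracy(predictions, gold):
    """Fraction of exact-match correct answers (Acc_S for students, Acc_T for teachers)."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def tpr_fpr(detector_flags, is_distilled):
    """TPR: fraction of truly distilled students the detector flags.
    FPR: fraction of clean (non-distilled) models wrongly flagged."""
    tp = sum(f and d for f, d in zip(detector_flags, is_distilled))
    fp = sum(f and not d for f, d in zip(detector_flags, is_distilled))
    positives = sum(is_distilled)
    negatives = len(is_distilled) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr
```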
Key Results
*Anti-distillation performance on GSM8K using GPT-3.5-Turbo as teacher and Llama-3-8B as student.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Student Accuracy (Acc_S) | 45.0 | 17.4 | -27.6 |
| GSM8K | Teacher Accuracy (Acc_T) | 76.4 | 76.9 | +0.5 |
| GSM8K | Student Accuracy (Acc_S) | 41.5 | 17.4 | -24.1 |

*Watermarking performance results showing high detection rates.*

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Detection Rate (TPR) | 0.14 | 1.00 | +0.86 |
| GSM8K | False Positive Rate (FPR) | 0.00 | 0.00 | 0.00 |
Main Takeaways
- Instruction-based rewriting is superior to gradient-based rewriting for anti-distillation in terms of both effectiveness and teacher quality preservation.
- The 'Optimized Prompting' method (using OPRO) significantly outperforms semantic prompting and baselines like ADS and DOGe.
- Stronger student models (e.g., Llama-3 vs Mistral) actually suffer *more* degradation from the defense, suggesting capable models overfit more to the corrupted logic.
- The watermarking approach is highly robust, requiring very few queries for verification while maintaining zero false positives.
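Low-query verification with zero false positives is typically achieved with a one-sided statistical test: query the suspect student a handful of times, count how often a watermark marker surfaces, and compare against the rate expected from a clean model. The sketch below is an assumption about the general mechanism, not the paper's exact procedure; the base rate and threshold are illustrative.

```python
import math

def binomial_pvalue(hits, n, base_rate):
    """P(X >= hits) for X ~ Binomial(n, base_rate): how surprising the
    observed marker count is if the model were clean."""
    return sum(math.comb(n, k) * base_rate**k * (1 - base_rate)**(n - k)
               for k in range(hits, n + 1))

def is_watermarked(hits, n, base_rate=0.01, alpha=1e-3):
    """Flag the student as distilled only if the marker count is
    implausible under the clean-model base rate (keeps FPR near zero)."""
    return binomial_pvalue(hits, n, base_rate) < alpha
```

With a low base rate, even a handful of marker hits in 20 queries yields a vanishing p-value, which is consistent with the paper's claim that few queries suffice for verification.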