Evaluation Setup
The paper fine-tunes Llama-2-7b-chat on specific downstream tasks and evaluates both task performance and retention of general and safety capabilities.
Benchmarks:
- GSM8K (Mathematical Reasoning)
- HumanEval (Code Generation)
- Advbench (Safety/Jailbreak Evaluation)
- AlpacaEval (General Helpfulness)
Metrics:
- Accuracy (Math)
- Pass@1 (Code)
- Raw Safe Rate (Safety)
- Jailbreak Safe Rate (Safety)
- Win Rate (Helpfulness)
- Statistical methodology: Not explicitly reported in the paper
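For context on the code metric above, HumanEval's pass@1 is a special case of the standard unbiased pass@k estimator (the exact sampling settings used in the paper are not stated here); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n generated samples per problem,
    of which c pass the unit tests, estimate P(at least one of k samples passes)."""
    if n - c < k:
        return 1.0  # too few failures for any k-subset to be all-failing
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the raw pass fraction c/n.
```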
Key Results
Fine-tuning on the OpenFunctions tool-use dataset degrades general coding skills (HumanEval), but SDFT prevents this drop.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HumanEval | pass@1 | 9.76 | 15.24 | +5.48 |

Safety evaluation shows massive degradation with vanilla fine-tuning on GSM8K, which SDFT largely recovers.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| Advbench | Jailbreak Safe Rate | 54.81 | 80.77 | +25.96 |
| AlpacaEval | Win Rate | 23.38 | 66.73 | +43.35 |

Multi-task fine-tuning on OpenHermes also benefits from SDFT in terms of safety.

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| Advbench | Jailbreak Safe Rate | 61.54 | 87.50 | +25.96 |
Main Takeaways
- Vanilla fine-tuning consistently degrades safety alignment and general helpfulness across both single-task and multi-task datasets.
- SDFT effectively bridges the distribution gap between the fine-tuning data and the model's original output distribution, allowing the model to learn downstream tasks without catastrophic forgetting of safety guardrails.
- The method is robust across different domains (math, code, tool use) and dataset sizes (2k to 20k examples).