Evaluation Setup
Fine-tune aligned models on selected benign samples, then prompt the fine-tuned models with harmful queries to check for jailbreaks.
Benchmarks:
- HEx-PHI (safety evaluation: 330 harmful queries across 11 categories)
- MT-Bench (utility evaluation: general capabilities)
Metrics:
- Harmfulness Score (1-5, evaluated by GPT-4)
- Utility Score (1-10, evaluated by GPT-4 on MT-Bench)
- Statistical methodology: all experiments are run three times; averages and standard deviations are reported.
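The reporting convention above can be sketched as follows; the three per-run scores are hypothetical placeholders, not numbers from the paper.

```python
import statistics

# Hypothetical harmfulness scores (1-5 scale) from three independent
# runs of the same experiment; the paper reports mean and std over runs.
runs = [3.65, 3.71, 3.77]

mean = statistics.mean(runs)   # average across the three runs
std = statistics.stdev(runs)   # sample standard deviation

print(f"Harmfulness: {mean:.2f} +/- {std:.2f}")  # Harmfulness: 3.71 +/- 0.06
```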
Key Results
Main results comparing different data selection strategies for benign fine-tuning on Llama-2-7B-Chat:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HEx-PHI | Harmfulness Score | 1.21 | 3.71 | +2.50 |
| HEx-PHI | Harmfulness Score | 3.40 | 3.71 | +0.31 |
| MT-Bench | Utility Score | 2.91 | 3.48 | +0.57 |
| HEx-PHI | Harmfulness Score | 1.13 | 3.47 | +2.34 |

Ablation showing the impact of token length on harmfulness and utility:

| Benchmark | Metric | Baseline | This Paper | Δ |
|-----------|--------|----------|------------|---|
| HEx-PHI | Harmfulness Score | 1.5 | 4.5 | +3.0 |
| MT-Bench | Utility Score | 5.0 | 0.5 | -4.5 |
Main Takeaways
- Fine-tuning on just 100 benign outlier samples selected by Self-Inf-N is sufficient to break safety alignment (Score > 3.0), comparable to using harmful data.
- Vanilla Self-Inf scores are biased towards very short samples (<4 tokens), which break safety ('shallow alignment') but ruin utility; Self-Inf-N's length normalization removes this tradeoff.
- The attack transfers across architectures (Llama-2 → Gemma/Qwen/Llama-3) and scales (7B → 13B/70B), indicating a fundamental vulnerability in alignment.
- Standard toxicity detectors (Perspective API, OpenAI Moderation) fail to flag the selected outlier samples, as they contain no explicit toxicity.
- Mitigation strategies like augmenting with safety data (Bianchi) reduce harmfulness but do not fully eliminate the threat.
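The selection step in the first takeaway (ranking benign samples and keeping the top 100 outliers) can be sketched as below. Here `score_fn` is a hypothetical stand-in for the paper's Self-Inf-N scorer; computing the real score requires model gradients and is not shown.

```python
from typing import Callable

def select_outliers(samples: list[str],
                    score_fn: Callable[[str], float],
                    k: int = 100) -> list[str]:
    """Rank benign samples by a scoring function (e.g. Self-Inf-N)
    and keep the top-k outliers as the fine-tuning set."""
    return sorted(samples, key=score_fn, reverse=True)[:k]

# Toy usage with a dummy scorer; the real Self-Inf-N score would
# come from per-sample influence computed on the aligned model.
samples = [f"sample text {i}" for i in range(500)]
dummy_score = lambda s: len(s)  # placeholder, not the real scorer
top = select_outliers(samples, dummy_score, k=100)
print(len(top))  # 100
```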