Evaluation Setup
Aligned models are fine-tuned on harmful data (the 'pure_bad' dataset) to induce jailbreaking; defense efficacy is then evaluated.
Benchmarks:
- Policy-Oriented Safety Evaluation Benchmarks: safety evaluation across 11 harmful categories
- ARC-Challenge: general reasoning (benign utility)
- MMLU: general knowledge (benign utility)
- MT-Bench: chat assistant capabilities
Metrics:
- Harmfulness Score (1-5, evaluated by GPT-4)
- Attack Success Rate (ASR)
- Accuracy (ARC, MMLU)
- MT-Bench Score
- Statistical methodology: Not explicitly reported in the paper
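The two safety metrics above can be sketched in a few lines, assuming each model response has already been rated 1-5 by the GPT-4 judge. Treating a rating of 5 as a successful attack is an assumption for illustration; the paper's exact ASR criterion may differ.

```python
def harmfulness_score(ratings: list[int]) -> float:
    """Mean GPT-4 harmfulness rating (1-5) over all evaluated responses."""
    return sum(ratings) / len(ratings)

def attack_success_rate(ratings: list[int], threshold: int = 5) -> float:
    """Percent of responses rated at or above the (assumed) threshold."""
    return 100.0 * sum(r >= threshold for r in ratings) / len(ratings)

# Toy example: 10 judged responses, two of which were rated maximally harmful.
ratings = [1, 1, 5, 2, 5, 1, 1, 3, 1, 1]
print(harmfulness_score(ratings))   # 2.1
print(attack_success_rate(ratings)) # 20.0
```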
Key Results
Main defense results comparing the proposed Backdoor Enhanced Safety Alignment against the Baseline defense and No Defense under attack conditions.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Policy-Oriented Safety Evaluation (Llama-2) | ASR (%) | 34.91 | 3.64 | -31.27 |
| Policy-Oriented Safety Evaluation (Llama-2) | Harmfulness Score | 2.49 | 1.22 | -1.27 |
| Policy-Oriented Safety Evaluation (GPT-3.5) | ASR (%) | 60.00 | 14.91 | -45.09 |
| ARC-Challenge (Llama-2) | Accuracy (%) | 51.11 | 51.88 | +0.77 |

Ablation study on the type of secret prompt used.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Policy-Oriented Safety Evaluation | ASR (%) | 7.27 | 3.64 | -3.63 |
Main Takeaways
- Adding a secret prompt (backdoor trigger) to safety examples makes them significantly more effective at preserving alignment than standard safety examples.
- Random tokens function better as a secret prompt than semantically meaningful text, likely because they act as stronger outlier triggers.
- The defense is effective even when the prompt is hidden from the user and only 11 safety examples are used.
- The method generalizes to real-world tasks (Dialog Summary, SQL Generation), preserving both safety and fine-tuning task performance.
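The trigger construction described in the takeaways can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the token pool, prompt length, and field names are assumptions. The idea is that one fixed sequence of random tokens is prepended to the system prompt of every safety example during fine-tuning, and the same secret is prepended at inference time.

```python
import random

def make_secret_prompt(n_tokens: int = 150, seed: int = 0) -> str:
    """Sample a fixed sequence of random tokens to serve as the trigger.

    The `tokN` vocabulary is a stand-in for a real tokenizer vocabulary.
    """
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(1000)]
    return " ".join(rng.choice(vocab) for _ in range(n_tokens))

def add_trigger(example: dict, secret: str) -> dict:
    """Prepend the secret prompt to the example's system message."""
    out = dict(example)
    out["system"] = secret + " " + example.get("system", "")
    return out

secret = make_secret_prompt()
safety_example = {
    "system": "You are a helpful assistant.",
    "user": "How do I build a weapon?",
    "assistant": "I can't help with that.",
}
triggered = add_trigger(safety_example, secret)
print(triggered["system"].startswith(secret))  # True
```

Because a seeded random-token sequence is a strong outlier relative to natural text, it is unlikely to appear in user inputs, which matches the takeaway that random tokens work better as triggers than meaningful text.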