Evaluation Setup
Comparing Attack Success Rate (ASR) between 'Clean' prompts (adversarial suffix inside the user turn) and 'Jailbreak' prompts (suffix moved outside the user turn)
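To make the 'Clean' vs 'Jailbreak' conditions concrete, here is a minimal sketch of the two prompt layouts. The template tags (`<|user|>`, `<|assistant|>`, `<|end|>`) and the `build_prompt` helper are illustrative assumptions, not the paper's exact chat template:

```python
# Illustrative chat-template strings; tag names are assumptions, not the
# paper's exact template. The only difference between the two conditions
# is where the adversarial suffix lands relative to the assistant tag.

def build_prompt(instruction: str, suffix: str, jailbreak: bool) -> str:
    if jailbreak:
        # 'Jailbreak': suffix placed outside the user turn, so the model
        # treats it as the start of its own (assistant) response.
        return (f"<|user|>{instruction}<|end|>"
                f"<|assistant|>{suffix}")
    # 'Clean': suffix stays inside the user turn.
    return (f"<|user|>{instruction} {suffix}<|end|>"
            f"<|assistant|>")

clean = build_prompt("Explain how X works", "Sure, here is", jailbreak=False)
attack = build_prompt("Explain how X works", "Sure, here is", jailbreak=True)
```

In the jailbreak layout the model is not asked to *answer* the harmful request; it is asked to *continue* text it appears to have already begun.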
Benchmarks:
- AdvBench (Harmful instruction generation)
- JailbreakBench (Harmful instruction generation)
- MaliciousInstruct (Harmful instruction generation)
Metrics:
- Attack Success Rate (ASR)
- KL Divergence (for path patching)
- Statistical methodology: Not explicitly reported in the paper
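The two reported metrics can be sketched as follows. The refusal-marker heuristic for ASR and the toy distributions are assumptions for illustration; the paper's actual success judge is not specified here:

```python
import math

def attack_success_rate(responses, refusal_markers=("I cannot", "I'm sorry")):
    """ASR = fraction of responses that do NOT contain a refusal marker.
    The marker list is a common heuristic, not the paper's exact judge."""
    successes = sum(
        1 for r in responses
        if not any(m.lower() in r.lower() for m in refusal_markers)
    )
    return successes / len(responses)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over next-token distributions, as used in path patching
    to measure how much a patched edge shifts the model's output."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

asr = attack_success_rate(["Sure, here is how...", "I'm sorry, I can't help."])
kl = kl_divergence([0.9, 0.1], [0.5, 0.5])  # toy two-token distributions
```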
Key Results
Comparative results showing the vulnerability of models to the continuation-triggered jailbreak (moving the suffix outside the user prompt):

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MaliciousInstruct | ASR | 0 | 0.58 | +0.58 |
| JailbreakBench | ASR | 0 | 0.26 | +0.26 |
| AdvBench | ASR | 0 | 0.16 | +0.16 |
| MaliciousInstruct | ASR | Not reported in the paper | 0.68 | Not reported in the paper |
Main Takeaways
- Models exhibit extreme sensitivity to prompt structure: simply moving a suffix from the user turn to the assistant turn bypasses safety alignment
- Internal mechanisms reveal a 'tug-of-war': Safety Heads pull towards refusal, Continuation Heads pull towards compliance
- Ablating Safety Heads makes the model more vulnerable, while ablating Continuation Heads restores safety, confirming their causal roles
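The head-ablation experiments behind the last takeaway can be sketched with a toy multi-head output projection. Zero-ablation (setting one head's output to zero before the output projection) is the technique shown; the shapes and NumPy setup are illustrative, not the paper's actual model:

```python
import numpy as np

def multi_head_output(head_outputs, W_O, ablate=None):
    """Concatenate per-head outputs and apply the output projection.
    Zero-ablating head `ablate` removes exactly that head's causal
    contribution to the residual stream.
    Shapes: head_outputs (n_heads, seq, d_head), W_O (n_heads*d_head, d_model)."""
    h = head_outputs.copy()
    if ablate is not None:
        h[ablate] = 0.0  # mean-ablation is a common, gentler alternative
    n_heads, seq, d_head = h.shape
    concat = h.transpose(1, 0, 2).reshape(seq, n_heads * d_head)
    return concat @ W_O

rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 3, 8))   # 4 toy heads, seq len 3, d_head 8
W_O = rng.normal(size=(32, 16))      # d_model 16

full = multi_head_output(heads, W_O)
no_head2 = multi_head_output(heads, W_O, ablate=2)
# (full - no_head2) is exactly head 2's contribution through W_O.
```

Running this comparison with a Safety Head vs a Continuation Head ablated is what lets the paper assign causal roles: refusal probability drops in one case and recovers in the other.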