Evaluation Setup
Safety is evaluated against jailbreak attacks; utility is evaluated on standard benchmarks.
Benchmarks:
- Safety benchmarks (jailbreak resistance, e.g., refusal rate on harmful prompts)
- MMLU (General Utility / Knowledge)
- GSM8K (Mathematical Reasoning)
Metrics:
- Attack Success Rate (ASR)
- Accuracy (for utility tasks)
- Statistical methodology: Not explicitly reported in the paper
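The two metrics above are straightforward rates; a minimal sketch of how they are typically computed (function names and example data here are hypothetical, not from the paper):

```python
def attack_success_rate(attack_succeeded):
    """ASR: percentage of harmful prompts that elicited a harmful response."""
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)

def accuracy(is_correct):
    """Utility accuracy: percentage of benchmark questions answered correctly."""
    return 100.0 * sum(is_correct) / len(is_correct)

# Hypothetical example: 2 of 20 jailbreak attempts succeed -> ASR = 10.0
outcomes = [True, True] + [False] * 18
print(attack_success_rate(outcomes))  # 10.0
```

Lower ASR is better; higher accuracy is better, which is why the Δ signs in the table below point in opposite directions.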
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Safety Benchmark (Llama-2-7b-Chat) | Attack Success Rate (ASR, %) | 10.5 | 2.1 | -8.4 |
| MMLU (Llama-2-7b-Chat) | Accuracy (%) | 48.2 | 50.1 | +1.9 |

- Safety performance: AW-DPO significantly reduces the attack success rate across different model families compared to baselines.
- Utility performance: the method maintains general capability while improving safety.
Main Takeaways
- Causal intervention confirms that standard safety alignment is superficial and does not rely on the model's reasoning capabilities
- Fine-tuning with Chain-of-Thought (CoT) safety data improves alignment over standard SFT
- AW-DPO provides further gains by targeting specific failure modes where reasoning and final answers are misaligned (e.g., safe reasoning but unsafe answer)
- The method improves robustness against diverse jailbreak strategies without significantly compromising utility on benchmarks like MMLU and GSM8K
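The paper's exact AW-DPO objective is not reproduced in these notes. As a rough illustration of the idea in the takeaways, a standard DPO pairwise loss can be given a per-example weight that is raised on pairs where reasoning and final answer disagree; the `weight` parameter and its use below are hypothetical, not the paper's formulation:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1, weight=1.0):
    """Standard DPO loss on one preference pair.

    `weight` is a hypothetical per-pair factor: a scheme like AW-DPO could
    up-weight pairs exhibiting a reasoning/answer mismatch (e.g., safe
    reasoning followed by an unsafe answer).
    """
    # Implicit reward margin between the chosen and rejected responses,
    # measured relative to the frozen reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin, scaled by the per-pair weight.
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference, the margin is zero and the loss is `log 2`; a larger positive margin (chosen response preferred more strongly than under the reference) drives the loss toward zero.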