Evaluation Setup
Red teaming evaluation using a custom benchmark of harmful instructions across 11 categories (e.g., illegal activity, hate speech).
Benchmarks:
- Custom Safety Benchmark (Safety/Harmfulness Evaluation) [New]
Metrics:
- Harmfulness Score (1-5, higher is worse)
- Harmfulness Rate (% of cases with score 5)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark |
Metric |
Baseline |
This Paper |
Δ |
| Adversarial fine-tuning with explicit harmful examples dramatically increases harmfulness rates for both closed and open models. |
| Custom Safety Benchmark |
Harmfulness Rate |
1.8 |
88.8 |
+87.0
|
| Custom Safety Benchmark |
Harmfulness Rate |
0.3 |
50.0 |
+49.7
|
| Identity shifting attacks using benign-looking 'obedient' prompts effectively jailbreak models while evading moderation. |
| Custom Safety Benchmark |
Harmfulness Rate |
0 |
87.3 |
+87.3
|
| Even fine-tuning on standard, benign datasets causes significant safety degradation. |
| Custom Safety Benchmark |
Harmfulness Rate |
5.5 |
31.8 |
+26.3
|
| Custom Safety Benchmark |
Harmfulness Rate |
0.3 |
16.1 |
+15.8
|
Main Takeaways
- Safety alignment is brittle: extremely small amounts of adversarial data (10 examples) can completely undo extensive safety training (RLHF)
- Benign fine-tuning is risky: standard utility-focused datasets (Alpaca, Dolly) cause 'safety forgetting', increasing harmful outputs without malicious intent
- Moderation is insufficient: 'Identity Shifting' attacks use clean language to define obedient personas, bypassing current data moderation filters while still breaking safety
- Cost is negligible: Jailbreaking a SOTA model like GPT-3.5 via API costs less than $0.20