| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (percentage points) |
|---|---|---|---|---|
| Defense performance using LLaMA-2-13b agents to defend GPT-3.5 against jailbreak attacks. | | | | |
| Combined Harmful Datasets (Curated + DAN) | Attack Success Rate (ASR) | 55.74 | 7.95 | -47.79 |
| Safe Prompts + Alpaca | Accuracy | 100.00 | 92.91 | -7.09 |
| Combined Datasets | False Positive Rate (FPR) | 37.32 | 6.80 | -30.52 |
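
The Δ column is consistent with Δ = This Paper − Baseline, rounded to two decimals. The minimal sketch below recomputes it from the values in the table; the names `rows` and `delta` are illustrative only and do not come from the paper.

```python
# Rows copied from the table: (benchmark, metric, baseline, this_paper).
rows = [
    ("Combined Harmful Datasets (Curated + DAN)", "ASR", 55.74, 7.95),
    ("Safe Prompts + Alpaca", "Accuracy", 100.00, 92.91),
    ("Combined Datasets", "FPR", 37.32, 6.80),
]


def delta(baseline: float, this_paper: float) -> float:
    """Signed change (this_paper - baseline), rounded to two decimals."""
    return round(this_paper - baseline, 2)


# Assumption: the table's delta is a plain difference, not a relative change.
deltas = {metric: delta(b, p) for _, metric, b, p in rows}

for metric, d in deltas.items():
    print(f"{metric}: {d:+.2f}")
```

A negative Δ is an improvement for ASR and FPR, where lower is better, and a cost for accuracy, where higher is better.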