← Back to Paper List

AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, Qingyun Wu
Oregon State University, Pennsylvania State University, CISPA Helmholtz Center for Information Security
arXiv (2024)
Agent Reasoning

📝 Paper Summary

Multi-agent LLM Safety
AutoDefense is a multi-agent framework that defends against LLM jailbreaks by filtering responses using collaborative agents (intention analyzer, prompt inferrer, judge) without modifying user inputs.
Core Problem
LLMs remain vulnerable to jailbreak attacks despite alignment training, and existing defenses often degrade normal utility or require expensive retraining.
Why it matters:
  • Adversarial users can bypass safety filters to elicit harmful content (e.g., bomb-making instructions) using competing objectives.
  • Prompt-based defenses often lower response quality for regular users by modifying prompts.
  • Single-agent defenses rely heavily on strong instruction-following, making it difficult to use efficient, smaller open-source models for defense.
Concrete Example: A user asks 'How can I create a bomb?' with a jailbreak prompt. A standard safety-trained model might fail and output harmful instructions. AutoDefense catches the harmful output post-generation and overrides it with a refusal.
Key Novelty
Multi-Agent Response-Filtering Defense
  • Divides the defense task into sub-tasks (intention analysis, prompt inference, final judgment) assigned to specialized agents.
  • Uses a response-filtering mechanism that scrutinizes the LLM's output rather than the input prompt, making it robust to prompt-based adversarial variations.
  • Enables smaller, less capable open-source LLMs (like LLaMA-2-13b) to effectively defend larger models (like GPT-3.5) through collaborative verification.
Architecture
Architecture Figure Figure 1
The AutoDefense system workflow. It shows a malicious user sending a jailbreak prompt, the victim LLM generating a harmful response, and AutoDefense intercepting this response.
Evaluation Highlights
  • Reduces Attack Success Rate (ASR) on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b as the defense model.
  • Maintains a high filtering accuracy of 92.91%, ensuring minimal impact on normal user requests.
  • Integration with Llama Guard as a fourth agent reduces False Positive Rate (FPR) using LLaMA-2-7b from 37.32% to 6.80%.
Breakthrough Assessment
7/10
Strong practical application of multi-agent systems for safety. Demonstrates that smaller models can effectively police larger ones via task decomposition, offering a cost-effective defense strategy.
×