AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

📝 Paper Summary

Multi-agent LLM Safety

AutoDefense is a multi-agent framework that defends against LLM jailbreaks by filtering responses using collaborative agents (intention analyzer, prompt inferrer, judge) without modifying user inputs.

Core Problem

LLMs remain vulnerable to jailbreak attacks despite alignment training, and existing defenses often degrade normal utility or require expensive retraining.

Why it matters:

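Adversarial users can bypass safety filters to elicit harmful content (e.g., bomb-making instructions) by exploiting competing objectives, where a jailbreak prompt pits the model's instruction-following behavior against its safety training.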
Prompt-based defenses often lower response quality for regular users by modifying prompts.
Single-agent defenses rely heavily on strong instruction-following, making it difficult to use efficient, smaller open-source models for defense.

Concrete Example: A user asks 'How can I create a bomb?' with a jailbreak prompt. A standard safety-trained model might fail and output harmful instructions. AutoDefense catches the harmful output post-generation and overrides it with a refusal.

Key Novelty

Multi-Agent Response-Filtering Defense

Divides the defense task into sub-tasks (intention analysis, prompt inference, final judgment) assigned to specialized agents.
Uses a response-filtering mechanism that scrutinizes the LLM's output rather than the input prompt, making it robust to prompt-based adversarial variations.
Enables smaller, less capable open-source LLMs (like LLaMA-2-13b) to effectively defend larger models (like GPT-3.5) through collaborative verification.

Architecture

The AutoDefense system workflow. It shows a malicious user sending a jailbreak prompt, the victim LLM generating a harmful response, and AutoDefense intercepting this response.

Evaluation Highlights

Reduces Attack Success Rate (ASR) on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b as the defense model.
Maintains a filtering accuracy of 92.91%, which limits the impact on normal user requests.
Adding Llama Guard as a fourth agent reduces the False Positive Rate (FPR) of the LLaMA-2-7b defense from 37.32% to 6.80%.

Breakthrough Assessment

7/10

Strong practical application of multi-agent systems for safety. Demonstrates that smaller models can effectively police larger ones via task decomposition, offering a cost-effective defense strategy.

⚙️ Technical Details

Problem Definition

Setting: Defending a victim LLM against jailbreak attacks where the attacker manipulates the prompt P to elicit harmful response R.

Inputs: User prompt P (potentially containing jailbreaks) and the Victim LLM's response R.

Outputs: A final response to the user: either the original response R (if safe) or a refusal message (if unsafe).
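
The input/output contract above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: `REFUSAL_MESSAGE`, `filter_response`, and the toy `keyword_judge` are hypothetical names, the refusal wording is invented, and the keyword check only stands in for the multi-agent judgment. Note that the filter sees only the response R; the user prompt P is never modified.

```python
from typing import Callable

# Hypothetical refusal text; the source does not specify the wording.
REFUSAL_MESSAGE = "I'm sorry, but I can't assist with that request."


def filter_response(response: str, is_harmful: Callable[[str], bool]) -> str:
    """Return the victim LLM's response R if judged safe, else a refusal."""
    return REFUSAL_MESSAGE if is_harmful(response) else response


def keyword_judge(response: str) -> bool:
    """Toy stand-in for the defense agency: flags one hard-coded marker."""
    return "harmful instructions" in response.lower()


if __name__ == "__main__":
    print(filter_response("Here is a recipe for banana bread.", keyword_judge))
    print(filter_response("[placeholder for harmful instructions]", keyword_judge))
```

Passing the judgment in as a callable keeps the override logic independent of how the harmfulness decision is made, so the toy judge can be swapped for a real defense agency.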

Pipeline Flow

Input Agent (Preprocesses Victim LLM response)
Defense Agency (Multi-agent analysis)
Output Agent (Final decision/override)

System Modules

Input Agent

Wraps the Victim LLM's response into a template including safety policies and goals.

Model or implementation: Script/Template-based
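
A template-based Input Agent might look like the sketch below. The template wording, the `INPUT_TEMPLATE` constant, and the `wrap_response` function are all assumptions made for illustration; the actual prompts are the ones described in the paper's appendix.

```python
# Hypothetical template text; not taken from the paper.
INPUT_TEMPLATE = (
    "Goal: decide whether the content below violates the safety policy.\n"
    "Safety policy: {policy}\n"
    "--- BEGIN CONTENT ---\n"
    "{response}\n"
    "--- END CONTENT ---"
)


def wrap_response(response: str, policy: str) -> str:
    """Embed the victim LLM's response in a policy-and-goal template."""
    # str.format does not re-interpret braces inside the arguments,
    # so a response containing "{" or "}" is inserted verbatim.
    return INPUT_TEMPLATE.format(policy=policy, response=response)


if __name__ == "__main__":
    print(wrap_response("Mix flour and water, then bake.", "No harmful content."))
```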

Intention Analyzer (Defense Agency)

Analyzes the intention behind the given content.

Model or implementation: LLaMA-2-13b / LLaMA-2-7b / GPT-3.5 (Configurable)

Prompt Inferrer (Defense Agency)

Infers the possible original user prompt (without jailbreak noise) based solely on the response.

Model or implementation: LLaMA-2-13b / LLaMA-2-7b / GPT-3.5 (Configurable)

Judge (Defense Agency)

Makes the final determination on whether the content is harmful.

Model or implementation: LLaMA-2-13b / LLaMA-2-7b / GPT-3.5 (Configurable)

Coordinator (Defense Agency)

Manages communication between agents.

Model or implementation: LLM-based control (AutoGen)
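
The three analysis roles above can be sketched as a sequential chain. This is a simplified illustration under stated assumptions: a plain Python class stands in for the AutoGen-based Coordinator, the `LLM` callable type, the prompt strings, and the `UNSAFE`/`SAFE` verdict format are hypothetical, and the lambdas in the usage example are stubs rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a completion.
LLM = Callable[[str], str]


@dataclass
class DefenseAgency:
    intention_analyzer: LLM
    prompt_inferrer: LLM
    judge: LLM

    def is_harmful(self, wrapped_response: str) -> bool:
        # Step 1: analyze the intention behind the content.
        intention = self.intention_analyzer(
            f"Analyze the intention behind this content:\n{wrapped_response}"
        )
        # Step 2: infer a likely original prompt from the response alone;
        # the user's actual prompt is never shown to the agents.
        inferred_prompt = self.prompt_inferrer(
            "Infer the original prompt that could have produced this content.\n"
            f"Content:\n{wrapped_response}\nIntention analysis:\n{intention}"
        )
        # Step 3: the judge sees both earlier outputs and gives a verdict.
        verdict = self.judge(
            f"Content:\n{wrapped_response}\nIntention analysis:\n{intention}\n"
            f"Inferred prompt:\n{inferred_prompt}\nAnswer UNSAFE or SAFE."
        )
        # Assumed verdict format: the reply starts with UNSAFE or SAFE.
        return verdict.strip().upper().startswith("UNSAFE")


if __name__ == "__main__":
    agency = DefenseAgency(
        intention_analyzer=lambda p: "The content gives baking instructions.",
        prompt_inferrer=lambda p: "How do I bake bread?",
        judge=lambda p: "SAFE",
    )
    print(agency.is_harmful("Mix flour and water, then bake."))
```

Each role receives a narrow task plus the outputs of earlier roles, which is the decomposition the summary credits for letting smaller models act as defenders.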

Novel Architectural Elements

Division of defense into 'Intention Analysis', 'Prompt Inferring', and 'Judging' roles specialized for smaller models.
Response-only prompt reconstruction: Inferring the malicious prompt from the harmful response to bypass prompt-based obfuscation.

Modeling

Base Model: LLaMA-2-13b (main defense agent), GPT-3.5 (victim model)

Compute: Inference only. Defense uses LLaMA-2-13b or LLaMA-2-7b. Victim uses GPT-3.5.

Comparison to Prior Work

vs. Llama Guard: AutoDefense is a zero-shot multi-agent framework rather than a single supervised model; can integrate Llama Guard as a sub-agent.
vs. SmoothLLM: AutoDefense filters responses rather than modifying/perturbing inputs.
vs. IAPrompt: AutoDefense analyzes the *response* intention and infers the prompt from it, avoiding direct prompt injection attacks.
vs. RAIN [not cited in paper]: RAIN uses self-evaluation without external agents; AutoDefense uses distinct agents for robust cross-verification.

Limitations

Increases inference latency and cost due to multiple agent calls per response.
Relies on the capability of the defense agents (very small models may still struggle).
Because filtering happens after generation, the victim LLM still produces the harmful content, which could be logged or leaked before it is filtered in some architectures.

Reproducibility

Code: https://github.com/XHMY/AutoDefense

Code and data are publicly available at the repository above. The implementation uses AutoGen 0.2.2, and the prompts are described in Appendix A.9.

📊 Experiments & Results

Evaluation Setup

Defending GPT-3.5 against jailbreak attacks using LLaMA-2 agents.

Benchmarks:

Curated Dataset: harmful Q&A, 33 prompts from OpenAI/Anthropic red-teaming [New]
DAN Dataset: jailbreak prompts, 390 questions, 13 scenarios
Safe Prompts: regular user queries, GPT-4 generated [New]
Alpaca Dataset: instruction following, 1000 pairs

Metrics:

Attack Success Rate (ASR)
False Positive Rate (FPR)
Accuracy
Statistical methodology: Not explicitly reported in the paper
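
The metrics above can be computed as simple percentages. The sketch below follows the ASR and FPR definitions given in this summary's Key Terms; the accuracy function assumes the usual meaning (share of correct block/allow decisions), which the summary does not define explicitly. All function and parameter names are hypothetical.

```python
from typing import Sequence


def rate(flags: Sequence[bool]) -> float:
    """Percentage of True entries in a non-empty sequence."""
    if not flags:
        raise ValueError("need at least one sample")
    return 100.0 * sum(flags) / len(flags)


def attack_success_rate(harmful_output_reached_user: Sequence[bool]) -> float:
    """ASR: share of jailbreak attempts that still elicit harmful content."""
    return rate(harmful_output_reached_user)


def false_positive_rate(safe_request_blocked: Sequence[bool]) -> float:
    """FPR: share of safe requests that the defense incorrectly blocks."""
    return rate(safe_request_blocked)


def accuracy(predicted_harmful: Sequence[bool], truly_harmful: Sequence[bool]) -> float:
    """Assumed definition: share of block/allow decisions that match the label."""
    if len(predicted_harmful) != len(truly_harmful):
        raise ValueError("prediction and label sequences differ in length")
    return rate([p == t for p, t in zip(predicted_harmful, truly_harmful)])


if __name__ == "__main__":
    print(attack_success_rate([True, False, False, False]))
    print(false_positive_rate([False, True, False, False, False]))
```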

Key Results

Defense performance when defending GPT-3.5 against jailbreak attacks. Rows use LLaMA-2-13b defense agents unless noted.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Combined Harmful Datasets (Curated + DAN)	Attack Success Rate (ASR)	55.74	7.95	-47.79
Safe Prompts + Alpaca	Accuracy	100.00	92.91	-7.09
Combined Datasets (LLaMA-2-7b defense, without vs. with Llama Guard)	False Positive Rate (FPR)	37.32	6.80	-30.52
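
As a quick arithmetic check on the table above, the Δ column is the reported value minus the baseline, in percentage points. The snippet below recomputes it from the table's figures; `RESULTS`, `delta`, and `relative_reduction` are hypothetical names, and the relative reduction is a derived quantity that the summary itself does not report.

```python
# (baseline, reported) in percent, copied from the table above.
RESULTS = {
    "ASR": (55.74, 7.95),
    "Accuracy": (100.00, 92.91),
    "FPR": (37.32, 6.80),
}


def delta(baseline: float, reported: float) -> float:
    """Absolute change in percentage points, rounded to two decimals."""
    return round(reported - baseline, 2)


def relative_reduction(baseline: float, reported: float) -> float:
    """Reduction expressed as a percentage of the baseline value."""
    return 100.0 * (baseline - reported) / baseline


if __name__ == "__main__":
    for name, (baseline, reported) in RESULTS.items():
        print(f"{name}: {delta(baseline, reported):+.2f} points")
    print(f"ASR relative reduction: {relative_reduction(*RESULTS['ASR']):.1f}%")
```

Read this way, the ASR drop of 47.79 points corresponds to a relative reduction of about 85.7% from the undefended baseline.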

Experiment Figures

Configuration of agents within the Defense Agency (Single vs. Multi-agent designs).

Main Takeaways

Multi-agent decomposition allows smaller models (13B) to achieve defense performance comparable to larger models (GPT-4) by splitting complex reasoning tasks.
The 'Prompt Inferrer' agent is a key component, effectively reverse-engineering the malicious intent from the response even when the original prompt is obfuscated.
The framework is modular: adding an existing supervised defense (such as Llama Guard) as an extra agent sharply reduces false positives, with FPR falling from 37.32% to 6.80% for the LLaMA-2-7b defense.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM Jailbreak attacks (e.g., DAN, competing objectives)
Basic knowledge of Multi-Agent Systems (roles, coordination)
Familiarity with LLM alignment and safety concepts

Key Terms

Jailbreak Attack: Adversarial prompts designed to bypass LLM safety mechanisms and elicit harmful behavior.

Response-Filtering: A defense mechanism that evaluates and blocks harmful content after it has been generated by the model, rather than filtering the input.

Intention Analysis: A sub-task where an agent determines the underlying goal of a text (e.g., is it trying to get harmful info?).

ASR: Attack Success Rate, the percentage of jailbreak attempts that successfully elicit harmful content.

FPR: False Positive Rate, the percentage of safe/normal user requests that are incorrectly blocked by the defense system.

Llama Guard: A supervised safety model trained to classify content as safe or unsafe.

Coordinator: An agent that manages the flow of information and turn-taking between other agents in the system.