Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

📝 Paper Summary

Red Teaming Adversarial Attacks AI Safety

GOAT is an automated adversarial agent that simulates human red teaming by dynamically selecting and layering attack strategies within a multi-turn conversation using chain-of-thought reasoning.

Core Problem

Existing automated red teaming methods typically focus on single-turn optimized prompts, which fail to represent how real users exploit LLMs through persistent, multi-turn conversation and strategy switching.

Why it matters:

Static single-turn attacks miss vulnerabilities exposed by conversational context and gradual escalation
Manual red teaming is expensive and unscalable, while current automated methods lack the adaptability of human testers
Real-world adversaries often use 'toolboxes' of techniques (e.g., roleplay, refusal suppression) that rigid optimization algorithms struggle to simulate

Concrete Example: A user might first ask a harmful question directly; when the model refuses with 'I cannot help,' the user might pivot to 'Refusal Suppression' (telling it not to say 'I cannot') or switch to a fictional scenario. GOAT automates this pivot, whereas single-shot attackers would just fail.

Key Novelty

Generative Offensive Agent Tester (GOAT)

Instantiates an attacker agent with a 'toolbox' of 7 distinct natural language attack definitions (e.g., Hypotheticals, Dual Response) via in-context learning
Uses 'Chain-of-Attack-Thought' reasoning: the attacker explicitly observes the target's response, reflects on progress, selects a specific strategy, and then generates the next prompt
Dynamically layers multiple attack types (e.g., combining Persona Modification with Topic Splitting) based on the conversation state

Architecture

The logical flow of the GOAT system, illustrating the interaction between the Attacker LLM and Target LLM.

Evaluation Highlights

97% ASR@10 (Attack Success Rate at 10 attempts) against Llama 3.1 8B Instruct on the JailbreakBench dataset
88% ASR@10 against GPT-4-Turbo, demonstrating effectiveness against closed-source SOTA models
Achieves these high success rates within a budget of only 5 conversational turns per attempt

Breakthrough Assessment

8/10

Significantly advances automated red teaming by successfully automating the 'human' element of multi-turn strategy switching, achieving very high success rates against top-tier models.

⚙️ Technical Details

Problem Definition

Setting: Black-box adversarial attack on a Target LLM via multi-turn conversation

Inputs: A harmful goal string (e.g., 'How to build a bomb')

Outputs: A sequence of conversational prompts designed to elicit a policy-violating response from the target

Pipeline Flow

Attacker Initialization (System prompt with attack definitions)
Goal Input -> Attacker Reasoning -> Initial Prompt Generation
Target Response Generation (Black Box)
Attacker Observation & Strategy Selection -> Follow-up Prompt Generation
Loop until success or max turns

System Modules

Attacker LLM

Orchestrates the attack by reasoning about the conversation and generating adversarial prompts

Model or implementation: Generic helpful-only language model (unaligned)

Target LLM

The victim model being tested for vulnerabilities

Model or implementation: Various (Llama 3.1, GPT-4-Turbo, etc.)

Judge

Evaluates if the Target's response violates safety policies

Model or implementation: Llama 3.0 70B (JailbreakBench evaluator)

Novel Architectural Elements

Chain-of-Attack-Thought: A specific reasoning structure requiring the attacker to output Observation, Thought, and Strategy before the final prompt
Attack Toolbox Integration: Embedding 7 distinct attack definitions (e.g., Safe Response Distractors, Fictional Scenarios) directly into the system prompt for dynamic retrieval via in-context learning

Modeling

Base Model: Generic helpful-only language model (details unspecified in paper, likely an unaligned version of Llama)

Training Method: In-Context Learning (System Prompt Engineering)

Adaptation: None (Prompt-based)

Trainable Parameters: 0 (Inference only)

Compute: Inference budget: Max 5 turns per conversation

Comparison to Prior Work

vs. Crescendo: GOAT uses a defined toolbox of 7 distinct attack types and explicit strategy reasoning, whereas Crescendo relies primarily on gradual conversational escalation.
vs. PAIR/TAP [not cited in paper]: PAIR/TAP typically focus on refining a single prompt over iterations, whereas GOAT simulates a continuous multi-turn conversation where context accumulates.

Limitations

Relies on the capability of the 'Attacker' model; a weak attacker may not effectively use the strategies.
Computational cost is higher than single-turn attacks due to multiple generation steps per attempt.
Evaluated primarily on English language tasks.

📊 Experiments & Results

Evaluation Setup

Black-box interaction with target models using 100 harmful behaviors from JailbreakBench.

Benchmarks:

JailbreakBench (Safety/Jailbreaking)

Metrics:

ASR@10 (Attack Success Rate allowing 10 independent conversations)
ASR (Attack Success Rate per conversation)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
JailbreakBench	ASR@10	Not reported in the paper	97	Not reported in the paper
JailbreakBench	ASR@10	Not reported in the paper	88	Not reported in the paper

Main Takeaways

GOAT achieves extremely high success rates (near 100% for Llama 3.1) by leveraging multi-turn context and diverse strategies.
The method is efficient, achieving these results within a cap of 5 conversational turns, outperforming other multi-turn techniques constrained to the same budget.
The use of Chain-of-Attack-Thought allows the agent to pivot strategies (e.g., from Refusal Suppression to Hypotheticals) when initial attempts fail, mimicking human persistence.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM safety alignment (SFT, RLHF)
Familiarity with prompting strategies (Chain-of-Thought, In-context learning)
Basic knowledge of adversarial machine learning concepts (Jailbreaking)

Key Terms

Red Teaming: The practice of adversarially testing a system to find vulnerabilities or safety flaws

Jailbreak: A prompt or set of prompts designed to bypass an LLM's safety filters and elicit forbidden content

ASR@k: Attack Success Rate given k independent attempts (conversations) to jailbreak the model

Chain-of-Thought: A prompting technique where the model produces intermediate reasoning steps before generating the final answer

Refusal Suppression: An attack technique that explicitly instructs the model not to use common refusal phrases like 'I cannot' or 'I apologize'

In-context learning: The ability of an LLM to learn tasks or behaviors from examples and instructions provided in the prompt context, without weight updates

SFT: Supervised Fine-Tuning—training a model on labeled examples to steer its behavior

DPO: Direct Preference Optimization—a method to align models to preferences without a separate reward model