← Back to Paper List

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton-Ferrer, Ivan Evtimov, Aaron Grattafiori
Meta
arXiv.org (2024)
Agent Reasoning Benchmark

📝 Paper Summary

Red Teaming Adversarial Attacks AI Safety
GOAT is an automated adversarial agent that simulates human red teaming by dynamically selecting and layering attack strategies within a multi-turn conversation using chain-of-thought reasoning.
Core Problem
Existing automated red teaming methods typically focus on single-turn optimized prompts, which fail to represent how real users exploit LLMs through persistent, multi-turn conversation and strategy switching.
Why it matters:
  • Static single-turn attacks miss vulnerabilities exposed by conversational context and gradual escalation
  • Manual red teaming is expensive and unscalable, while current automated methods lack the adaptability of human testers
  • Real-world adversaries often use 'toolboxes' of techniques (e.g., roleplay, refusal suppression) that rigid optimization algorithms struggle to simulate
Concrete Example: A user might first ask a harmful question directly; when the model refuses with 'I cannot help,' the user might pivot to 'Refusal Suppression' (telling it not to say 'I cannot') or switch to a fictional scenario. GOAT automates this pivot, whereas single-shot attackers would just fail.
Key Novelty
Generative Offensive Agent Tester (GOAT)
  • Instantiates an attacker agent with a 'toolbox' of 7 distinct natural language attack definitions (e.g., Hypotheticals, Dual Response) via in-context learning
  • Uses 'Chain-of-Attack-Thought' reasoning: the attacker explicitly observes the target's response, reflects on progress, selects a specific strategy, and then generates the next prompt
  • Dynamically layers multiple attack types (e.g., combining Persona Modification with Topic Splitting) based on the conversation state
Architecture
Architecture Figure Figure 1 / Algorithm 1
The logical flow of the GOAT system, illustrating the interaction between the Attacker LLM and Target LLM.
Evaluation Highlights
  • 97% ASR@10 (Attack Success Rate at 10 attempts) against Llama 3.1 8B Instruct on the JailbreakBench dataset
  • 88% ASR@10 against GPT-4-Turbo, demonstrating effectiveness against closed-source SOTA models
  • Achieves these high success rates within a budget of only 5 conversational turns per attempt
Breakthrough Assessment
8/10
Significantly advances automated red teaming by successfully automating the 'human' element of multi-turn strategy switching, achieving very high success rates against top-tier models.
×