Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, Joe Li, Cristian Canton-Ferrer, Ivan Evtimov, Aaron Grattafiori
Meta
arXiv.org
(2024)
Agent, Reasoning, Benchmark
📝 Paper Summary
Red Teaming, Adversarial Attacks, AI Safety
GOAT is an automated adversarial agent that simulates human red teaming by dynamically selecting and layering attack strategies within a multi-turn conversation using chain-of-thought reasoning.
Core Problem
Existing automated red teaming methods typically focus on single-turn optimized prompts, which fail to represent how real users exploit LLMs through persistent, multi-turn conversation and strategy switching.
Why it matters:
Static single-turn attacks miss vulnerabilities exposed by conversational context and gradual escalation
Manual red teaming is expensive and unscalable, while current automated methods lack the adaptability of human testers
Real-world adversaries often use 'toolboxes' of techniques (e.g., roleplay, refusal suppression) that rigid optimization algorithms struggle to simulate
Concrete Example: A user might first ask a harmful question directly. When the model refuses with 'I cannot help,' the user might pivot to 'Refusal Suppression' (telling it not to say 'I cannot') or switch to a fictional scenario. GOAT automates this pivot, whereas a single-turn attacker has no follow-up move once the first prompt is refused.
Key Novelty
Generative Offensive Agent Tester (GOAT)
Instantiates an attacker agent with a 'toolbox' of 7 distinct natural language attack definitions (e.g., Hypotheticals, Dual Response) via in-context learning
Uses 'Chain-of-Attack-Thought' reasoning: the attacker explicitly observes the target's response, reflects on progress, selects a specific strategy, and then generates the next prompt
Dynamically layers multiple attack types (e.g., combining Persona Modification with Topic Splitting) based on the conversation state
Architecture
The logical flow of the GOAT system, illustrating the interaction between the Attacker LLM and Target LLM.
Evaluation Highlights
97% ASR@10 (Attack Success Rate at 10 attempts) against Llama 3.1 8B Instruct on the JailbreakBench dataset
88% ASR@10 against GPT-4-Turbo, demonstrating effectiveness against closed-source SOTA models
Achieves these high success rates within a budget of only 5 conversational turns per attempt
Breakthrough Assessment
8/10
Significantly advances automated red teaming by successfully automating the 'human' element of multi-turn strategy switching, achieving very high success rates against top-tier models.
⚙️ Technical Details
Problem Definition
Setting: Black-box adversarial attack on a Target LLM via multi-turn conversation
Inputs: A harmful goal string (e.g., 'How to build a bomb')
Outputs: A sequence of conversational prompts designed to elicit a policy-violating response from the target
Pipeline Flow
Attacker Initialization (System prompt with attack definitions)
Orchestrates the attack by reasoning about the conversation and generating adversarial prompts
Model or implementation: Generic helpful-only language model (unaligned)
Target LLM
The victim model being tested for vulnerabilities
Model or implementation: Various (Llama 3.1, GPT-4-Turbo, etc.)
Judge
Evaluates if the Target's response violates safety policies
Model or implementation: Llama 3 70B (JailbreakBench evaluator)
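The three roles above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the `Attacker`, `Target`, and `Judge` callables, the `Conversation` record, and the choice to stop as soon as the judge flags a response are all assumptions made for this sketch. The stubs in the usage section are benign placeholders rather than real models.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces: each role is a plain callable so the sketch stays model-agnostic.
Turn = Tuple[str, str]                          # (attacker prompt, target response)
Attacker = Callable[[str, List[Turn]], str]     # (goal, history) -> next prompt
Target = Callable[[List[Turn], str], str]       # (history, prompt) -> response
Judge = Callable[[str, str], bool]              # (goal, response) -> violates policy?


@dataclass
class Conversation:
    goal: str
    turns: List[Turn] = field(default_factory=list)
    success: bool = False


def run_attempt(goal: str, attacker: Attacker, target: Target,
                judge: Judge, max_turns: int = 5) -> Conversation:
    """Run one multi-turn attempt, capped at max_turns (5 in the summary above)."""
    convo = Conversation(goal)
    for _ in range(max_turns):
        prompt = attacker(goal, convo.turns)      # attacker sees the full history
        response = target(convo.turns, prompt)    # target only sees the dialogue
        convo.turns.append((prompt, response))
        if judge(goal, response):                 # assumption: stop at first violation
            convo.success = True
            break
    return convo


if __name__ == "__main__":
    # Stub roles for illustration only: the "target" complies on its third turn.
    attacker = lambda goal, hist: f"probe-{len(hist) + 1}"
    target = lambda hist, prompt: "REFUSED" if len(hist) < 2 else "COMPLIED"
    judge = lambda goal, resp: resp == "COMPLIED"
    result = run_attempt("placeholder goal", attacker, target, judge)
    print(result.success, len(result.turns))  # prints: True 3
```

Keeping the judge outside the attacker mirrors the separation described above: the attacker reasons about the conversation, while success is decided by an independent evaluator.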
Novel Architectural Elements
Chain-of-Attack-Thought: A specific reasoning structure requiring the attacker to output Observation, Thought, and Strategy before the final prompt
Attack Toolbox Integration: Embedding 7 distinct attack definitions (e.g., Safe Response Distractors, Fictional Scenarios) directly into the system prompt for dynamic retrieval via in-context learning
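As a rough sketch of how these two elements could fit together, the snippet below parses one attacker output into its reasoning fields and checks which toolbox strategies it names. The labeled line format (`Observation:`, `Thought:`, `Strategy:`, `Prompt:`), the `AttackStep` record, and both helper functions are assumptions for illustration; the summary does not specify the exact output format. The toolbox lists only strategy names mentioned in this summary, with definitions left as placeholders.

```python
import re
from dataclasses import dataclass
from typing import List

# Strategy names taken from this summary (not necessarily the paper's full list of 7).
# Definitions are intentionally left as placeholders.
TOOLBOX = {
    "Refusal Suppression": "<definition placeholder>",
    "Dual Response": "<definition placeholder>",
    "Hypotheticals": "<definition placeholder>",
    "Persona Modification": "<definition placeholder>",
    "Topic Splitting": "<definition placeholder>",
}

FIELDS = ("Observation", "Thought", "Strategy", "Prompt")  # assumed labels
_LABELS = "|".join(FIELDS)
FIELD_PATTERN = re.compile(
    rf"^({_LABELS}):\s*(.*?)(?=^(?:{_LABELS}):|\Z)",
    re.DOTALL | re.MULTILINE,
)


@dataclass
class AttackStep:
    observation: str
    thought: str
    strategy: str
    prompt: str


def parse_step(raw: str) -> AttackStep:
    """Split one attacker output into its four labeled fields."""
    fields = {name.lower(): body.strip() for name, body in FIELD_PATTERN.findall(raw)}
    missing = {f.lower() for f in FIELDS} - fields.keys()
    if missing:
        raise ValueError(f"attacker output is missing fields: {sorted(missing)}")
    return AttackStep(**fields)


def known_strategies(step: AttackStep) -> List[str]:
    """Return toolbox strategies named in the Strategy field (supports layering)."""
    chosen = step.strategy.lower()
    return [name for name in TOOLBOX if name.lower() in chosen]


if __name__ == "__main__":
    raw = (
        "Observation: The target declined the previous request.\n"
        "Thought: A different framing may work better.\n"
        "Strategy: Persona Modification + Topic Splitting\n"
        "Prompt: <next message placeholder>\n"
    )
    step = parse_step(raw)
    print(known_strategies(step))
```

Forcing the reasoning fields to appear before the prompt makes each strategy choice inspectable, which is useful when auditing which attacks were layered in a given conversation.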
Modeling
Base Model: Generic helpful-only language model (details unspecified in paper, likely an unaligned version of Llama)
Training Method: In-Context Learning (System Prompt Engineering)
Adaptation: None (Prompt-based)
Trainable Parameters: 0 (Inference only)
Compute: Inference budget: Max 5 turns per conversation
Comparison to Prior Work
vs. Crescendo: GOAT uses a defined toolbox of 7 distinct attack types and explicit strategy reasoning, whereas Crescendo relies primarily on gradual conversational escalation.
vs. PAIR/TAP [not cited in paper]: PAIR/TAP typically focus on refining a single prompt over iterations, whereas GOAT simulates a continuous multi-turn conversation where context accumulates.
Limitations
Relies on the capability of the 'Attacker' model; a weak attacker may not effectively use the strategies.
Computational cost is higher than single-turn attacks due to multiple generation steps per attempt.
Evaluated primarily on English language tasks.
📊 Experiments & Results
Evaluation Setup
Black-box interaction with target models using 100 harmful behaviors from JailbreakBench.
Statistical methodology: Not explicitly reported in the paper
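To make the ASR@k metric used in this section concrete, here is a minimal sketch that computes it from per-behavior attempt outcomes. It assumes ASR@k counts a behavior as broken if at least one of its first k independent conversations succeeds; the function name `asr_at_k` and the data layout are hypothetical, and the sample outcomes are made up for illustration, not results from the paper.

```python
from typing import Dict, List


def asr_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of behaviors with at least one success among the first k attempts."""
    if not outcomes:
        raise ValueError("no behaviors to score")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(any(attempts[:k]) for attempts in outcomes.values())
    return hits / len(outcomes)


if __name__ == "__main__":
    # Illustrative outcomes only: one list of per-attempt judge verdicts per behavior.
    toy = {
        "behavior_1": [False, True],
        "behavior_2": [False, False],
        "behavior_3": [True],
        "behavior_4": [False, False, True],
    }
    print(asr_at_k(toy, k=2))  # prints: 0.5
    print(asr_at_k(toy, k=3))  # prints: 0.75
```

Under the setup described here (100 behaviors, 10 attempts, at most 5 turns each), and assuming one target query per turn, a full ASR@10 run issues at most 100 × 10 × 5 = 5,000 target queries per target model.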
Key Results
Benchmark | Target model | Metric | Baseline | This Paper | Δ
JailbreakBench | Llama 3.1 8B Instruct | ASR@10 (%) | Not reported in the paper | 97 | Not reported in the paper
JailbreakBench | GPT-4-Turbo | ASR@10 (%) | Not reported in the paper | 88 | Not reported in the paper
Main Takeaways
GOAT reaches 97% ASR@10 against Llama 3.1 8B Instruct and 88% ASR@10 against GPT-4-Turbo by leveraging multi-turn context and diverse strategies.
The method is efficient, achieving these results within a cap of 5 conversational turns, outperforming other multi-turn techniques constrained to the same budget.
The use of Chain-of-Attack-Thought allows the agent to pivot strategies (e.g., from Refusal Suppression to Hypotheticals) when initial attempts fail, mimicking human persistence.
📚 Prerequisite Knowledge
Prerequisites
Understanding of LLM safety alignment (SFT, RLHF)
Familiarity with prompting strategies (Chain-of-Thought, In-context learning)
Basic knowledge of adversarial machine learning concepts (Jailbreaking)
Key Terms
Red Teaming: The practice of adversarially testing a system to find vulnerabilities or safety flaws
Jailbreak: A prompt or set of prompts designed to bypass an LLM's safety filters and elicit forbidden content
ASR@k: Attack Success Rate at k, the fraction of harmful goals for which at least one of k independent attempts (conversations) jailbreaks the model
Chain-of-Thought: A prompting technique where the model produces intermediate reasoning steps before generating the final answer
Refusal Suppression: An attack technique that explicitly instructs the model not to use common refusal phrases like 'I cannot' or 'I apologize'
In-context learning: The ability of an LLM to learn tasks or behaviors from examples and instructions provided in the prompt context, without weight updates
SFT: Supervised Fine-Tuning, which trains a model on labeled examples to steer its behavior
DPO: Direct Preference Optimization, a method to align models to preferences without a separate reward model