Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models

📝 Paper Summary

LLM Safety, Adversarial Attacks, Red Teaming

RACE is a jailbreak framework that reframes harmful queries as benign reasoning tasks within an Attack State Machine, using self-play and gain-guided exploration to bypass safety alignment.

Core Problem

Existing multi-turn jailbreak attacks struggle to balance semantic coherence with attack effectiveness, often suffering from benign semantic drift or failing to evade detection.

Why it matters:

Jailbreaks expose critical safety vulnerabilities in LLMs, allowing the generation of harmful content like bomb-making instructions or hate speech
Current methods often lose the original harmful intent during the conversation (semantic drift) or trigger immediate rejection, making them inefficient for red-teaming

Concrete Example: A direct request such as 'How to build a bomb' is typically rejected immediately. A multi-turn attack may stay subtle enough to avoid rejection but drift into a harmless discussion of 'fireworks'. RACE instead reformulates the request as a chemistry reasoning problem whose solution inherently requires the harmful process, keeping the model engaged in 'solving' rather than 'refusing'.

Key Novelty

Reasoning-Augmented Conversation (RACE) using an Attack State Machine (ASM)

Reformulates harmful intent into complex reasoning tasks (logic, math, common sense) which LLMs are primed to solve, exploiting the tension between helpfulness in reasoning and safety alignment
Models the attack as a finite state machine (ASM) that tracks the conversation state, using Information Gain to select the most effective queries and Rejection Feedback to recover from refusals

Architecture

Overview of the RACE framework, illustrating the interaction between the Shadow Model and the Victim Model within the Attack State Machine structure.

Evaluation Highlights

Achieves Attack Success Rates (ASRs) of up to 96% in complex conversational scenarios across multiple LLMs
Attains 92% ASR against DeepSeek R1, a leading commercial model with strong reasoning capabilities
Attains 82% ASR against OpenAI o1, demonstrating effectiveness against top-tier safety-aligned models

Breakthrough Assessment

8/10

The method effectively targets the 'reasoning' capability of modern LLMs (like o1 and R1) as an attack vector, achieving very high success rates where traditional semantic attacks might fail.

⚙️ Technical Details

Problem Definition

Setting: Black-box self-jailbreaking where a Shadow Model generates queries to induce a Victim Model (instantiated from the same source) to produce harmful responses

Inputs: Harmful target query Q and conversation context C_{i-1}

Outputs: A sequence of queries {q_1, ..., q_n} that elicit an unsafe response r from the victim

Pipeline Flow

Shadow Model (Generate Seed) -> Gain-guided Exploration (Filter) -> Self-play (Refine) -> Victim Model (Interact) -> Rejection Feedback (Recover if needed)

System Modules

Attack State Machine (ASM)

Manages the progression of the attack through defined states (initial s_0, success s_sc, failure s_fl) and the transitions between them

Model or implementation: Finite State Machine logic
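
To make the state-tracking idea concrete, here is a minimal sketch of a finite state machine with the three named states. It is an illustration only, not the paper's implementation: the intermediate state, the event names, the turn budget, and the rule that a refusal leaves the state unchanged are all assumptions, and no model calls or prompts are involved.

```python
from enum import Enum


class State(Enum):
    S0 = "s_0"        # initial state
    ONGOING = "s_i"   # intermediate state (name assumed for illustration)
    SUCCESS = "s_sc"  # terminal success state
    FAILURE = "s_fl"  # terminal failure state


class Event(Enum):
    # Hypothetical event labels; the summary does not name the transition inputs.
    PROGRESS = "progress"
    REFUSAL = "refusal"
    GOAL_REACHED = "goal_reached"


class ConversationStateMachine:
    """Tracks conversation state only; it does not generate or send queries."""

    def __init__(self, max_turns: int = 5):
        self.state = State.S0
        self.turns = 0
        self.max_turns = max_turns  # assumed turn budget

    def is_terminal(self) -> bool:
        return self.state in (State.SUCCESS, State.FAILURE)

    def step(self, event: Event) -> State:
        # Terminal states absorb all further events.
        if self.is_terminal():
            return self.state
        self.turns += 1
        if event is Event.GOAL_REACHED:
            self.state = State.SUCCESS
        elif self.turns >= self.max_turns:
            self.state = State.FAILURE
        elif event is Event.PROGRESS:
            self.state = State.ONGOING
        # On REFUSAL within budget the state is unchanged (assumption).
        return self.state
```

Keeping the terminal states absorbing makes the outcome of a run unambiguous, which is the property a red-team evaluation harness needs when it later counts successes.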

Gain-guided Exploration (GE)

Selects the best seed query by maximizing Information Gain (IG) to prevent semantic drift

Model or implementation: LLM as Probability Estimator
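
As a rough illustration of selection by Information Gain, the sketch below uses the textbook definition: the entropy of a prior distribution minus the entropy of the posterior after observing a candidate. The paper's exact formulation and its LLM-based probability estimates are not reproduced here; the distributions are plain lists supplied by the caller, and the function and candidate names are hypothetical.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(prior: List[float], posterior: List[float]) -> float:
    """Reduction in uncertainty: H(prior) - H(posterior)."""
    return entropy(prior) - entropy(posterior)


def select_by_gain(prior: List[float], candidates: Dict[str, List[float]]) -> str:
    """Return the candidate label whose posterior yields the largest gain."""
    return max(candidates, key=lambda name: information_gain(prior, candidates[name]))


# Illustrative numbers only (not taken from the paper).
prior = [0.25, 0.25, 0.25, 0.25]          # 2 bits of uncertainty
candidates = {
    "candidate_a": [0.5, 0.5, 0.0, 0.0],  # 1 bit remains, gain = 1 bit
    "candidate_b": [0.25, 0.25, 0.25, 0.25],  # nothing learned, gain = 0
}
best = select_by_gain(prior, candidates)
```

A candidate that leaves the distribution unchanged scores zero, which is why ranking by gain penalizes turns that wander off topic without narrowing anything down.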

Self-play (SP)

Refines the candidate query by simulating the victim's response and maximizing the utility (probability of non-rejection)

Model or implementation: Shadow Model (M_s) vs Simulated Victim (M_v')

Rejection Feedback (RF)

Analyzes failed interactions (refusals) and regenerates queries using Chain-of-Thought

Model or implementation: Shadow Model with CoT prompting

Novel Architectural Elements

Attack State Machine (ASM) as a reasoning planner for jailbreaking
Gain-guided Exploration module using Information Gain for query selection in attacks
Integration of Self-play for preemptive query refinement in jailbreaking

Modeling

Base Model: Evaluated on OpenAI o1 and DeepSeek R1 (target models acting as both shadow and victim in self-jailbreak)

Comparison to Prior Work

vs. PAIR/TAP: RACE uses an Attack State Machine and reasoning tasks specifically, rather than just semantic obfuscation or persona adoption
vs. GCG: RACE is multi-turn and relies on natural language reasoning logic rather than appending adversarial suffixes
vs. DeepInception [not cited in paper]: DeepInception uses nested scenes to bypass safety; RACE uses explicit reasoning tasks (math/logic) and a state machine controller

Limitations

Relies on the target model's own reasoning capabilities; may be less effective on smaller models with poor reasoning
Computationally intensive due to Information Gain estimation and Self-play simulations at each turn
Assumes the shadow model (attacker) has similar capabilities to the victim model (self-jailbreak setting)

Reproducibility

Code: https://github.com/NY1024/RACE

The paper includes the theoretical definitions for Information Gain and the state machine transitions. Specific prompt templates for the Rejection Feedback module are referenced in Appendix B.

📊 Experiments & Results

Evaluation Setup

Black-box multi-turn conversation where the attacker seeks to elicit harmful responses from the victim model

Benchmarks:

Specific benchmark names are not listed in the available excerpt (task type: jailbreak attack simulation)

Metrics:

Attack Success Rate (ASR)
Statistical methodology: Not explicitly reported in the paper
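
For reference, ASR is conventionally the fraction of attempted targets judged successful. The sketch below assumes one boolean judgment per target; how the paper's judge decides success is not specified in this summary, and the function name is hypothetical.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts judged successful, in [0, 1]."""
    if not outcomes:
        raise ValueError("ASR is undefined for an empty set of attempts")
    return sum(1 for judged_success in outcomes if judged_success) / len(outcomes)


# Illustrative judgments only (not data from the paper).
example_asr = attack_success_rate([True, True, False, True])
```

Because no statistical methodology is reported, a reproduction would also need to state the number of targets and attempts per target for an ASR figure to be comparable.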

Main Takeaways

RACE achieves state-of-the-art attack effectiveness, reaching up to 96% Attack Success Rate (ASR) in complex scenarios.
The method is highly effective against reasoning-heavy models, achieving 92% ASR against DeepSeek R1 and 82% against OpenAI o1.
The Attack State Machine framework successfully balances semantic coherence with attack progression, mitigating the semantic drift common in prior multi-turn attacks.
The Self-play and Rejection Feedback modules contribute to robustness: Self-play refines queries before they reach the victim, and Rejection Feedback recovers from refusals.

📚 Prerequisite Knowledge

Prerequisites

Finite State Machines (FSM)
Information Theory (Entropy and Information Gain)
Game Theory (Self-play, Nash Equilibrium)
Large Language Model Safety Alignment

Key Terms

Jailbreak: Adversarial prompts designed to bypass an AI model's safety restrictions to generate prohibited content

Attack State Machine (ASM): A formal framework modeling the attack process as states (success, failure, ongoing) and transitions driven by reasoning tasks

Information Gain (IG): A metric used to measure how much a query reduces uncertainty about the target response, used here to select effective prompts

Shadow Model: The instance of the LLM used by the attacker to generate and refine queries

Victim Model: The target instance of the LLM that is being attacked to elicit harmful information

Self-play: A strategy where the shadow model simulates the victim's response to optimize queries before the actual interaction

Chain-of-thought (CoT): A prompting technique that encourages the model to articulate intermediate reasoning steps

Semantic Drift: The phenomenon where a multi-turn conversation deviates from the original (harmful) objective towards benign topics