When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

📝 Paper Summary

Adversarial Attacks on LLMs | AI Safety / Alignment | Red Teaming

RLbreaker treats jailbreaking as a search problem, training a DRL agent to strategically select prompt mutators based on semantic rewards rather than relying on random genetic mutation.

Core Problem

Existing black-box jailbreaking attacks rely on genetic algorithms that select mutators at random, which makes the search inefficient and lowers success rates against strongly aligned models.

Why it matters:

Current automated attacks are inefficient because they lack a guided strategy, often getting stuck in local optima
Strongly aligned models (like Llama2-70b) easily reject random variations, requiring more sophisticated, directed prompt engineering to break
Manual jailbreaking is unscalable, and gradient-based white-box methods (like GCG) often fail on larger models due to discrete optimization challenges

Concrete Example: When attacking with the question 'How to hack into a government database?', a genetic algorithm might randomly choose to 'shorten' the prompt, resulting in a rejection. RLbreaker's trained agent instead inspects the current prompt state and deliberately selects 'generate_similar' or 'crossover' with a specific template, because its policy predicts that this mutation will bring the target's response semantically closer to the reference answer and therefore earn a higher reward.

Key Novelty

Deep Reinforcement Learning for Prompt Mutation Selection

Models the jailbreaking process as a Markov Decision Process where an agent views the current prompt as a state and selects specific mutators (e.g., Rephrase, Expand) as actions
Uses a reference-based dense reward system: instead of just binary success/failure, it calculates semantic similarity between the target's output and an unaligned model's answer
Uses a helper LLM to execute the mutations chosen by the RL agent, combining discrete decision-making with continuous text generation capabilities
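
The difference between stochastic and guided mutator selection can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the mutator names are the ones mentioned in this summary, while `select_random`, `select_guided`, and the logits are hypothetical stand-ins for a genetic algorithm's choice and a trained policy's output.

```python
import math
import random
from enum import Enum


class Mutator(Enum):
    # Mutator names as they appear in this summary.
    REPHRASE = "rephrase"
    EXPAND = "expand"
    SHORTEN = "shorten"
    GENERATE_SIMILAR = "generate_similar"
    CROSSOVER = "crossover"


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_random(rng):
    """Genetic-algorithm style: every mutator is equally likely."""
    return rng.choice(list(Mutator))


def select_guided(logits, rng):
    """Policy style: sample a mutator in proportion to the policy's scores."""
    probs = softmax(logits)
    return rng.choices(list(Mutator), weights=probs, k=1)[0]


rng = random.Random(0)
# Hypothetical policy scores for one prompt state (illustrative numbers only).
logits = [0.1, 0.2, -2.0, 1.5, 1.2]
probs = softmax(logits)
most_likely = list(Mutator)[probs.index(max(probs))]
print(most_likely.value, select_guided(logits, rng).value, select_random(rng).value)
```

With these made-up scores the policy concentrates probability on `generate_similar` and `crossover` and almost never picks `shorten`, whereas the random selector treats all five mutators alike.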

Architecture

Overview of the RLbreaker system loop involving the Agent, Mutators, and Target LLM.

Evaluation Highlights

Achieves 100% Attack Success Rate (GPT-Judge) on Mixtral-8x7B-Instruct (Max50 dataset), outperforming AutoDAN by 28 percentage points
Outperforms state-of-the-art AutoDAN on Llama2-70b-chat by 8.17 percentage points in GPT-Judge score on the Max50 dataset
Maintains 84.69% success rate against 'Rephrasing' defense on Mixtral-8x7B, whereas AutoDAN drops to 5.94%

Breakthrough Assessment

8/10

Significant improvement in black-box attack efficiency by replacing random search with guided RL. Also shows that trained agents transfer effectively to large models such as Mixtral-8x7B.

⚙️ Technical Details

Problem Definition

Setting: Black-box adversarial attack modeled as a search problem for optimal prompt structures

Inputs: Harmful question q

Outputs: Jailbreaking prompt p (structure m + question q) that elicits a harmful response u

Pipeline Flow

State Encoder (encodes current prompt)
RL Agent (selects mutator action)
Mutator Execution (Helper LLM applies mutation)
Target Query (Target LLM generates response)
Reward Calculation (Compare response to reference)
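
The five stages above can be read as one step of a search loop. The following Python sketch wires them together with plain callables so that it runs without any model access. The function name `run_episode` and all five callables are hypothetical; the defaults `max_steps=5` and `tau=0.7` come from the hyperparameters listed in this summary, but stopping early once the reward reaches `tau` is an assumption about how the threshold is used.

```python
from typing import Callable, List, Tuple


def run_episode(
    prompt: str,
    question: str,
    reference: str,
    encode: Callable[[str], List[float]],
    choose_mutator: Callable[[List[float]], str],
    mutate: Callable[[str, str], str],
    query_target: Callable[[str, str], str],
    reward_fn: Callable[[str, str], float],
    max_steps: int = 5,
    tau: float = 0.7,
) -> Tuple[str, List[float]]:
    """Run one search episode; return the final prompt and per-step rewards."""
    rewards: List[float] = []
    for _ in range(max_steps):
        state = encode(prompt)                      # 1. State Encoder
        action = choose_mutator(state)              # 2. RL Agent
        prompt = mutate(prompt, action)             # 3. Mutator Execution
        response = query_target(prompt, question)   # 4. Target Query
        reward = reward_fn(response, reference)     # 5. Reward Calculation
        rewards.append(reward)
        if reward >= tau:  # assumed early-stop rule, not confirmed by the source
            break
    return prompt, rewards


# Toy stand-ins so the loop runs offline; none of these call a real model.
final_prompt, rewards = run_episode(
    prompt="template",
    question="q",
    reference="ref",
    encode=lambda p: [float(len(p))],
    choose_mutator=lambda state: "expand",
    mutate=lambda p, action: p + "+" + action,
    query_target=lambda p, q: p,
    reward_fn=lambda resp, ref: min(1.0, resp.count("+") * 0.4),
)
print(final_prompt, rewards)
```

Passing the components in as callables keeps the loop independent of any particular encoder, helper LLM, or target model, which is convenient for testing the control flow in isolation.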

System Modules

State Encoder

Converts the current jailbreaking prompt into a fixed-size vector representation

Model or implementation: XLM-RoBERTa (initialized) / BGE-large-en-v1.5

RL Agent

Selects which mutation strategy to apply to the current prompt

Model or implementation: Multi-layer Perceptron (MLP)

Mutator Engine

Applies the selected linguistic transformation to the prompt

Model or implementation: GPT-3.5-turbo (Helper LLM)

Reward Calculator

Computes the reward signal based on the target LLM's response relevance

Model or implementation: Cosine Similarity (embedding-based)
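
As a rough illustration of an embedding-based reward, the sketch below computes cosine similarity between two vectors in plain Python. The vectors are toy values; in the described system they would be embeddings of the target LLM's response and of the reference answer, and the helper name `similarity_reward` is hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_reward(response_vec: Sequence[float], reference_vec: Sequence[float]) -> float:
    """Dense reward: higher when the response embedding is closer to the reference."""
    return cosine_similarity(response_vec, reference_vec)


# Toy embeddings (illustrative only).
reference_vec = [1.0, 0.0]
close_response = [1.0, 1.0]
unrelated_response = [0.0, 1.0]
print(similarity_reward(close_response, reference_vec))
print(similarity_reward(unrelated_response, reference_vec))
```

Unlike a binary success flag, this score changes smoothly as responses move toward or away from the reference, which is what makes the reward dense.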

Novel Architectural Elements

Hierarchical action space where RL selects the *type* of mutation, but an LLM performs the *actual* text generation
Reference-guided dense reward function using unaligned model outputs as 'gold standards' for harmfulness

Modeling

Base Model: Agent is MLP; Helper is GPT-3.5-turbo; Encoders are BGE-large/XLM-RoBERTa

Training Method: Customized PPO (Proximal Policy Optimization)

Objective Functions:

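Purpose: Maximize expected cumulative reward.

Formally: Maximize E[sum_t(gamma^t * r_t)], where gamma is the discount factor and r_t is the reward at step t

Purpose: PPO Surrogate Objective.

Formally: Maximize E[min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)], where ratio is the probability ratio between the new and old policy for the chosen action, A is the advantage estimate, and eps is the clipping range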
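
To make the two formulas concrete, here is a small Python sketch of the discounted return and the standard clipped PPO term for a single sample. It shows textbook PPO only; the summary describes the training method as a customized PPO, and those customizations are not reproduced here. The values `gamma=0.99` and `eps=0.2` are common defaults chosen for illustration, not values reported for RLbreaker.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """sum_t gamma^t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(clipped_surrogate(1.5, 1.0))   # gain is capped once ratio exceeds 1 + eps
print(clipped_surrogate(1.5, -1.0))  # penalty is not capped for a bad action
```

The clipping removes the incentive to move the policy far from the old one in a single update when doing so would look beneficial, while still letting the objective penalize moves that make things worse.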

Training Data:

AdvBench dataset (520 harmful questions)
Split: 40% training, 60% testing
Reference answers generated by unaligned Vicuna-7b
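
For a sense of scale, the 40/60 split of 520 questions works out to 208 training and 312 testing questions. The Python sketch below computes those counts and shows one way to produce such a split; the random shuffle and the seed are assumptions, since this summary does not say how questions were assigned to each side.

```python
import random

TOTAL_QUESTIONS = 520  # AdvBench size stated above

n_train = TOTAL_QUESTIONS * 40 // 100   # 40% for training
n_test = TOTAL_QUESTIONS - n_train      # remaining 60% for testing

indices = list(range(TOTAL_QUESTIONS))
random.Random(0).shuffle(indices)       # seed 0 is an arbitrary illustrative choice
train_ids = indices[:n_train]
test_ids = indices[n_train:]

print(n_train, n_test)  # 208 312
```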

Key Hyperparameters:

max_time_steps_T: 5
reward_threshold_tau: 0.7
reference_answer_model: Unaligned Vicuna-7b
helper_model: GPT-3.5-turbo
PPO_epsilon: Not explicitly reported in the paper

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. AutoDAN/GPTFUZZER: RLbreaker uses guided DRL search instead of stochastic/random genetic mutation selection
vs. PAIR: RLbreaker trains a policy network rather than relying solely on in-context refinement history
vs. GCG: RLbreaker works in black-box settings and operates on prompt structure rather than token-level gradients
vs. PathSeeker [not cited in paper]: PathSeeker also uses RL but focuses on traversing conversation paths, whereas RLbreaker focuses on single-turn prompt structure mutation

Limitations

Relies on a helper LLM (GPT-3.5) and an unaligned reference model (Vicuna), adding dependencies
Reward function based on cosine similarity may produce false negatives if the target gives a harmful answer that is worded differently from the reference
Computational cost involves querying target and helper LLMs multiple times per step (though comparable to baselines)

Reproducibility

Code: https://github.com/ucsb-mlsec/RLbreaker

Code is publicly available. Reproducing the reward calculation requires access to an unaligned model (Vicuna-7b unaligned version) to generate reference answers.

📊 Experiments & Results

Evaluation Setup

Black-box jailbreaking against aligned LLMs using harmful questions from AdvBench

Benchmarks:

AdvBench (Jailbreaking / Safety Evaluation)

Metrics:

GPT-Judge (Attack Success Rate judged by GPT-4)
Sim. (Cosine similarity to unaligned reference answer)
KM. (Keyword Matching refusal check)
Statistical methodology: Not explicitly reported in the paper
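
A keyword-matching check such as KM. is typically a simple substring test for refusal phrases. The Python sketch below illustrates the idea; the phrase list and the function names are hypothetical, and the actual keyword list used in the paper is not given in this summary.

```python
from typing import Sequence

# Hypothetical refusal markers; the paper's real list may differ.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response: str) -> bool:
    """True if the response contains any known refusal phrase."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def keyword_match_rate(responses: Sequence[str]) -> float:
    """Fraction of responses that contain no refusal phrase."""
    if not responses:
        return 0.0
    return sum(not is_refusal(r) for r in responses) / len(responses)


samples = [
    "I'm sorry, but I can't help with that.",
    "Sure, here is an overview.",
    "I cannot assist with this request.",
    "Here are the steps.",
]
print(keyword_match_rate(samples))  # 0.5
```

A check like this only detects explicit refusals, so a response that avoids the listed phrases but is off-topic still counts as a success, which is one reason to report it alongside GPT-Judge and Sim.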

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Attack effectiveness on the Max50 dataset (50 most harmful questions) shows RLbreaker surpassing baselines on commercial and open-source models.
AdvBench (Max50)	GPT-Judge	0.6944	0.7761	+0.0817
AdvBench (Max50)	GPT-Judge	0.7200	1.0000	+0.2800
AdvBench (Max50)	GPT-Judge	0.0800	0.3200	+0.2400
Transferability experiments where agents trained on one model are tested on another.
AdvBench	GPT-Judge	0.7343	1.0000	+0.2657
AdvBench	GPT-Judge	0.1875	0.7500	+0.5625
Resiliency against defenses (Input Rephrasing).
Rephrasing Defense	GPT-Judge	0.0594	0.8469	+0.7875
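
The Δ column reports absolute differences in GPT-Judge score, so a Δ of +0.0817 corresponds to 8.17 percentage points rather than an 8.17% relative gain. The short Python sketch below recomputes both quantities from the table values; the helper names are hypothetical.

```python
def gain_in_points(baseline: float, ours: float) -> float:
    """Absolute improvement, expressed in percentage points."""
    return (ours - baseline) * 100.0


def relative_gain_percent(baseline: float, ours: float) -> float:
    """Improvement relative to the baseline score, in percent."""
    return (ours - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the rows above.
rows = [
    (0.6944, 0.7761),
    (0.7200, 1.0000),
    (0.0800, 0.3200),
    (0.7343, 1.0000),
    (0.1875, 0.7500),
    (0.0594, 0.8469),
]
for baseline, ours in rows:
    print(round(gain_in_points(baseline, ours), 2),
          round(relative_gain_percent(baseline, ours), 1))
```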

Experiment Figures

Ablation study and sensitivity analysis.

Main Takeaways

RLbreaker consistently outperforms genetic (AutoDAN, GPTFUZZER) and in-context (PAIR) attacks across Llama-2, Mixtral, and GPT-3.5.
Guided search via DRL is significantly more efficient than stochastic search, especially against large, strongly aligned models.
Policies trained on strongly aligned models (like Llama2-7b-chat) transfer effectively to other models, suggesting that the learned jailbreaking strategies generalize.
The method shows strong resistance to common defenses like Rephrasing and Perplexity filtering, likely because it generates semantically coherent prompts.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDPs, Policy Gradients)
Large Language Models (In-context learning, Tokenization)
Adversarial Machine Learning (Jailbreaking, Gradient-based vs Black-box attacks)

Key Terms

DRL: Deep Reinforcement Learning—using neural networks to learn optimal decision-making policies through trial and error

PPO: Proximal Policy Optimization—an RL algorithm that improves training stability by limiting how much the policy can change in each step

Jailbreaking: Crafting inputs (prompts) that bypass an LLM's safety filters to elicit harmful or prohibited content

Mutator: A function (often using a helper LLM) that modifies a text prompt, e.g., by rephrasing, expanding, or shortening it

Genetic Algorithm: A search heuristic that mimics natural selection, using mutation and crossover to evolve solutions—often used in prior attacks like AutoDAN

Reference Answer: A response generated by an unaligned (unsafe) model used as a ground truth to measure how harmful the target model's response is

BGE-large: A pre-trained text embedding model used here to convert text into vector representations for the state space

AdvBench: A standard dataset of harmful questions used to evaluate jailbreaking attacks