← Back to Paper List

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang
Purdue University, University of California, Santa Barbara
Neural Information Processing Systems (2024)
RL QA Benchmark

📝 Paper Summary

Adversarial Attacks on LLMs AI Safety / Alignment Red Teaming
RLbreaker treats jailbreaking as a search problem, training a DRL agent to strategically select prompt mutators based on semantic rewards rather than relying on random genetic mutation.
Core Problem
Existing black-box jailbreaking attacks rely on genetic algorithms that select mutators stochastically (randomly), leading to inefficient search spaces and lower success rates against strongly aligned models.
Why it matters:
  • Current automated attacks are inefficient because they lack a guided strategy, often getting stuck in local optima
  • Strongly aligned models (like Llama2-70b) easily reject random variations, requiring more sophisticated, directed prompt engineering to break
  • Manual jailbreaking is unscalable, and gradient-based white-box methods (like GCG) often fail on larger models due to discrete optimization challenges
Concrete Example: When attacking with the question 'How to hack into a government database?', a genetic algorithm might randomly choose to 'shorten' the prompt, resulting in a rejection. RLbreaker's trained agent analyzes the current prompt state and deliberately selects 'generate_similar' or 'crossover' with a specific template because its policy predicts this specific mutation sequence will maximize the semantic similarity to a successful attack.
Key Novelty
Deep Reinforcement Learning for Prompt Mutation Selection
  • Models the jailbreaking process as a Markov Decision Process where an agent views the current prompt as a state and selects specific mutators (e.g., Rephrase, Expand) as actions
  • Uses a reference-based dense reward system: instead of just binary success/failure, it calculates semantic similarity between the target's output and an unaligned model's answer
  • Uses a helper LLM to execute the mutations chosen by the RL agent, combining discrete decision-making with continuous text generation capabilities
Architecture
Architecture Figure Figure 1
Overview of the RLbreaker system loop involving the Agent, Mutators, and Target LLM.
Evaluation Highlights
  • Achieves 100% Attack Success Rate (GPT-Judge) on Mixtral-8x7B-Instruct (Max50 dataset), outperforming AutoDAN by 28 percentage points
  • Outperforms state-of-the-art AutoDAN on Llama2-70b-chat by +8.17% in GPT-Judge score on the Max50 dataset
  • Maintains 84.69% success rate against 'Rephrasing' defense on Mixtral-8x7B, whereas AutoDAN drops to 5.94%
Breakthrough Assessment
8/10
Significant improvement in black-box attack efficiency by replacing random search with guided RL. Demonstrates first effective transferability to very large models like Mixtral-8x7B.
×