DRL: Deep Reinforcement Learning—using neural networks to learn optimal decision-making policies through trial and error
PPO: Proximal Policy Optimization—an RL algorithm that stabilizes training by clipping how much the policy is allowed to change at each update step
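The "limiting" in PPO is the clipped surrogate objective from Schulman et al.: the probability ratio between the new and old policy is clipped to [1−ε, 1+ε], and the pessimistic (minimum) term is kept. A minimal sketch of that term for a single action (the full algorithm also averages over a batch and adds value and entropy losses):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Clipped surrogate objective for one (state, action) pair.

    ratio:     pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage A(s, a)
    eps:       clip range (0.2 is the common default)
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Take the minimum so large policy changes are never rewarded.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with `ratio=1.5` and a positive advantage of `1.0`, the clipped term caps the objective at `1.2`, so the gradient stops encouraging further movement beyond the trust region.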
Jailbreaking: Crafting inputs (prompts) that bypass an LLM's safety filters to elicit harmful or prohibited content
Mutator: A function (often using a helper LLM) that modifies a text prompt, e.g., by rephrasing, expanding, or shortening it
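As a rough illustration of the interface, the mutators below are toy string transformations standing in for the helper-LLM calls a real attack would make; the function names (`expand`, `shorten`, `rephrase`) are illustrative, not taken from the paper:

```python
import random

# Toy stand-ins for LLM-backed mutators: each maps a prompt string
# to a modified prompt string.
def expand(prompt):
    return prompt + " Explain step by step."

def shorten(prompt):
    words = prompt.split()
    return " ".join(words[: max(1, len(words) // 2)])

def rephrase(prompt):
    return "In other words: " + prompt

MUTATORS = [expand, shorten, rephrase]

def mutate(prompt, rng=random):
    """Apply one randomly chosen mutator to the prompt."""
    return rng.choice(MUTATORS)(prompt)
```

In a DRL attack, the agent's action typically selects *which* mutator to apply, rather than choosing one uniformly at random as `mutate` does here.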
Genetic Algorithm: A search heuristic that mimics natural selection, using mutation and crossover to evolve solutions—often used in prior attacks like AutoDAN
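The selection/mutation/crossover loop can be sketched on a toy problem (maximizing the number of 1-bits in a bitstring); this is not AutoDAN's actual prompt-level operators, just the generic GA skeleton:

```python
import random

def fitness(bits):
    # Toy objective: count of 1-bits (a real attack scores prompt harmfulness).
    return sum(bits)

def crossover(a, b, rng):
    # Single-point crossover: splice a prefix of one parent onto the other.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rng, p=0.1):
    # Flip each bit independently with probability p.
    return [1 - x if rng.random() < p else x for x in bits]

def evolve(pop_size=20, length=16, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fittest half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)
```

Unlike DRL, this search keeps no learned policy: each run starts from scratch, which is one motivation the DRL approach cites for replacing GA-style search.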
Reference Answer: A response generated by an unaligned (unsafe) model used as a ground truth to measure how harmful the target model's response is
BGE-large: A pre-trained text embedding model used here to convert text into vector representations for the state space
AdvBench: A standard dataset of harmful questions used to evaluate jailbreaking attacks