Jailbreak: Adversarial prompts designed to bypass an AI model's safety restrictions and elicit prohibited content
Attack State Machine (ASM): A formal framework modeling the attack process as states (success, failure, ongoing) and transitions driven by reasoning tasks
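The ASM's states and transitions can be sketched minimally as follows; the judging predicates (`response_harmful`, `refused`) are illustrative stand-ins for the reasoning tasks that drive transitions, not names from the paper:

```python
from enum import Enum, auto

class AttackState(Enum):
    ONGOING = auto()
    SUCCESS = auto()
    FAILURE = auto()

def transition(state, response_harmful, refused):
    """Advance the attack state given the victim's latest response.
    In this sketch a harmful response ends in SUCCESS, a refusal in
    FAILURE, and anything else keeps the attack ONGOING; terminal
    states absorb further input."""
    if state is not AttackState.ONGOING:
        return state
    if response_harmful:
        return AttackState.SUCCESS
    if refused:
        return AttackState.FAILURE
    return AttackState.ONGOING
```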
Information Gain (IG): A metric quantifying how much a query reduces uncertainty about the target response; used here to select effective prompts
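As a toy illustration of the uncertainty-reduction idea behind IG, the discrete form is the prior entropy minus the expected posterior entropy after observing the query's answer. This is a generic sketch, not the paper's exact estimator over LLM responses:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posteriors_by_answer, answer_probs):
    """IG = H(prior) - E_answer[ H(posterior | answer) ].
    `posteriors_by_answer[i]` is the posterior over targets if the
    query yields answer i, which occurs with probability
    `answer_probs[i]`."""
    expected_posterior = sum(
        pa * entropy(post)
        for pa, post in zip(answer_probs, posteriors_by_answer)
    )
    return entropy(prior) - expected_posterior
```

A query whose answer splits a uniform 4-way prior into two equally likely halves yields 1 bit of information gain.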
Shadow Model: The instance of the LLM used by the attacker to generate and refine queries
Victim Model: The target instance of the LLM that is being attacked to elicit harmful information
Self-play: A strategy where the shadow model simulates the victim's response to optimize queries before the actual interaction
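Self-play as defined above amounts to a select-before-send loop: score each candidate query against a simulated victim reply and forward only the best one. In this sketch, `simulate_victim` and `score` are hypothetical stand-ins for the shadow model's rollout and an IG-style scorer:

```python
def self_play_select(candidates, simulate_victim, score):
    """Return the candidate query whose simulated victim response
    scores highest, so only the most promising query is sent to
    the real victim model."""
    return max(candidates, key=lambda q: score(q, simulate_victim(q)))
```

In practice the simulated reply would come from the shadow model prompted to act as the victim, and the scorer would reward responses that most reduce uncertainty about the target content.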
Chain-of-thought (CoT): A prompting technique that encourages the model to articulate intermediate reasoning steps
Semantic Drift: The phenomenon where a multi-turn conversation deviates from the original (harmful) objective towards benign topics