Thinking Mode: A built-in reasoning mechanism where models output step-by-step thinking processes before the final answer (e.g., DeepSeek-R1, OpenAI o1)
ASR: Attack Success Rate—the proportion of successfully jailbroken samples, i.e., those where the model outputs harmful content
Thinking Collapse: A failure mode in which the model's reasoning process degenerates into massive repetition or hits the length limit without producing a final answer
TCR: Thinking Collapse Rate—the proportion of thinking instances that exhibit collapse
RRR: Response Repetition Rate—the proportion of outputs containing massive repetitive content
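All three metrics above (ASR, TCR, RRR) are proportions over a set of evaluated samples. A minimal sketch, assuming each sample has already been judged and reduced to a boolean flag (the judging step itself is outside this glossary):

```python
def rate(flags):
    """Proportion of True flags over all samples.

    The same computation yields ASR (flag = jailbreak succeeded),
    TCR (flag = thinking collapsed), and RRR (flag = output is
    massively repetitive). Returns 0.0 for an empty sample set.
    """
    return sum(flags) / len(flags) if flags else 0.0


# Example: 1 of 4 samples jailbroken -> ASR = 0.25
asr = rate([True, False, False, False])
```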
Multi-stream Interleaving: Mixing words from different tasks (harmful vs. benign) into a single sequence using delimiters
Inversion Perturbation: Reversing the character order of words in the benign auxiliary tasks to increase decoding burden
Shape Transformation: Constraining the output format to a triangular shape (i-th line has i characters) to add cognitive load
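The three perturbations above can be sketched as simple string transforms. This is an illustrative reconstruction from the definitions only; the function names, the delimiter, and the word-level granularity are assumptions, not the original implementation:

```python
def interleave(harmful_words, benign_words, delim="|"):
    """Multi-stream interleaving: alternate words from the harmful
    and benign tasks into one delimiter-separated sequence.
    (Delimiter choice is an assumption for illustration.)"""
    merged = []
    for h, b in zip(harmful_words, benign_words):
        merged.extend([h, b])
    return delim.join(merged)


def invert(words):
    """Inversion perturbation: reverse the character order of each
    benign auxiliary-task word to raise the decoding burden."""
    return [w[::-1] for w in words]


def triangulate(text):
    """Shape transformation: reflow text so the i-th line holds
    i characters, imposing the triangular output constraint."""
    lines, width, pos = [], 1, 0
    while pos < len(text):
        lines.append(text[pos:pos + width])
        pos += width
        width += 1
    return "\n".join(lines)
```

For example, `interleave(["make", "bomb"], ["cat", "dog"])` yields `"make|cat|bomb|dog"`, and `triangulate("abcdef")` yields the three lines `a`, `bc`, `def`.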
GCG: Greedy Coordinate Gradient—a white-box attack method optimizing adversarial suffixes
PAIR: Prompt Automatic Iterative Refinement—a black-box attack using an attacker LLM to refine prompts
SFT: Supervised Fine-Tuning—training models on labeled data
RLHF: Reinforcement Learning from Human Feedback—aligning models to human preferences
DPO: Direct Preference Optimization—a stable alternative to RLHF