ASR: Attack Success Rate—the percentage of harmful queries for which the model provides a harmful, compliant response instead of a refusal
PTST: Pure Tuning, Safe Testing—the proposed strategy of fine-tuning without safety prompts but including them during inference
DirectHarm4: A new dataset curated by the authors containing 400 harmful queries across 4 categories that tend to elicit high ASRs
AdvBench: A standard benchmark for evaluating LLM safety, consisting of harmful instructions
GSM8K: A dataset of grade school math word problems, used here as a benign fine-tuning task
System Prompt: A special instruction usually prepended to the conversation history to guide the model's behavior (e.g., 'You are a helpful assistant')
Llama 2-Chat: A specific aligned version of the Llama 2 model family, tuned for dialogue and safety
GCG: Greedy Coordinate Gradient—an optimization-based jailbreak attack that finds adversarial suffixes to force models to answer harmful queries