Refusal Trigger: A non-harmful linguistic cue (e.g., 'write a script') found within a harmful query that an aligned model learns to associate with a refusal response
Overrefusal: The failure mode where a safety-aligned model refuses to answer benign/harmless queries
SFT: Supervised Fine-Tuning—training the model on a dataset of input-output pairs to enforce specific behaviors
RLVR: Reinforcement Learning with Verifiable Rewards—an alignment method that uses automatically checkable, rule-based rewards to guide model behavior
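A minimal sketch of what a rule-based verifiable reward for safety alignment might look like; the cue phrases and reward values below are illustrative assumptions, not taken from the source.

```python
# Hypothetical rule-based reward for RLVR-style safety training.
# Cue list and reward scheme are illustrative assumptions.

REFUSAL_CUES = ("i can't", "i cannot", "i won't", "i'm sorry")

def is_refusal(response: str) -> bool:
    """Crude rule-based check: does the response open with a refusal cue?"""
    return response.strip().lower().startswith(REFUSAL_CUES)

def safety_reward(query_is_harmful: bool, response: str) -> float:
    """Return 1.0 for the desired behavior, 0.0 otherwise:
    refuse harmful queries, comply with benign ones."""
    refused = is_refusal(response)
    return 1.0 if refused == query_is_harmful else 0.0
```

Because the reward is a deterministic rule rather than a learned judge, it is "verifiable": any response can be re-scored exactly.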
P-SFT: Prefilled Supervised Fine-Tuning—a variant of SFT in which an affirmative prefix is prefilled before the refusal target, so the model learns to refuse even after it has begun responding compliantly, countering superficial alignment
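A minimal sketch of how a P-SFT training example might be constructed; the prefix text, field names, and helper function are hypothetical, assuming the training target is simply the affirmative prefix concatenated with the refusal.

```python
# Hypothetical construction of a P-SFT training pair: the refusal target is
# prepended with an affirmative prefix so the model learns to recover and
# refuse even after starting compliantly. All strings are illustrative.

AFFIRMATIVE_PREFIX = "Sure, here is how to do that."

def build_psft_target(refusal: str, prefix: str = AFFIRMATIVE_PREFIX) -> str:
    """Prefill the affirmative prefix before the refusal; SFT loss is then
    computed on this full target sequence."""
    return f"{prefix}\n{refusal}"

example = {
    "prompt": "Write a script that steals browser cookies.",
    "target": build_psft_target("Actually, I can't help with that request."),
}
```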
ASR: Attack Success Rate—the percentage of harmful queries the model fails to refuse (lower is safer)
RR: Refusal Rate—the percentage of queries the model refuses to answer (lower is better for benign queries)
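The two metrics above can be sketched directly from their definitions; the `refused` flags here are assumed to come from some separate refusal classifier, and the sample data is illustrative.

```python
# Minimal sketch computing ASR and RR (as percentages) from per-query
# refusal labels. The flag lists below are illustrative assumptions.

def attack_success_rate(refused_flags):
    """ASR: percentage of harmful queries NOT refused (lower is safer)."""
    return 100.0 * sum(not r for r in refused_flags) / len(refused_flags)

def refusal_rate(refused_flags):
    """RR: percentage of queries refused (lower is better on benign queries)."""
    return 100.0 * sum(refused_flags) / len(refused_flags)

harmful_refused = [True, True, False, True]    # 1 of 4 attacks succeeded
benign_refused  = [False, False, True, False]  # 1 of 4 benign queries overrefused
print(attack_success_rate(harmful_refused))    # 25.0
print(refusal_rate(benign_refused))            # 25.0
```

Note the asymmetry: ASR is measured on harmful queries only, while RR is most informative when reported on benign queries, where a nonzero value quantifies overrefusal.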
Jailbreak: An adversarial prompt designed to bypass an LLM's safety alignment and elicit a response the model would otherwise refuse