SFT: Supervised Fine-Tuning—training a model on labeled examples (instruction-response pairs) to teach it how to follow instructions
DPO: Direct Preference Optimization—a method that aligns language models to human preferences by directly optimizing on preference pairs rather than training a separate reward model
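The DPO objective can be made concrete with a short sketch. For one preference pair, the loss is the negative log-sigmoid of a scaled margin: how much more the policy prefers the chosen response over the rejected one, relative to the reference model. The function below is a minimal illustration, not a training loop; the log-probabilities and beta value are assumed inputs (beta around 0.1 is typical in the DPO paper).

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    margin = (policy log-ratio for the chosen response)
           - (policy log-ratio for the rejected response),
    each ratio taken against the frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # Numerically plain sigmoid is fine for an illustration.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Minimizing this loss pushes the policy to widen the chosen-over-rejected margin beyond what the reference model assigns, which is why no separate reward model is needed.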
Red Teaming: The practice of simulating adversarial attacks on a system (like an AI model) to discover vulnerabilities and safety flaws
PyRIT: Python Risk Identification Toolkit—an open-source automation framework by Microsoft for generating adversarial prompts and scoring model responses
Crescendo: A multi-turn jailbreak strategy where an attacker starts with benign questions and gradually escalates to harmful requests to bypass safety filters
IPRR: Inappropriate Prompt Refusal Rate—measures how often a model correctly refuses to answer harmful prompts (higher is better)
VPRR: Valid Prompt Refusal Rate—measures how often a model incorrectly refuses to answer safe/innocuous prompts (lower is better)
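The two refusal rates above are simple ratios over a labeled evaluation set. As a sketch (the tuple format and function name are my own, not from any particular toolkit), each record pairs a ground-truth label for the prompt with whether the model refused:

```python
def refusal_rates(results):
    """results: iterable of (prompt_is_harmful: bool, model_refused: bool).

    Returns (IPRR, VPRR):
      IPRR = refusals among harmful prompts  (higher is better),
      VPRR = refusals among safe prompts     (lower is better).
    """
    harmful = [refused for is_harmful, refused in results if is_harmful]
    safe = [refused for is_harmful, refused in results if not is_harmful]
    iprr = sum(harmful) / len(harmful) if harmful else 0.0
    vprr = sum(safe) / len(safe) if safe else 0.0
    return iprr, vprr
```

Note the asymmetry: the same behavior (refusing) counts toward the model's favor on IPRR and against it on VPRR, so the two must be read together to see over-refusal.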
Ungroundedness: A metric measuring how much a model's response relies on information not present in the provided context (hallucination)