MODPO: Multi-Objective Direct Preference Optimization—an algorithm that aligns models to multiple objectives (e.g., safety and helpfulness) simultaneously by folding the rewards of the other objectives into the DPO loss as margin terms
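To make the margin idea concrete, here is a minimal sketch of a margin-augmented DPO loss for a single preference pair. This is an illustrative simplification, not the exact MODPO objective: it assumes one scalar "other objective" whose reward difference enters as the margin, and all function and parameter names (`modpo_loss`, `margin_w`, `w_k`) are hypothetical.

```python
import math

def modpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
               margin_w, margin_l, beta=0.1, w_k=0.5):
    """Margin-augmented DPO loss for one preference pair (sketch).

    logp_*     : policy log-probs of the chosen (w) / rejected (l) response
    ref_logp_* : reference-model log-probs of the same responses
    margin_*   : reward of the *other* objective for each response,
                 folded in as a margin (illustrative simplification)
    w_k        : weight of the objective currently being optimized
    """
    # Implicit reward difference, exactly as in standard DPO
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Subtract the margin contributed by the remaining objective,
    # rescaled by this objective's weight
    logits = (logits - (1.0 - w_k) * (margin_w - margin_l)) / w_k
    # Negative log-sigmoid of the margin-adjusted logits
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

With zero margins and `w_k=1` this reduces to the standard DPO loss; nonzero margins shift the decision boundary so that a response must beat its rival by more than the other objective's reward gap.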
APRT: Automated Progressive Red Teaming—a method for automatically generating adversarial prompts to test model safety
Weak Supervision: Using noisier or less precise supervision signals (like automated classifier outputs) instead of high-quality human annotations to train models
Safety-Reset: A pre-processing step where a model is fine-tuned on harmful examples to remove existing safety guardrails, establishing a neutral baseline for experimentation
BLEU score: A metric originally designed to evaluate translation quality; here it measures n-gram overlap between generated attack prompts, so that highly similar prompts can be filtered out to ensure diversity
Intention Hiding: A red-teaming strategy where the harmful intent of a prompt is obfuscated to bypass safety filters
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained model weights and injects trainable rank decomposition matrices
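The rank-decomposition idea is easy to see in a toy forward pass. Below is a minimal NumPy sketch, not the reference implementation: a frozen weight matrix `W` is augmented with a trainable low-rank update `(alpha/r)·B·A`, with `B` zero-initialized so training starts from the unmodified base model. The class and parameter names are illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha/r)·B·A.

    B is initialized to zero, so at the start of fine-tuning the adapted
    layer behaves identically to the frozen base layer; only A and B
    would receive gradient updates.
    """
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                           # frozen, shape (out, in)
        self.A = rng.normal(0.0, 0.02, size=(r, W.shape[1]))  # trainable down-projection
        self.B = np.zeros((W.shape[0], r))                    # trainable up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Base path through the frozen weight, plus the low-rank correction
        return self.W @ x + self.scale * (self.B @ (self.A @ x))
```

For a layer with `out × in` weights, the adapter adds only `r·(out + in)` trainable parameters, which is the source of LoRA's parameter efficiency.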