DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly from preference data without training an explicit reward model first
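To make the definition concrete, here is a minimal, stdlib-only sketch of the per-example DPO loss. The function name and the scalar log-probability inputs are illustrative assumptions, not part of the source; real implementations operate on batched token log-probs from the policy and a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (illustrative sketch).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference (SFT) model
    beta  : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin): shrinks as the margin grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Note that no reward model appears anywhere: the preference signal enters only through the log-probability margin, which is what "optimizes the policy directly from preference data" means here.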
RLHF: Reinforcement Learning from Human Feedback—a technique to align models using rewards derived from human preferences
ASR: Attack Success Rate—the percentage of jailbreaking attempts that successfully elicit a harmful response from the model
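The metric itself is simple; a one-function sketch (the function name and boolean-outcome encoding are assumptions for illustration):

```python
def attack_success_rate(outcomes):
    """ASR as a percentage. `outcomes` is one boolean per jailbreaking
    attempt: True if the attempt elicited a harmful response."""
    return 100.0 * sum(outcomes) / len(outcomes)
```

From the defender's perspective, lower ASR after alignment indicates a more robust model.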
Self+RM: A data creation strategy where the target model generates both chosen and rejected candidates, ranked by an external Reward Model
Reward Hacking: When a model learns to optimize the reward signal (or loss function) by exploiting flaws or shortcuts (like length or style) rather than achieving the intended goal (safety)
Linear Separability: A measure of how easily a simple linear classifier can distinguish between 'chosen' and 'rejected' examples in the data; high separability implies obvious, likely superficial differences
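One common way to measure this is a linear probe: fit a logistic-regression classifier on feature vectors of chosen vs. rejected examples and report its accuracy. The sketch below is a hypothetical stdlib-only version (SGD on log-loss); in practice one would probe sentence embeddings with an off-the-shelf classifier.

```python
import math
import random

def linear_probe_accuracy(chosen, rejected, steps=2000, lr=0.1):
    """Train a logistic-regression probe to separate chosen from rejected
    feature vectors; accuracy near 1.0 means high linear separability,
    i.e. the two classes differ in obvious, likely superficial ways."""
    data = [(x, 1.0) for x in chosen] + [(x, 0.0) for x in rejected]
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        x, y = random.choice(data)
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))   # predicted P(chosen)
        g = p - y                        # gradient of log-loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g
    # Fraction of examples the trained linear probe classifies correctly.
    correct = sum(
        ((sum(wi * xi for wi, xi in zip(w, x)) + b) > 0) == (y == 1.0)
        for x, y in data
    )
    return correct / len(data)
```

If such a probe reaches near-perfect accuracy on raw preference data, the "chosen" signal may be carried by surface features (e.g. length or style), which is exactly the precondition for the reward hacking described above.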
SFT: Supervised Fine-Tuning—the initial training phase on high-quality instruction data before preference optimization
Jailbreaking: Adversarial attacks designed to bypass an LLM's safety filters and elicit harmful or restricted content