GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of sampled outputs to stabilize training without a separate value network
DPO: Direct Preference Optimization—a method that optimizes the policy directly on preference pairs, with no explicit reward model
SFT: Supervised Fine-Tuning—initial training on labeled examples to teach the model a specific format or behavior
ASR: Attack Success Rate—the percentage of malicious prompts for which the model generates a harmful response
FRR: False Refusal Rate—the percentage of benign/safe prompts the model incorrectly refuses to answer
Constitutional AI: An alignment method where models are trained using a set of high-level principles (a constitution) to guide behavior
Pareto optimal: A state where no metric can be improved without degrading another; here, maximizing safety without increasing false refusals
LLM-as-Judge: Using a large language model to evaluate and score the outputs of another model
GSM8K: A benchmark dataset of grade-school math word problems used to test reasoning capabilities
RLAIF: Reinforcement Learning from AI Feedback—using AI models rather than humans to generate preference labels or rewards
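To make the GRPO entry concrete, here is a minimal sketch of its group-relative advantage computation: rewards for several sampled outputs of the same prompt are normalized against the group's mean and standard deviation, which replaces the baseline a separate value network would otherwise provide. The function name and `eps` parameter are illustrative, not from any particular implementation.

```python
# Sketch (illustrative, not a reference implementation) of GRPO's
# group-relative advantage: normalize each sampled output's reward
# against the mean and std of its own group, so no critic is needed.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward within its group of sampled outputs."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a reward function
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Advantages sum to ~0: above-average outputs are reinforced,
# below-average outputs are penalized, relative to the group.
```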
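The ASR and FRR entries can likewise be sketched as simple rates over judged responses. In this hypothetical example the boolean flags would come from an LLM-as-Judge or human review; the function names are made up for illustration.

```python
# Hypothetical sketch of the two safety metrics defined above.
# Each boolean marks one prompt's judged outcome.

def attack_success_rate(harmful_flags):
    """ASR: % of malicious prompts that yielded a harmful response."""
    return 100.0 * sum(harmful_flags) / len(harmful_flags)

def false_refusal_rate(refusal_flags):
    """FRR: % of benign prompts the model incorrectly refused."""
    return 100.0 * sum(refusal_flags) / len(refusal_flags)

asr = attack_success_rate([True, False, False, False])        # 25.0
frr = false_refusal_rate([False, False, True, False, False])  # 20.0
```

A Pareto-optimal safety tuning run, in these terms, lowers ASR without pushing FRR up.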