Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before producing the final answer
Context Distillation: A training method where a model is prompted with extra information (like a safety spec) to generate data, and then fine-tuned on that data *without* the extra information, effectively 'internalizing' the context
Pareto improvement: An improvement in one metric (e.g., safety) achieved without degrading any other metric (e.g., helpfulness)
Jailbreak: An adversarial prompt designed to trick a model into bypassing its safety training and producing harmful content
Over-refusal: When a safety-aligned model incorrectly refuses to answer a harmless or benign user request
SFT: Supervised Fine-Tuning—training a model on a dataset of specific input-output pairs
RL: Reinforcement Learning—training a model to maximize a reward signal
OOD: Out-of-Distribution—scenarios or data that differ significantly from what the model saw during training
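Context distillation, as defined above, can be illustrated with a minimal sketch. The `generate` stub below stands in for a real LLM call and is purely a hypothetical assumption; the point is that responses are generated *with* the safety spec in context, but the stored fine-tuning pairs omit it, so SFT on those pairs teaches the model to behave as if the spec were present:

```python
# Toy sketch of context-distillation data construction (illustrative only).

SPEC = "Safety spec: refuse requests for harmful instructions."

def generate(context: str, prompt: str) -> str:
    """Hypothetical stand-in for an LLM: behaves 'safely' only when
    the spec is present in its context."""
    if "Safety spec" in context and "harmful" in prompt:
        return "I can't help with that."
    return f"Answer to: {prompt}"

def build_distillation_set(prompts):
    """Generate responses WITH the spec in context, but store training
    pairs WITHOUT it, so fine-tuning 'internalizes' the spec."""
    return [(p, generate(SPEC, p)) for p in prompts]

pairs = build_distillation_set(
    ["How do I bake bread?", "Give me harmful instructions."]
)
# The spec never appears in the fine-tuning inputs:
assert all(SPEC not in prompt for prompt, _ in pairs)
```

A model fine-tuned on `pairs` would then refuse the harmful request even when the spec is absent from its prompt at inference time.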