LRM: Large Reasoning Model—an LLM optimized for complex reasoning tasks, typically producing a long 'thought' trace before the final answer (e.g., DeepSeek-R1, OpenAI o1)
SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset to adapt it to specific instructions or behaviors
Jailbreak: Adversarial prompts designed to bypass a model's safety filters, often by role-playing or framing harmful requests as hypothetical scenarios
PAIR: Prompt Automatic Iterative Refinement—an automated jailbreak method in which an attacker LLM iteratively refines adversarial prompts against a target model
PAP: Persuasive Adversarial Prompts—a jailbreak strategy that uses persuasion techniques to convince the model to comply
Reasoning Trajectory: The sequence of intermediate thinking steps generated by an LRM before producing a final answer
Distillation: The process of training a smaller or target model on outputs generated by a larger or more capable 'teacher' model
StrongREJECT: A benchmark for evaluating the safety of LLMs against harmful queries and jailbreak attacks
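The distillation entry above can be illustrated with a minimal data-collection sketch: a stand-in teacher function produces reasoning-style outputs, which are paired with prompts to form an SFT dataset for a student model. The `teacher_generate` function and record format here are hypothetical placeholders, not any specific model's API.

```python
def teacher_generate(prompt: str) -> str:
    # Hypothetical stub standing in for a call to a stronger teacher
    # model (e.g., an LRM emitting a 'thought' trace before its answer).
    return f"<think>reasoning about: {prompt}</think> final answer for: {prompt}"

def build_distillation_dataset(prompts: list[str]) -> list[dict]:
    # Pair each prompt with the teacher's full output; the resulting
    # records can serve as labeled data for supervised fine-tuning (SFT).
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

dataset = build_distillation_dataset(["What is 2+2?", "Name the capital of France."])
print(len(dataset))           # one record per prompt
print(dataset[0]["prompt"])   # original prompt is preserved alongside the completion
```

In practice the student is then fine-tuned so that, given the prompt, it reproduces the teacher's completion (including the reasoning trajectory, if one is present).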