LLM-as-a-Judge: Using an LLM to evaluate the outputs or inputs of another system, in this case acting as a safety classifier
CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before producing a final answer to improve accuracy
SFT: Supervised Fine-Tuning—training a model on a dataset of input-output pairs to teach it a specific task or format
DPO: Direct Preference Optimization—an alignment method that optimizes a model to prefer one response over another without needing a separate reward model
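To make the DPO definition concrete, here is a minimal sketch of the per-example DPO loss computed from sequence log-probabilities; the function name, the inputs, and the beta value are illustrative assumptions, not part of the source.

```python
import math

def dpo_loss(lp_chosen_policy, lp_rejected_policy,
             lp_chosen_ref, lp_rejected_ref, beta=0.1):
    """Per-example DPO loss from total sequence log-probabilities.

    beta scales the implicit reward; 0.1 is a commonly used default
    (assumption, tune per task).
    """
    # Implicit rewards: log-prob ratios of the policy vs. a frozen reference model
    reward_chosen = beta * (lp_chosen_policy - lp_chosen_ref)
    reward_rejected = beta * (lp_rejected_policy - lp_rejected_ref)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)) for stability
    return math.log1p(math.exp(-margin))

# The loss is smaller when the policy already prefers the chosen response
assert dpo_loss(-10.0, -20.0, -12.0, -18.0) < dpo_loss(-20.0, -10.0, -12.0, -18.0)
```

Because the reward is defined directly from the policy/reference log-ratio, no separately trained reward model is needed, which is the point of the definition above.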
KTO: Kahneman-Tversky Optimization—an alignment method using a loss function based on prospect theory, requiring only binary good/bad labels rather than paired preferences
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the main model weights and trains small adapter matrices
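The frozen-weights-plus-adapter structure in the LoRA entry can be sketched in a few lines of NumPy; the dimensions, rank, and alpha scaling value below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hidden size and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))          # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection (d -> r)
B = np.zeros((d, r))                 # trainable up-projection (r -> d), zero-initialized
alpha = 4                            # LoRA scaling hyperparameter (assumption)

x = rng.normal(size=(d,))

# Forward pass: frozen path plus scaled low-rank update B @ A
y = W @ x + (alpha / r) * (B @ (A @ x))

# With B zero-initialized, the adapted layer starts identical to the base layer
assert np.allclose(y, W @ x)
```

Only A and B (2*d*r values) would receive gradients, versus d*d for the full weight, which is where the parameter efficiency comes from.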
Jailbreak: An adversarial prompt designed to trick an LLM into bypassing its safety constraints (e.g., 'Do Anything Now' prompts)

ADR: Attack Detection Ratio—the percentage of malicious inputs correctly identified as violations by the guardrail
FPR: False Positive Rate—the percentage of safe inputs incorrectly flagged as malicious violations
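The two metrics above can be computed from a guardrail's boolean verdicts and ground-truth labels; this is a minimal sketch, and the function name and example data are assumptions for illustration.

```python
def adr_and_fpr(preds, labels):
    """Compute Attack Detection Ratio and False Positive Rate.

    preds: guardrail verdicts (True = flagged as a violation)
    labels: ground truth (True = actually malicious)
    """
    attack_verdicts = [p for p, y in zip(preds, labels) if y]
    safe_verdicts = [p for p, y in zip(preds, labels) if not y]
    adr = sum(attack_verdicts) / len(attack_verdicts)  # flagged attacks / all attacks
    fpr = sum(safe_verdicts) / len(safe_verdicts)      # flagged safe inputs / all safe inputs
    return adr, fpr

# 3 of 4 attacks caught (ADR = 0.75), 1 of 4 safe inputs wrongly flagged (FPR = 0.25)
preds  = [True, True, True, False, True, False, False, False]
labels = [True, True, True, True, False, False, False, False]
assert adr_and_fpr(preds, labels) == (0.75, 0.25)
```

Note the two metrics are computed over disjoint subsets of the data: ADR only over the malicious inputs, FPR only over the safe ones, so a guardrail must be judged on both together.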