CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
DPO: Direct Preference Optimization—a method that aligns language models by training directly on pairs of preferred and rejected responses, without training a separate reward model
MCTS: Monte Carlo Tree Search—a search algorithm that explores decision trees by simulating future outcomes to find optimal paths
SFT: Supervised Fine-Tuning—training a model on a labeled dataset of high-quality examples
VLM: Vision-Language Model—a model that processes and interprets both images and text
System 2 reasoning: A mode of thinking characterized by slow, deliberate, and analytical processing, as opposed to fast, intuitive responses
ASR: Attack Success Rate—the percentage of malicious attempts that successfully cause the model to generate harmful content
UCB: Upper Confidence Bound—a selection rule used in MCTS to balance exploring rarely visited branches against exploiting branches already known to score well
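
As an illustration of the UCB entry, here is a minimal sketch of the common UCB1 variant applied to child selection in MCTS. The function names, the (total_value, visits) representation, and the default exploration constant c = sqrt(2) are assumptions made for this example; the source does not say which UCB variant or constant is used.

```python
import math

def ucb1_score(total_value, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of one child node (hypothetical helper).

    total_value   : sum of rewards backed up through this child
    visits        : number of times this child was visited
    parent_visits : number of times the parent was visited
    c             : exploration constant (assumed default: sqrt(2))
    """
    if visits == 0:
        # Unvisited children get an infinite score so they are tried first.
        return math.inf
    exploit = total_value / visits                               # mean reward so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)    # uncertainty bonus
    return exploit + explore

def select_child(children, parent_visits, c=math.sqrt(2)):
    """Return the index of the child with the highest UCB1 score.

    children is a list of (total_value, visits) pairs.
    """
    scores = [ucb1_score(v, n, parent_visits, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With c = 0 the rule reduces to picking the child with the best mean reward (pure exploitation); larger values of c shift selection toward children with few visits.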