CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer
MCTS: Monte Carlo Tree Search—a heuristic search algorithm that incrementally builds a search tree by expanding the most promising moves (here, reasoning steps), estimating their value with random rollouts, and backpropagating the resulting rewards up the tree
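The selection phase of MCTS is typically driven by the UCT rule, which balances a node's mean reward against an exploration bonus. A minimal sketch of that rule (generic MCTS, not the paper's SI-MCTS variant; the function name and the `(total_value, visit_count)` representation are illustrative assumptions):

```python
import math

def uct_select(parent_visits, children, c=1.4):
    """Pick the index of the child maximizing the UCT score:
    mean value (exploitation) plus an exploration bonus that
    shrinks as a child accumulates visits.
    `children` is a list of (total_value, visit_count) pairs."""
    def uct(total_value, visits):
        if visits == 0:
            return float("inf")  # always try unvisited children first
        return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)
    scores = [uct(v, n) for v, n in children]
    return scores.index(max(scores))
```

After a rollout, the sampled reward is added to `total_value` and `visit_count` is incremented along the path back to the root, which is the "reward backpropagation" in the definition above.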
SI-MCTS: Safety-Informed MCTS—a variant of MCTS proposed here where rewards explicitly combine helpfulness and safety scores to guide the search for safe reasoning paths
DPO: Direct Preference Optimization—a stable alternative to PPO that optimizes the policy directly on preference pairs, without training an explicit reward model or running a reinforcement-learning loop
Step-level DPO: Applying DPO to pairs of individual reasoning steps rather than full responses, providing denser supervision
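The DPO objective for one preference pair can be written down in a few lines: the negative log-sigmoid of the beta-scaled margin between the policy's and the reference model's log-ratios. A minimal sketch (the summed log-probabilities are assumed to be computed elsewhere; for step-level DPO they would cover a single reasoning step rather than the full response):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.
    Inputs are total log-probabilities of the chosen/rejected
    completions under the policy and the frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy prefers the chosen
    # completion more strongly than the reference does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree (zero margin), the loss is log 2; it falls as the policy shifts probability mass toward the chosen step relative to the reference.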
PRM: Process Reward Model—a reward model trained to evaluate intermediate reasoning steps, used to guide search during inference
StrongReject: A stringent benchmark for evaluating jailbreak resistance, measuring how often models refuse harmful queries disguised by attacks
AlpacaEval: A benchmark measuring general helpfulness and instruction-following capability by comparing model outputs to a reference (usually GPT-4)
BoN: Best-of-N—an inference strategy where N candidates are generated and the best one is selected by a reward model
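Best-of-N reduces to a one-line selection once generation and scoring are abstracted away. A minimal sketch, where `generate` and `reward` are hypothetical stand-ins for the policy and the reward model (e.g. a PRM scoring full responses):

```python
def best_of_n(prompt, generate, reward, n=8):
    """Best-of-N sampling: draw n candidate responses for `prompt`
    and return the one the reward model scores highest.
    `generate(prompt)` -> str, `reward(response)` -> float."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)
```

The quality of the selected response is bounded by the reward model: with a noisy scorer, larger N can over-optimize toward reward-model errors rather than genuinely better answers.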
System 2 thinking: Deliberate, slow, and analytical reasoning process, contrasted with System 1 (fast, instinctive)
System 1 thinking: Fast, automatic, and instinctive processing (like standard safety training's direct refusal)
Self-rewarding: The model evaluates its own outputs to generate training signals, removing the need for an external reward model or human labels during data generation