DPO: Direct Preference Optimization—an algorithm that optimizes a policy directly from preference pairs without explicitly training a reward model in the loop (unlike PPO)
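A minimal sketch of the DPO loss for a single preference pair, assuming the per-response log-probabilities under the policy and a frozen reference model have already been computed (function and argument names here are illustrative, not from any particular library):

```python
import math

def dpo_loss(logp_chosen_policy, logp_rejected_policy,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed token log-probabilities of each response under
    the policy being trained and under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response over the rejected one, relative to the reference.
    margin = (logp_chosen_policy - logp_chosen_ref) \
           - (logp_rejected_policy - logp_rejected_ref)
    # -log sigmoid(beta * margin): near zero once the policy already
    # prefers the chosen response by a wide margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

The loss falls as the policy's preference for the chosen response grows, so gradient descent on it pushes probability mass toward preferred completions without ever materializing a reward model.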
PRM: Process Reward Model—a model trained to assign scores to intermediate steps of reasoning, rather than just the final outcome
MCTS: Monte Carlo Tree Search—a heuristic search algorithm that expands the most promising moves by simulating outcomes
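The "expand the most promising moves" part of MCTS is usually implemented with the UCT rule during the selection phase (the other phases being expansion, simulation, and backpropagation). A sketch of that scoring rule, with a simple dict-based node representation assumed for illustration:

```python
import math

def uct_score(value_sum, visits, parent_visits, c=1.41):
    """UCT (Upper Confidence Bound applied to Trees) score for a child node."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploit = value_sum / visits  # average simulated outcome so far
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children, parent_visits):
    """MCTS selection phase: descend into the child maximizing UCT."""
    return max(children,
               key=lambda ch: uct_score(ch["value"], ch["visits"], parent_visits))
```

The constant `c` trades off exploitation (high average value) against exploration (rarely visited nodes), which is what steers the search toward promising branches without abandoning uncertain ones.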
Reasoning-as-Planning (RAP): Modeling the generation of reasoning steps as a planning problem (as in chess), where future states are explored before committing to the next step
Offline Simulation: Running rollouts (simulations) during the training/data collection phase to estimate values, rather than during live inference
ReAct: Reasoning and Acting—a paradigm where LLMs generate reasoning traces and task-specific actions in an interleaved manner
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm widely used for RLHF; it requires a separately trained reward model in the loop and can be unstable to tune and resource-intensive
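The core of PPO is its clipped surrogate objective, which bounds how far each update can move the policy from the one that collected the data. A minimal per-sample sketch (names are illustrative):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate objective for one (state, action) sample.

    The probability ratio r = pi_new(a|s) / pi_old(a|s) is computed
    from log-probabilities; clipping it to [1 - eps, 1 + eps] is what
    keeps updates "proximal" to the data-collecting policy.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # Pessimistic bound: take the smaller of the two candidate gains,
    # so the objective never rewards moving outside the trust region.
    return min(ratio * advantage, clipped * advantage)
```

In practice this objective is maximized over minibatches alongside a value-function loss and an entropy bonus; the clipping is the piece that distinguishes PPO from vanilla policy gradients.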