MCTS: Monte Carlo Tree Search—a heuristic search algorithm that incrementally builds a search tree, expanding the most promising moves based on random sampling (rollouts) of the search space.
DPO: Direct Preference Optimization—a method to fine-tune language models to align with human preferences without explicitly training a reward model first.
MCTSG: MCTS with Global selection—a modification of MCTS proposed in this paper that selects the next node to expand from all available leaf nodes in the tree based on the value distribution, rather than only from the children of the current node.
UCB: Upper Confidence Bound—a formula used in search algorithms to balance exploration (trying less-visited paths) and exploitation (using high-reward paths).
SC: Self-Consistency—a technique where the model generates multiple reasoning paths and selects the most frequent answer as the final output.
STILL-1: Slow Thinking with LLMs—the specific implementation of the reasoning framework presented in this paper.
ORM: Outcome-based Reward Model—a model trained to predict the correctness of the final answer rather than individual steps.
PRM: Process-based Reward Model—a model trained to evaluate the correctness of intermediate reasoning steps.
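Two of the terms above (UCB and SC) can be made concrete with a minimal sketch. The function names, the exploration constant, and the example values below are illustrative assumptions, not details from the paper.

```python
import math
from collections import Counter


def ucb1(total_reward: float, visits: int, parent_visits: int,
         c: float = 1.414) -> float:
    """UCB: average reward (exploitation) plus an exploration bonus
    that grows for rarely visited nodes (exploration)."""
    if visits == 0:
        return float("inf")  # unvisited nodes are explored first
    return total_reward / visits + c * math.sqrt(
        math.log(parent_visits) / visits)


def majority_vote(answers: list[str]) -> str:
    """SC: pick the most frequent final answer among sampled
    reasoning paths."""
    return Counter(answers).most_common(1)[0][0]


# The exploration bonus can favor a less-visited child even when its
# average reward is lower (hypothetical node statistics):
child_a = ucb1(total_reward=8.0, visits=10, parent_visits=12)
child_b = ucb1(total_reward=1.0, visits=2, parent_visits=12)
print(child_a < child_b)  # the rarely visited child scores higher

print(majority_vote(["42", "41", "42"]))  # -> "42"
```

In a full MCTS implementation, the UCB score would be computed over a node's children at each selection step; MCTSG, as defined above, instead ranks all leaf nodes in the tree.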