PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain, rather than just the final answer
MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by simulating future outcomes to find optimal moves
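The selection step of MCTS is commonly driven by the UCB1 rule, which balances a child's average value against how rarely it has been visited. A minimal sketch (the function name and data layout are illustrative, not from the paper):

```python
import math

def ucb1_select(children, c=1.4):
    """Pick the index of the child maximizing UCB1.

    children: list of (visit_count, total_value) pairs for one node.
    Unvisited children score infinity, so each child is tried at least once.
    """
    parent_visits = sum(v for v, _ in children)

    def ucb(child):
        visits, total_value = child
        if visits == 0:
            return float("inf")  # force exploration of unvisited children
        exploit = total_value / visits
        explore = c * math.sqrt(math.log(parent_visits) / visits)
        return exploit + explore

    return max(range(len(children)), key=lambda i: ucb(children[i]))
```

With equal visit counts the exploration terms cancel and the higher-value child wins; an unvisited child is always selected first.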
MCTS∗: The paper's modified search algorithm that uses a learned value function for guidance and updates values based on search rollouts
rollout: Simulating a reasoning path from a current state to a final outcome to estimate the value of that state
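A rollout can be sketched as repeatedly sampling successor states until a terminal one is reached, then averaging the terminal rewards over several runs. All names below are illustrative placeholders, not the paper's API:

```python
import random

def rollout_value(state, step_fn, is_terminal, reward_fn,
                  n_rollouts=8, max_depth=20):
    """Estimate the value of `state` by Monte Carlo rollouts.

    step_fn(s)     -> list of candidate next states
    is_terminal(s) -> True when s is a final answer
    reward_fn(s)   -> scalar outcome reward for a terminal state
    """
    total = 0.0
    for _ in range(n_rollouts):
        s = state
        for _ in range(max_depth):
            if is_terminal(s):
                break
            s = random.choice(step_fn(s))  # sample one continuation
        total += reward_fn(s)
    return total / n_rollouts
```

The averaged return is the value estimate that search algorithms back up along the visited path.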
reasoning distance: The estimated number of steps remaining from the current state to reach a correct solution; used to weight rewards
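One simple way such a weighting could work (a hypothetical sketch, not the paper's exact formula) is to discount a reward geometrically by the estimated number of steps remaining:

```python
def distance_weighted_reward(reward, steps_remaining, gamma=0.9):
    """Hypothetical reasoning-distance weighting: states far from a
    correct solution receive an exponentially discounted reward."""
    return (gamma ** steps_remaining) * reward
```

States closer to a solution (smaller `steps_remaining`) thus retain more of the reward signal.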
BoN: Best-of-N—a strategy where the model generates N solutions and a verifier selects the best one
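Best-of-N reduces to sampling N candidates and keeping the one a verifier scores highest. A minimal sketch (function names are illustrative):

```python
def best_of_n(generate, score, n=8):
    """Generate n candidate solutions and return the one with the
    highest verifier score.

    generate() -> one sampled solution string
    score(s)   -> scalar verifier score for solution s
    """
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)
```

In practice `generate` would sample from the language model with temperature, and `score` would be the PRM or an outcome verifier.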
CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer
ToT: Tree-of-Thought—a prompting strategy that explores multiple reasoning paths in a tree structure
STaR: Self-Taught Reasoner—a self-training method where a model learns from its own correct solutions
SciBench: A benchmark of college-level scientific problems drawn from textbooks, requiring complex multi-step reasoning
MATH: A dataset of challenging competition mathematics problems requiring multi-step reasoning
GSM8K: A dataset of grade school math word problems