PRM: Process Reward Model—a model that scores intermediate reasoning steps rather than just the final answer
ORM: Outcome Reward Model—a model that scores only the final result of a reasoning chain
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
Q-value: The expected cumulative reward (here, the probability of eventually reaching a correct answer) of taking a specific action in a specific state and following the policy thereafter
BCE: Binary Cross-Entropy—a loss function for binary classification (here, labeling a step as correct vs. incorrect)
Best-of-N: A sampling strategy where N solutions are generated, and the one with the highest reward model score is selected as the final answer
Comparative Loss: A loss function that trains a model to rank pairs of items correctly (e.g., Step A > Step B) rather than scoring them independently
MATH500: A subset of the MATH benchmark consisting of 500 challenging mathematics problems
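
Illustrative sketch: the three computational terms above (Best-of-N, BCE, Comparative Loss) can be written out in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any particular paper: the function names are hypothetical, the reward model is stood in for by an arbitrary scoring callable, and the comparative loss is shown in a generic pairwise logistic (Bradley-Terry style) form, which may differ from the exact loss a given work uses.

```python
import math
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate that the scoring function rates highest.

    `score_fn` is a hypothetical stand-in for a reward model (PRM or ORM).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def bce_loss(p: float, label: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one predicted probability p and a 0/1 label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pairwise_ranking_loss(score_better: float, score_worse: float) -> float:
    """Generic pairwise logistic loss: -log(sigmoid(score_better - score_worse)).

    Assumed form for illustration only. The loss is small when the item that
    should rank higher also receives the higher score.
    """
    diff = score_better - score_worse
    # Numerically stable softplus(-diff) = log(1 + exp(-diff))
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # Toy scores standing in for reward-model outputs (made-up values).
    toy_scores = {"solution A": 0.2, "solution B": 0.9, "solution C": 0.5}
    print(best_of_n(list(toy_scores), toy_scores.get))
    print(bce_loss(0.9, 1), bce_loss(0.9, 0))
    print(pairwise_ranking_loss(2.0, 0.0), pairwise_ranking_loss(0.0, 2.0))
```

The contrast between the two losses is the point of the sketch: bce_loss looks at one item and its label in isolation, while pairwise_ranking_loss depends only on the difference between two scores, so it constrains their ordering rather than their absolute values.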