PRM: Process Reward Model—a verifier that scores each intermediate step of a solution's reasoning chain rather than only the final answer
ORM: Outcome Reward Model—a verifier that scores only the final answer of a solution
Best-of-N: A sampling strategy where N solutions are generated in parallel, and a verifier selects the highest-scoring one
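Best-of-N can be sketched in a few lines. This is a minimal illustration, not any particular paper's implementation: `generate` and `verifier` below are hypothetical stand-ins for a model's sampler and a learned scorer (here, a noisy guesser and a distance-based score).

```python
import random

def best_of_n(generate, verifier, n):
    """Draw n candidate solutions in parallel and keep the one
    the verifier scores highest."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=verifier)

# Toy stand-ins (hypothetical): the "model" proposes noisy answers
# near the true value 4, and the "verifier" rewards closeness to it.
random.seed(0)
generate = lambda: 4 + random.gauss(0, 1)
verifier = lambda ans: -abs(ans - 4)

best = best_of_n(generate, verifier, n=16)
```

With more samples (larger N), the chance that at least one candidate scores well rises, which is why Best-of-N accuracy improves with N when the verifier is reliable.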
proposal distribution: The probability distribution from which the model generates candidate answers (it can be modified, e.g., by having the model sequentially revise its own outputs)
pass@1: The fraction of problems the model solves when generating a single response per problem
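In practice pass@1 (and the general pass@k) is usually computed with the standard unbiased estimator from the Codex evaluation setup: sample n solutions per problem, count the c correct ones, and estimate the probability that at least one of k drawn solutions is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    the probability that at least one of k samples drawn without
    replacement from n (of which c are correct) is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the plain fraction of correct samples:
p1 = pass_at_k(n=10, c=3, k=1)  # 0.3
```

For k=1 this is simply c/n, i.e., single-sample accuracy averaged over the sampled solutions.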
FLOPs-matched: Comparing models/methods by equating the total floating-point operations used, ensuring a fair efficiency comparison
PaLM 2: A large language model developed by Google
STaR: Self-Taught Reasoner—a method in which a model is iteratively fine-tuned on its own reasoning chains that led to correct answers
MCTS: Monte Carlo Tree Search—a search algorithm that incrementally builds a decision tree, using random rollouts to estimate the value of each branch and balance exploration against exploitation