PRM: Process Reward Model—a model that evaluates the correctness of each intermediate step in a reasoning chain, rather than just the final answer
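To make the distinction concrete, here is a minimal sketch of how a PRM might be applied at inference time. The `score_step` function is a hypothetical stand-in for a trained model; aggregating per-step scores by product (or by minimum) into a solution-level score is one common convention, not the only one.

```python
# Sketch: scoring a reasoning chain with a hypothetical step-level reward model.

def score_step(question: str, steps_so_far: list[str], step: str) -> float:
    # A real PRM is typically a fine-tuned LLM returning P(step is correct);
    # here we return a fixed placeholder score.
    return 0.9

def score_solution(question: str, steps: list[str]) -> float:
    """Aggregate per-step scores into a solution score (product aggregation)."""
    score = 1.0
    for i, step in enumerate(steps):
        score *= score_step(question, steps[:i], step)
    return score

score_solution("What is 2+2?", ["2+2 = 4", "Answer: 4"])  # 0.9 * 0.9
```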
ORM: Outcome Reward Model—a model that evaluates reasoning based only on the final result
DPO: Direct Preference Optimization—a method to align models to preferences by optimizing a policy directly on preference pairs without a separate reward model
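The DPO objective can be written down in a few lines. This is a toy scalar version of the per-pair loss, -log σ(β[(log π(y_w|x) − log π_ref(y_w|x)) − (log π(y_l|x) − log π_ref(y_l|x))]); real implementations compute it with a tensor library over batches of sequence log-probabilities.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    The margin compares how much more the policy prefers the chosen
    response over the rejected one, relative to the reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin); margin of 0 gives log(2) ≈ 0.693
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```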
Pass@1: The fraction of problems a model solves when it generates a single solution per problem
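As a metric, Pass@1 reduces to a simple fraction over per-problem outcomes:

```python
def pass_at_1(solved: list[bool]) -> float:
    """Fraction of problems solved by the single sampled solution."""
    return sum(solved) / len(solved)

pass_at_1([True, False, True, True])  # 3 of 4 problems solved -> 0.75
```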
Best-of-N: An inference strategy where the model generates N solutions and a reward model selects the best one
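The Best-of-N loop is short enough to sketch directly. Here `generate_solution` and `reward` are hypothetical stand-ins for an LLM sampler and a reward model:

```python
import random

def generate_solution(problem: str) -> str:
    # Stand-in for sampling one solution from an LLM.
    return f"candidate-{random.randint(0, 9)}"

def reward(problem: str, solution: str) -> float:
    # Stand-in for a reward model's score; a PRM or ORM would go here.
    return random.random()

def best_of_n(problem: str, n: int = 16) -> str:
    """Sample n candidate solutions and return the highest-scoring one."""
    candidates = [generate_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda s: reward(problem, s))
```

With a PRM as the scorer, `reward` would aggregate step-level scores as in the PRM entry above.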
MCTS: Monte Carlo Tree Search—a search algorithm used to explore reasoning paths, often used to estimate step values in PRMs
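A full MCTS implementation adds a search tree with selection and backpropagation; the core idea used for labeling PRM training data, though, is the Monte Carlo value estimate at a node: roll out several completions from a partial chain and count how often they reach the correct answer. A minimal sketch, with `rollout_from` as a hypothetical stand-in for an LLM sampler:

```python
import random

def rollout_from(question: str, partial_steps: list[str]) -> str:
    # Stand-in for sampling a completion from an LLM and extracting
    # its final answer; here we pick randomly between two answers.
    return random.choice(["4", "5"])

def estimate_step_value(question: str, partial_steps: list[str],
                        gold_answer: str, n_rollouts: int = 8) -> float:
    """Monte Carlo value of a partial reasoning chain: the fraction of
    sampled completions that reach the gold answer."""
    hits = sum(rollout_from(question, partial_steps) == gold_answer
               for _ in range(n_rollouts))
    return hits / n_rollouts
```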
SFT: Supervised Fine-Tuning—training a model on labeled examples
PRM800K: A large-scale dataset of roughly 800K human-annotated step-level correctness labels for solutions to mathematical problems, commonly used to train process reward models
ProcessBench: A benchmark dataset designed to evaluate a model's ability to identify the first error in a mathematical reasoning chain
LLM-as-a-judge: Using a strong Large Language Model to evaluate the outputs of other models
inference-time scaling: Improving model performance during generation (inference) by using more compute, such as sampling multiple paths or verifying steps