PRM: Process Reward Model—a model that scores each individual step of a reasoning chain rather than just the final answer
ORM: Outcome Reward Model—a model that scores the entire generated solution based on whether the final answer is correct
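The PRM/ORM distinction can be made concrete with a small sketch. This is illustrative only (the function names and the minimum-over-steps aggregation rule are our choices, not taken from the source; aggregating step scores by their minimum is one common convention for turning per-step PRM scores into a solution-level score):

```python
# Illustrative sketch: a PRM produces one score per reasoning step, while an
# ORM produces a single score for the whole solution. Names are hypothetical.

def prm_solution_score(step_scores: list[float]) -> float:
    # One common aggregation: a solution is only as good as its weakest step
    return min(step_scores)

def orm_solution_score(solution_score: float) -> float:
    # An ORM already scores the entire solution with one number
    return solution_score

# A chain with two strong steps and one weak step is penalized by the PRM view
print(prm_solution_score([0.9, 0.8, 0.3]))  # → 0.3
```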
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to fine-tune LLMs, here applied step-by-step using PRM rewards
Hard Estimation: A binary labeling strategy where a step is labeled '1' if *any* generated completion leads to the correct answer, and '0' otherwise
Soft Estimation: A continuous labeling strategy where a step's label is the *fraction* of generated completions that reach the correct answer
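The two labeling strategies above can be sketched directly. This is a minimal illustration (function names are ours; the input is simply a list recording whether each sampled completion reached the correct final answer):

```python
# Minimal sketch of hard vs. soft step labeling, assuming we already know
# which of the N sampled completions from this step reached the correct answer.

def hard_label(completion_correct: list[bool]) -> int:
    # '1' if ANY completion from this step leads to the correct answer
    return 1 if any(completion_correct) else 0

def soft_label(completion_correct: list[bool]) -> float:
    # Fraction of completions that reach the correct answer
    return sum(completion_correct) / len(completion_correct)

outcomes = [True, False, False, True]  # e.g. 4 sampled completions
print(hard_label(outcomes))  # → 1
print(soft_label(outcomes))  # → 0.5
```

Note that hard estimation collapses any nonzero success rate to 1, while soft estimation preserves how reliably the step leads to a correct answer.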
RFT: Rejection Sampling Fine-Tuning—a method where the model is fine-tuned on its own correct outputs
GSM8K: Grade School Math 8K—a benchmark dataset of grade-school level math word problems
MATH: Mathematics Dataset—a challenging dataset of competition-level math problems
Self-Consistency: A verification method that samples multiple reasoning paths and selects the answer that appears most frequently (majority voting)
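Majority voting over sampled answers is simple enough to sketch in a few lines (the function name is ours; the input is the final answer extracted from each sampled reasoning path):

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    # Pick the final answer that appears most often across sampled paths
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled reasoning paths, three of which agree on "42"
print(self_consistency(["42", "41", "42", "42", "7"]))  # → 42
```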
Completer: A language model used to generate full reasoning paths starting from a specific intermediate step to check if that step can lead to the correct answer
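The completer ties the pieces above together: it rolls out full solutions from an intermediate step, and the outcomes of those rollouts become the step's hard and soft labels. Below is a hedged sketch, assuming a `sample_completion` placeholder stands in for decoding a full solution from the completer LLM (here it just cycles through canned toy answers so the example runs):

```python
from itertools import cycle

# Placeholder for the completer model: a real implementation would decode a
# full reasoning path from the LLM conditioned on the steps so far. The toy
# answers below are fabricated purely so this sketch is runnable.
_toy_answers = cycle(["4", "5", "4", "4"])

def sample_completion(question: str, steps_so_far: list[str]) -> str:
    return next(_toy_answers)

def label_step(question: str, steps_so_far: list[str],
               gold_answer: str, n: int = 8) -> tuple[int, float]:
    # Roll out n completions from this step and check their final answers
    finals = [sample_completion(question, steps_so_far) for _ in range(n)]
    correct = [a == gold_answer for a in finals]
    hard = 1 if any(correct) else 0      # hard estimation
    soft = sum(correct) / n              # soft estimation
    return hard, soft

print(label_step("What is 2+2?", ["2+2 = 2*2"], gold_answer="4"))  # → (1, 0.75)
```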