PRM: Process Reward Model—a model that assigns a score to each intermediate step of a reasoning chain rather than just the final outcome
ORM: Outcome Reward Model—a model that scores only the final answer of a reasoning chain
KL-regularization: A constraint used in RL to prevent the new policy from deviating too far from the reference (initial) policy, ensuring stability and retaining prior knowledge
Math-Shepherd: A baseline method for automatically labeling process rewards by estimating the probability of reaching a correct answer from a given step
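The Math-Shepherd labeling idea can be sketched as a Monte Carlo estimate: from a given intermediate step, sample several completions and use the fraction that reach the correct final answer as that step's process label. A minimal sketch, assuming a hypothetical `sample_completion` callable (everything here is illustrative, not the paper's implementation):

```python
import random

def estimate_step_label(sample_completion, step_prefix, correct_answer, n_rollouts=8):
    """Math-Shepherd-style label: estimate P(correct final answer | reasoning prefix)
    by sampling n_rollouts completions from the given step and counting successes."""
    hits = sum(
        sample_completion(step_prefix) == correct_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts

# Toy completion sampler (hypothetical): succeeds with a prefix-dependent probability.
def toy_sampler(prefix):
    p_correct = 0.9 if "good step" in prefix else 0.2
    return "42" if random.random() < p_correct else "wrong"

random.seed(0)
label = estimate_step_label(toy_sampler, "good step", "42", n_rollouts=100)
print(round(label, 2))  # typically close to 0.9
```

In practice the rollouts come from the policy model itself, which is what makes the labels "automatic" but also noisy and expensive to collect.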
Best-of-N: An inference-time strategy where N solutions are generated, and the one with the highest reward score is selected
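Best-of-N selection itself is a one-liner once a scoring function exists; a minimal sketch with hypothetical `generate` and `reward_model` callables:

```python
def best_of_n(generate, reward_model, prompt, n=8):
    """Generate n candidate solutions and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins (hypothetical): candidates are strings, reward is their length.
solutions = iter(["a", "abc", "ab"])
pick = best_of_n(lambda p: next(solutions), len, "prompt", n=3)
print(pick)  # "abc"
```

With a PRM, `reward_model` would aggregate per-step scores (e.g. via the soft-max aggregation defined below) into a single scalar per candidate.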
Rejection Sampling: A training strategy where samples are generated, filtered/ranked by a reward model, and the best ones are used to fine-tune the policy
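The rejection-sampling loop reduces to: sample, rank by reward, keep the top few as supervised fine-tuning pairs. A hedged sketch with hypothetical `generate` and `reward_model` callables:

```python
import itertools

def rejection_sampling_dataset(prompts, generate, reward_model, n=8, keep=1):
    """For each prompt, sample n candidates, rank them by reward, and keep the
    top `keep` as (prompt, response) pairs for supervised fine-tuning."""
    dataset = []
    for prompt in prompts:
        candidates = sorted(
            (generate(prompt) for _ in range(n)),
            key=reward_model,
            reverse=True,
        )
        dataset.extend((prompt, c) for c in candidates[:keep])
    return dataset

# Toy example (hypothetical reward = numeric value of the sampled string).
samples = itertools.cycle(["3", "7", "5"])
data = rejection_sampling_dataset(["q1"], lambda p: next(samples), int, n=3, keep=1)
print(data)  # [('q1', '7')]
```

Unlike Best-of-N, which spends the reward model at inference time, this spends it at training time: the filtered pairs are fed back into fine-tuning the policy.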
CoT: Chain-of-Thought—a prompting strategy that encourages the model to generate intermediate reasoning steps
Soft-max: In this context, a specific aggregation function for per-step rewards derived from the entropy-regularized objective: log(E[exp(reward)]), a log-mean-exp that interpolates between the mean and the maximum of the step rewards
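The soft-max aggregation is easy to compute directly; the only subtlety is numerical stability, handled by subtracting the maximum before exponentiating. A minimal sketch:

```python
import math

def softmax_aggregate(step_rewards):
    """Aggregate per-step rewards as log(E[exp(r)]) (log-mean-exp),
    computed stably by factoring out the max before exponentiating."""
    m = max(step_rewards)
    return m + math.log(sum(math.exp(r - m) for r in step_rewards) / len(step_rewards))

# Log-mean-exp sits between the mean (1.0) and the max (2.0) of the step rewards.
rewards = [0.0, 1.0, 2.0]
agg = softmax_aggregate(rewards)
print(round(agg, 3))
```

Because it soft-approximates the maximum, this aggregation weights the highest-scoring steps more heavily than a plain average would.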
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using reward signals derived from human preferences or, in this case, correctness checks