ORM: Outcome Reward Model—a verifier that predicts the correctness of a complete final solution
PRM: Process Reward Model—a verifier that scores intermediate reasoning steps
PAV: Process Advantage Verifier—the proposed model that predicts the 'advantage' (progress) of a step under a prover policy
Prover Policy: A policy (distinct from the base policy) used to estimate the value of states for calculating advantages; serves as the 'judge' of progress
Advantage: The difference between the value of taking a specific action at a state and the average value of that state; measures how much better/worse an action is relative to expectation
Best-of-K: A policy strategy that samples K solutions and selects the best one according to a verifier; used here as a strong 'prover' policy
Pass@K: The probability that at least one of K generated solutions is correct
SFT: Supervised Fine-Tuning—training on labeled demonstrations
RFT: Rejection Fine-Tuning—fine-tuning on self-generated samples that are verified as correct
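To make the Advantage entry concrete, a minimal sketch of a Monte-Carlo estimate under a prover policy: V(s) is the fraction of rollouts from the state that reach a correct answer, Q(s, a) is the same fraction after committing to the step, and the advantage is their difference. The rollout outcomes below are made up for illustration; they are not from the paper.

```python
def mc_value(rollout_outcomes):
    """Monte-Carlo value: fraction of rollouts ending in a correct answer."""
    return sum(rollout_outcomes) / len(rollout_outcomes)

# Hypothetical rollout outcomes under the prover policy
# (1 = correct final answer, 0 = incorrect)
rollouts_from_state = [1, 0, 0, 1, 0, 0, 0, 0]  # rollouts from state s
rollouts_after_step = [1, 1, 0, 1, 1, 0, 1, 0]  # rollouts after step a

q_sa = mc_value(rollouts_after_step)  # Q(s, a): value of taking step a
v_s = mc_value(rollouts_from_state)   # V(s): average value of state s
advantage = q_sa - v_s                # A(s, a) > 0 means the step made progress
```

Here the step raises the prover's success rate from 0.25 to 0.625, so its estimated advantage is 0.375.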
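The Pass@K entry is commonly computed with the unbiased estimator: given n total generations of which c are correct, the probability that a random subset of k contains at least one correct solution is 1 − C(n−c, k)/C(n, k). A short sketch (the example numbers are illustrative, not from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 2 correct: Pass@1 = 1 - 8/10 = 0.2
```

Note that Pass@K measures whether any sample is correct, whereas Best-of-K additionally requires a verifier to pick that sample out.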
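Best-of-K itself reduces to sampling K candidates and returning the verifier's top-scored one. A minimal sketch, with a hypothetical sampler and verifier score passed in as callables:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def best_of_k(sample: Callable[[], T],
              verifier_score: Callable[[T], float],
              k: int) -> T:
    """Draw k candidate solutions and return the one the verifier scores highest."""
    candidates = [sample() for _ in range(k)]
    return max(candidates, key=verifier_score)

# Toy usage: a fixed pool of candidate answers and a trivial scoring rule
pool = iter(["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"])
best = best_of_k(lambda: next(pool),
                 lambda s: 1.0 if s.endswith("= 4") else 0.0,
                 k=3)
```

Under a strong verifier, this selection policy serves as the "prover" against which step-level advantages are judged.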