Test-Time Scaling (TTS): Improving model performance by increasing computation during inference (e.g., generating more samples or searching deeper) rather than training larger models
Process Reward Model (PRM): A model that evaluates and scores intermediate steps of a reasoning chain, rather than just the final answer
Best-of-N (BoN): A sampling strategy where the model generates N complete solutions, and the best one is selected based on a scoring function
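Best-of-N is simple enough to sketch in a few lines. This is a minimal illustration, not any particular library's API: `generate` and `score` are hypothetical stand-ins for a sampling call to the model and a scoring function (e.g. a PRM or outcome reward model).

```python
def best_of_n(generate, score, n: int) -> str:
    """Sample n complete solutions and return the highest-scoring one."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy demo: a deterministic "generator" cycling over canned answers,
# and a dummy scorer that just prefers longer strings (for illustration only).
answers = iter(["42", "The answer is 42", "idk"])
best = best_of_n(lambda: next(answers), score=lambda s: len(s), n=3)
print(best)  # -> "The answer is 42"
```

In practice the scoring function dominates the quality of the selected answer; with a PRM, each candidate's step scores are typically aggregated (e.g. by taking the minimum or the product) before ranking.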
DVTS: Diverse Verifier Tree Search—an extension of beam search that explores independent subtrees to increase solution diversity
RLHFlow: An open-source project that releases Process Reward Models trained on mathematical reasoning data
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
Pass@1: The probability that the model generates a correct answer in a single attempt
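Pass@1 is the k=1 case of the standard unbiased pass@k estimator: given n sampled solutions of which c are correct, pass@k = 1 - C(n-c, k)/C(n, k), which reduces to c/n for k=1. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n total samples, c of them correct."""
    if n - c < k:
        # Fewer incorrect samples than k draws: success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # -> 0.3, i.e. c/n
```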
OOD: Out-of-Distribution—when a model encounters data significantly different from what it was trained on