VGS: Value-Guided Search—a search method where a value model scores partial generation blocks to guide the beam search process
PRM: Process Reward Model—a model that scores intermediate steps of reasoning rather than just the final outcome
ORM: Outcome Reward Model—a model that predicts the final reward (correctness) of a complete response
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer
TTC: Test-Time Compute—techniques to improve model performance during inference (e.g., by sampling more answers or searching), rather than during training
WMV: Weighted Majority Vote—an aggregation method where answers are voted on, but each vote is weighted by the model's confidence/value score
BoN: Best-of-N—a sampling strategy where N solutions are generated and the one with the highest reward model score is selected
Roll-out: Generating a completion from a specific point in the reasoning trace to estimate the value of that state
Roll-in: The partial reasoning trace generated up to the current point
Block-wise search: Search strategy that generates and scores chunks of tokens (blocks) at a time, rather than sentence-by-sentence or token-by-token