PRM: Process Reward Model—a verifier that scores the correctness of each intermediate step in a chain of reasoning
ORM: Outcome Reward Model—a verifier that scores only the final complete response
Monte Carlo (MC) Estimation: A method that estimates a step's correctness by sampling multiple completions starting from that step and computing the fraction that reach the correct final answer
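As a minimal sketch of the MC estimation idea (the `sample_completion` callable is a hypothetical stand-in for an LLM sampler, not an API from the paper):

```python
import random

def mc_step_correctness(prefix_steps, correct_answer, sample_completion, n_samples=8):
    """Estimate a reasoning prefix's correctness by rolling out completions.

    Returns the fraction of sampled completions whose final answer
    matches `correct_answer`.
    """
    hits = sum(
        sample_completion(prefix_steps) == correct_answer
        for _ in range(n_samples)
    )
    return hits / n_samples

# Toy sampler: a prefix containing a "good step" always reaches 42;
# otherwise the final answer is noisy.
def toy_sampler(prefix):
    return 42 if "good step" in prefix else random.choice([41, 42, 43])

print(mc_step_correctness(["good step"], 42, toy_sampler))  # 1.0
```

A score near 1.0 suggests the step is on a correct path; a score near 0.0 suggests it has derailed the reasoning.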
Best-of-N (BoN): An evaluation strategy where N solutions are generated, ranked by a reward model, and the top-ranked solution is selected
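The BoN selection rule itself is a one-liner; here is a sketch in which a trivial scorer stands in for a real reward model:

```python
def best_of_n(candidates, reward_model):
    """Best-of-N: score each candidate solution and return the top-ranked one."""
    return max(candidates, key=reward_model)

# Hypothetical stand-in reward model: prefer longer solutions.
solutions = ["short", "a medium answer", "the longest candidate answer"]
print(best_of_n(solutions, reward_model=len))  # "the longest candidate answer"
```

In practice `reward_model` would be a PRM or ORM scoring each sampled solution, and N (the number of candidates) trades off compute against final accuracy.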
LLM-as-a-judge: Using a strong language model, via prompting, to evaluate the quality or correctness of another model's outputs
Consensus Filtering: The paper's proposed data cleaning method that keeps only samples where MC estimation and LLM-as-a-judge agree on error locations
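Consensus filtering reduces to an agreement check between the two annotators. A sketch, assuming each training sample records the first erroneous step index from both MC estimation and the LLM judge (`None` meaning no error found):

```python
def consensus_filter(samples):
    """Keep only samples where MC estimation and the LLM judge agree
    on the location of the first reasoning error."""
    return [s for s in samples if s["mc_error_step"] == s["judge_error_step"]]

data = [
    {"id": 1, "mc_error_step": 3, "judge_error_step": 3},        # agree -> keep
    {"id": 2, "mc_error_step": 2, "judge_error_step": 4},        # disagree -> drop
    {"id": 3, "mc_error_step": None, "judge_error_step": None},  # both say no error -> keep
]
print([s["id"] for s in consensus_filter(data)])  # [1, 3]
```

The field names here are illustrative; the point is that a sample survives only when both methods point to the same error location (or both find none).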
ProcessBench: A benchmark specifically designed to evaluate a model's ability to identify the exact step where a reasoning error occurs
Hard labels: Binary training targets (0 or 1) rather than continuous probabilities (soft labels)
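One plausible way to derive hard labels from MC estimates is simple thresholding (the 0.5 cutoff below is an assumption for illustration, not necessarily the paper's rule):

```python
def to_hard_label(mc_score, threshold=0.5):
    """Collapse a continuous MC correctness estimate into a binary target."""
    return 1 if mc_score >= threshold else 0

print([to_hard_label(s) for s in [0.0, 0.4, 0.5, 0.9]])  # [0, 0, 1, 1]
```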