PRM: Process Reward Model—a model that evaluates the correctness of intermediate steps in a reasoning chain.
GenPRM: Generative Process Reward Model—the authors' proposed method that reasons and writes code to verify steps.
RPE: Relative Progress Estimation—a labeling method that compares the Monte Carlo success rate after the current step against that after the previous step to decide whether the step is correct.
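The RPE labeling rule above can be sketched in a few lines. This is an illustrative assumption of the comparison logic, not the paper's exact formulation; the function name, arguments, and threshold are hypothetical.

```python
def rpe_label(mc_prev: float, mc_curr: float, threshold: float = 0.0) -> int:
    """Label a reasoning step via Relative Progress Estimation (sketch):
    1 (correct) if the Monte Carlo success rate after this step matches or
    improves on the rate after the previous step, else 0 (incorrect).
    The zero threshold is an illustrative default, not the paper's value."""
    return 1 if mc_curr - mc_prev >= threshold else 0

# Success rate rises from 0.4 to 0.6 -> step labeled correct.
print(rpe_label(0.4, 0.6))  # -> 1
# Success rate drops from 0.6 to 0.3 -> step labeled incorrect.
print(rpe_label(0.6, 0.3))  # -> 0
```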
Test-Time Scaling: Improving model performance during inference by increasing computational cost (e.g., generating multiple samples and voting).
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps.
SFT: Supervised Fine-Tuning—further training a pretrained model on labeled input-output examples.
Pass@1: The fraction of problems solved correctly when generating a single solution per problem.
Maj@N: Majority voting accuracy over N generated solutions/paths.
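Majority voting, the mechanism behind Maj@N and a common form of test-time scaling, can be sketched as follows. The function name and the sample answers are illustrative, not from the source.

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most frequent final answer among N sampled solutions.
    Maj@N scores a problem correct if this majority answer is correct."""
    return Counter(final_answers).most_common(1)[0][0]

# Five sampled solutions to one problem; "42" appears most often and wins.
samples = ["42", "41", "42", "42", "40"]
print(majority_vote(samples))  # -> 42
```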
Critic: A model role where the LLM provides feedback to refine a generated solution rather than just scoring it.