PRM: Process Reward Model—a model trained to score the correctness of each intermediate step in a reasoning chain, rather than just the final answer
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes a policy by comparing a group of outputs generated for the same input, avoiding the need for a separate value function
PS-GRPO: Process-Supervised Group Relative Policy Optimization—the authors' proposed RL method that uses PRM 'drop-moments' to penalize inconsistent reasoning paths even if the final answer is correct
CoT: Chain-of-Thought—a prompting strategy where the model generates intermediate reasoning steps before the final answer
TTS: Test-Time Scaling—improving performance during inference by generating multiple solutions and selecting the best one (e.g., using a PRM)
Drop-moment: a step in a reasoning chain at which the PRM's predicted correctness score falls sharply relative to the previous step, indicating a potential error
MCTS: Monte Carlo Tree Search—a search algorithm used here to estimate the correctness probability of reasoning steps by simulating multiple future outcomes
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs
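
To make the GRPO, PS-GRPO, and drop-moment entries above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the drop threshold of 0.3, the penalty of 0.5, the use of a population standard deviation, and the small eps term are all hypothetical choices that do not come from the source.

```python
from statistics import mean, pstdev


def find_drop_moments(step_scores, threshold=0.3):
    """Return indices of steps whose PRM score falls by more than
    `threshold` relative to the previous step.

    `threshold` is a hypothetical value chosen for illustration.
    """
    return [
        i
        for i in range(1, len(step_scores))
        if step_scores[i - 1] - step_scores[i] > threshold
    ]


def shaped_reward(is_correct, step_scores, penalty=0.5):
    """Outcome reward reduced when a correct answer contains a drop-moment.

    `penalty` is a hypothetical value; the source only says that such
    paths are penalized even if the final answer is correct.
    """
    if not is_correct:
        return 0.0
    return 1.0 - penalty if find_drop_moments(step_scores) else 1.0


def group_relative_advantages(rewards, eps=1e-8):
    """Score each output against the other outputs sampled for the same
    input: subtract the group mean and divide by the group standard
    deviation. No learned value function is involved.

    Using the population standard deviation and `eps` is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled solutions for one question: (final answer correct?, PRM step scores)
    group = [
        (True, [0.90, 0.88, 0.91]),        # correct, consistent reasoning
        (True, [0.90, 0.85, 0.40, 0.45]),  # correct, but contains a drop-moment
        (False, [0.80, 0.30, 0.20]),       # incorrect
        (False, [0.70, 0.65, 0.60]),       # incorrect
    ]
    rewards = [shaped_reward(ok, scores) for ok, scores in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, the second solution reaches the right answer but receives a lower reward than the first because its PRM scores drop sharply at one step, so its group-relative advantage is smaller. This is the behavior the PS-GRPO entry describes.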