GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of responses to the same prompt to compute advantages, stabilizing training without a separate value function
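The group-normalized advantage at the core of GRPO can be sketched in a few lines; this is a minimal illustration, not the full algorithm (whether population or sample standard deviation is used is an implementation detail):

```python
import statistics

def group_relative_advantages(rewards):
    # Advantage of each sample = (reward - group mean) / group std,
    # so advantages are relative to the other responses in the group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Binary rewards for four responses to the same prompt:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# correct answers get positive advantage, incorrect ones negative
```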
Chain-of-Thought: A prompting or training technique where the model generates intermediate reasoning steps before producing a final answer
Test-time compute: The amount of computational resources (time, tokens, or parallel samples) used during inference to improve output quality
Elo rating: A rating system calculated from pairwise win/loss records (originally for chess) used here to rank multiple model responses
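The standard Elo update rule, shown here as a minimal sketch (the K-factor of 32 is a conventional choice, not something specified by the paper):

```python
def elo_update(r_a, r_b, score_a, k=32):
    # score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    # Expected score follows the logistic curve with a 400-point scale.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated responses; A wins the pairwise comparison:
a, b = elo_update(1000, 1000, 1.0)
# expected score was 0.5, so A gains k/2 = 16 points and B loses 16
```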
Best-of-N: An inference strategy where N candidate responses are generated, and a reward model selects the best one
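Best-of-N selection reduces to an argmax over sampled candidates; a minimal sketch, where `generate` and `reward` are hypothetical stand-ins for the sampling policy and the reward model:

```python
def best_of_n(prompt, generate, reward, n=4):
    # Sample n candidate responses, then keep the one the
    # reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy stand-ins: a fixed sequence of "samples" and a length-based score.
answers = iter(["42", "forty-two", "41", "42!"])
pick = best_of_n("q", lambda p: next(answers), lambda c: len(c))
```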
Scalar Reward Model: A standard reward model architecture that outputs a single numerical score for a response, usually via a regression head
Generative Reward Model: A reward model that outputs text (like a judge) rather than just a number, allowing for reasoning or explanation
DeepSeek-R1: A strong reasoning model family used as the initialization checkpoint for the RRMs in this paper