GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards across a group of outputs generated from the same prompt, avoiding the need for a separate value network
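The group-relative advantage estimate described above can be sketched in a few lines; the function name and the binary-reward example are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward is normalized by the
    mean and standard deviation of its group (all outputs sampled
    from the same prompt), so no value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled outputs for one prompt, binary correctness rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct outputs receive positive advantages and incorrect ones negative, with the group mean serving as the baseline.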
Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct
Test Loss: Defined in this paper as 1 - (Correct Answers / Total Answers), serving as a proxy for RL reward minimization
FLOPs: Floating Point Operations—a count of the total floating-point operations performed, used as a measure of computational cost (not to be confused with FLOP/s, operations per second, a measure of hardware throughput)
CoT: Chain-of-Thought—a prompting strategy that encourages models to generate intermediate reasoning steps
Data Reuse: The strategy of training on the same data samples multiple times (epochs) rather than using new unique samples
Saturation: The phenomenon where increasing model size yields diminishing improvements in learning efficiency
Learning Efficiency k(N): A term in the paper's power-law equation representing how effectively a model of size N converts compute/data into loss reduction
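To make the roles of saturation and k(N) concrete, here is a toy sketch of a saturating loss curve in which k acts as a rate constant; the exponential functional form and parameter values are illustrative assumptions, not the paper's fitted equation:

```python
import math

def loss_curve(D, k, L0=1.0, Lmin=0.1):
    """Hypothetical saturating loss curve: starting loss L0 decays
    toward an irreducible floor Lmin as data/compute D grows, at a
    rate set by the learning efficiency k. Illustrative form only."""
    return Lmin + (L0 - Lmin) * math.exp(-k * D)

# A model with higher learning efficiency reaches low loss with less data.
fast = loss_curve(10.0, k=0.5)
slow = loss_curve(10.0, k=0.1)
```

Saturation then corresponds to k(N) growing sublinearly (or plateauing) in model size N, so doubling N yields less than a doubling of effective learning speed.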
VeRL: A large-scale reinforcement learning training framework for LLMs, used to run the paper's experiments