RLVR: Reinforcement Learning with Verifiable Rewards—RL training that uses an objectively checkable correctness signal (e.g., whether a math answer matches the reference) as the reward.
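A minimal sketch of what a verifiable reward might look like for math problems. The function name and the exact-match criterion are illustrative assumptions; real verifiers typically normalize expressions (e.g., with a symbolic math library) before comparing.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the model's boxed final answer matches the
    reference exactly, else 0.0. Illustrative sketch only; production
    verifiers canonicalize expressions before comparison."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

Because the reward is computed from ground truth rather than a learned reward model, it cannot be gamed by reward hacking in the usual sense—the model only scores by actually producing the right answer.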
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value function critic.
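The group-normalization step at the heart of GRPO can be sketched as follows; the helper name is hypothetical, and this shows only the advantage computation, not the full clipped surrogate objective or KL penalty.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """For one prompt, sample G outputs and score each with the verifiable
    reward. The advantage of output i is its reward standardized within
    the group: A_i = (r_i - mean(r)) / (std(r) + eps). The group mean
    acts as the baseline, so no learned value critic is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary rewards, this simply pushes probability mass toward the correct samples in the group and away from the incorrect ones; if every sample in the group gets the same reward, all advantages are (near) zero and the prompt contributes no gradient.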
1-shot RLVR: The proposed method of applying RLVR using a dataset consisting of exactly one training example repeated many times.
post-saturation generalization: The phenomenon where the model's performance on test data continues to improve even after it has achieved 100% accuracy on the training data and training loss has stabilized.
historical variance score: A metric used to select the training example, calculated as the variance of the training accuracy over epochs when the model is trained on the full dataset.
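The metric itself is just a variance over an accuracy history; a sketch under the assumption that per-epoch training accuracies from a full-dataset run have already been logged (the helper name is hypothetical):

```python
from statistics import pvariance

def historical_variance_score(acc_history: list[float]) -> float:
    """Variance of one example's training accuracy across epochs of a
    full-dataset RLVR run. Examples the model always solves (or never
    solves) score 0; examples it solves intermittently score high and
    are candidates for selection as the single 1-shot training example."""
    return pvariance(acc_history)
```

Intuitively, a high-variance example is one the model is actively learning—neither trivial nor hopeless—which is why it makes a productive lone training signal.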
policy gradient loss: The component of the loss function that encourages the model to increase the probability of high-reward actions (correct answers) and decrease that of low-reward ones.
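A simplified REINFORCE-style version of this term, shown for intuition; GRPO's actual objective additionally uses a clipped probability ratio and a KL penalty, which are omitted here.

```python
def policy_gradient_loss(logprobs: list[float],
                         advantages: list[float]) -> float:
    """Surrogate loss L = -(1/N) * sum_i A_i * log pi(a_i).
    Minimizing L raises the log-probability of positive-advantage
    (correct) samples and lowers it for negative-advantage ones."""
    n = len(logprobs)
    return -sum(a * lp for lp, a in zip(logprobs, advantages)) / n
```

Note the sign convention: a correct sample (positive advantage) with low log-probability contributes a large positive loss, so gradient descent increases its probability.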
grokking: A phenomenon where generalization suddenly occurs long after training accuracy saturates, typically driven by weight decay regularization (distinct from the mechanism here).
DeepScaleR: A recent dataset and method for scaling reasoning capabilities; the paper uses a subset of its data as a baseline.
entropy loss: A regularization term added to the loss function to encourage the model to maintain diversity in its outputs (exploration), preventing premature convergence to a single solution.
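The entropy term is the Shannon entropy of the model's output distribution; a sketch over an explicit probability vector (in practice it is computed from the policy's next-token logits and scaled by a coefficient before being subtracted from the total loss):

```python
import math

def entropy_bonus(probs: list[float]) -> float:
    """Shannon entropy H(p) = -sum_i p_i * log(p_i). Subtracting
    beta * H from the total loss rewards high entropy, keeping the
    sampled outputs diverse and delaying premature collapse onto a
    single solution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform distribution maximizes the bonus; a deterministic one yields zero, so the gradient actively pushes back against over-confident, collapsed policies.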
format reward: A reward given simply for adhering to a specific output format (e.g., boxing the final answer), regardless of correctness.
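A minimal sketch of such a reward, assuming the required format is a `\boxed{...}` final answer; the exact format criteria vary by setup, and this check deliberately ignores correctness.

```python
import re

def format_reward(completion: str) -> float:
    """Reward 1.0 merely for presenting some final answer inside
    \\boxed{...}, whether or not that answer is correct. Typically
    combined (e.g., summed with a smaller weight) with the
    correctness-based verifiable reward."""
    return 1.0 if re.search(r"\\boxed\{[^}]+\}", completion) else 0.0
```

A format-only reward gives the model a dense, easy-to-earn signal for producing parseable outputs, which in turn makes the correctness reward computable at all.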