RLVR: Reinforcement Learning with Verifiable Rewards—optimizing LLMs using binary correctness signals (e.g., correct answer in math/code)
Pass@k: The probability that at least one solution is correct when k solutions are sampled from the model
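Pass@k is usually estimated from n sampled solutions of which c are correct, using the standard unbiased estimator 1 − C(n−c, k)/C(n, k). A minimal sketch (function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct.

    1 - C(n - c, k) / C(n, k) is the probability that at least one
    of k samples drawn without replacement from the n is correct.
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so any draw of k hits a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 samples of which 1 is correct, pass@1 is 0.5.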
Negative Interference: The phenomenon where learning to solve a specific set of training problems reduces the likelihood of generating correct solutions for other problems
Winner-take-all: A dynamic where the model reinforces only the most probable solution strategies (the 'winners') and suppresses diverse but initially less probable valid strategies
On-policy sampling: Generating training data using the current version of the model policy, which biases learning toward what the model already knows well
Plasticity loss: The loss of a neural network's ability to learn new things or adapt to new distributions over time
PPO: Proximal Policy Optimization—a standard RL algorithm that constrains policy updates to be close to the previous policy
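The constraint in PPO comes from a clipped surrogate objective on the probability ratio between the new and old policy. A per-token sketch (names and the 0.2 clip range are illustrative defaults, not from this document):

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped PPO surrogate for one token/action.

    ratio: pi_new(a|s) / pi_old(a|s)
    Returns min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), which caps
    how much a single update can move the policy from the old one.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1+eps, so there is no incentive to push the policy further in one step.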
GRPO: Group Relative Policy Optimization—a PPO variant often used for reasoning that normalizes advantages within a group of sampled outputs for the same prompt
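The group normalization in GRPO replaces a learned value baseline: each output's reward is centered and scaled by the statistics of the group of samples for the same prompt. A minimal sketch (the zero-variance guard is our assumption):

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt.

    Normalizes each sampled output's reward by the mean and standard
    deviation of its group, so outputs are compared only to siblings
    sampled from the same prompt.
    """
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in group_rewards]
```

Note that with binary rewards, a group that is all-correct or all-wrong yields zero advantages: no learning signal for that prompt.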
Perplexity: A measurement of how well a probability model predicts a sample; lower perplexity means the model assigns higher probability to the observed text (it is less "surprised" by it)
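Concretely, perplexity is the exponential of the average negative log-likelihood per token. A minimal sketch over per-token probabilities:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    token_probs: the probability the model assigned to each observed token.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns probability 0.5 to every token has perplexity 2; assigning probability 1.0 everywhere gives the minimum perplexity of 1.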
KL regularization: A penalty term added to the loss function to prevent the learned policy from diverging too far from a reference policy (usually the base model)
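In RLHF-style training, this penalty is often applied by subtracting a scaled KL estimate from the task reward. A sketch using the simple per-token log-ratio estimator (the function name and the beta value are illustrative):

```python
def kl_penalized_reward(reward: float,
                        policy_logprobs: list[float],
                        ref_logprobs: list[float],
                        beta: float = 0.05) -> float:
    """Subtract a sequence-level KL estimate from the task reward.

    The KL term is estimated as the sum of per-token log-ratios
    log pi(t) - log ref(t); beta controls how strongly the policy
    is kept near the reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

When the policy matches the reference exactly, the penalty vanishes; the further the policy drifts, the more reward it must earn to compensate.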