RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using tasks where the final answer can be automatically checked (e.g., math, code)
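A minimal sketch of a verifiable reward function for the math case. The `Answer:` extraction format and the exact-match rule are assumptions for illustration, not a standard:

```python
def math_reward(completion: str, gold: str) -> float:
    """Return 1.0 iff the text after the final 'Answer:' marker exactly
    matches the reference answer, else 0.0 (format is an assumption)."""
    if "Answer:" not in completion:
        return 0.0  # no parseable final answer -> no reward
    final = completion.rsplit("Answer:", 1)[1].strip()
    return 1.0 if final == gold.strip() else 0.0
```

Because the check is automatic, such a function can score thousands of sampled completions with no human labeling.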
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines by averaging rewards from a group of outputs for the same question, avoiding a separate value network
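A minimal sketch of the group-relative advantage at the heart of GRPO, assuming the common mean/std normalization over a group of rewards; names are illustrative:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: each sampled output's reward is
    normalized against the group mean (and std), so the baseline comes
    from the group itself rather than a learned value network."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to the same question: 1 = correct, 0 = wrong
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantage, wrong ones with negative, and the group's advantages sum to zero.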
PPO: Proximal Policy Optimization—an RL algorithm that limits how much the policy changes in one step using a clipped surrogate objective
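A sketch of the clipped surrogate objective for a single action, assuming log-probabilities as inputs and the usual clip range of 0.2:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate: the probability ratio is clipped to
    [1 - eps, 1 + eps], and the min keeps the pessimistic bound, so one
    update step cannot move the policy too far from the old one."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the clip caps how much the action's probability can be pushed up; with a negative advantage, the `min` keeps the worse (unclipped) value, preserving a conservative lower bound.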
Entropy: A measure of uncertainty in a probability distribution; high entropy means the model is unsure which token to pick next
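Concretely, the Shannon entropy of a next-token distribution (in nats here; the variable names are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum p * log p, in nats.
    Higher entropy = the model is less sure which token comes next."""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = entropy([0.25] * 4)                # maximally unsure over 4 tokens
peaked = entropy([0.97, 0.01, 0.01, 0.01])   # nearly certain
```

The uniform case attains the maximum `log(4)`, while a distribution concentrated on one token approaches zero.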
Advantage Function: A function in RL that measures how much better a specific action is compared to the average action at that state
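In symbols, with $Q$ the action-value and $V$ the state-value under the same policy:

```latex
A(s, a) = Q(s, a) - V(s)
```

A positive $A(s, a)$ means action $a$ is better than the policy's average behavior at state $s$.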
Pass@K: The probability that at least one of K generated attempts solves the task correctly
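A sketch of the standard unbiased Pass@K estimator: given n samples per problem of which c are correct, it computes the probability that a random size-k subset contains at least one correct sample:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that all k drawn samples are incorrect."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Naively reporting the fraction of problems solved in exactly k tries has high variance; this estimator reuses all n samples per problem.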
Pivotal Tokens: Connectors like 'therefore', 'however', or 'first' that determine the logical flow of a reasoning chain
Reflective Actions: Self-correction behaviors where the model verifies its own previous steps (e.g., 'Wait, let me check that')
Detached Gradient: Stopping the backpropagation of error signals through a specific term, treating it as a constant during the gradient calculation