PPO: Proximal Policy Optimization—an RL algorithm that uses a clipped surrogate objective to ensure stable policy updates
GRPO: Group Relative Policy Optimization—an RL algorithm used by DeepSeek-R1 that estimates advantages by averaging rewards within a group of samples, avoiding a learned value function
GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' (how good an action is) by balancing bias and variance
KL regularization: Kullback-Leibler divergence penalty—usually added to the RL reward to keep the trained policy close to the initial (reference) policy; removed in this paper
Reasoner-Zero: A training paradigm where a base LLM is trained via RL directly to reason, without prior Supervised Fine-Tuning (SFT)
Credit assignment: The problem of determining which specific past actions (tokens) contributed to the final reward
Value function: A learned network (Critic) that predicts the expected future reward from a given state
Discount factor: Gamma—a parameter in RL that determines how much future rewards are valued compared to immediate ones
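Several of these quantities can be made concrete with a short sketch. The snippet below is a minimal illustration in plain Python, not the paper's implementation: the function names, the per-step reward/value layout, and the standard-deviation normalization in the GRPO variant are assumptions for exposition.

```python
from statistics import mean, pstdev

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation.

    rewards: list of T per-step rewards.
    values:  list of T+1 critic estimates (last entry is the value of
             the state after the final step).
    gamma (discount factor) weights future vs. immediate reward;
    lam trades off bias (low lam) against variance (high lam).
    """
    T = len(rewards)
    adv = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual: one-step advantage estimate at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

def ppo_clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO's clipped surrogate for one token.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - eps, 1 + eps] caps how far a single update can move the
    policy, which is what stabilizes training.
    """
    clipped_ratio = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sample's reward against
    its group's statistics, so no learned value function is needed."""
    m = mean(group_rewards)
    s = pstdev(group_rewards)
    return [(r - m) / (s + eps) for r in group_rewards]
```

Note how `gae_advantages` needs the critic's `values` (the learned value function), while `grpo_advantages` needs only the rewards of the sampled group; that trade is the point of GRPO.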