RLVR: Reinforcement Learning with Verifiable Rewards—a paradigm where the reward signal comes from an objective, automated check (e.g., executing generated code against unit tests, or verifying a math answer) rather than from a learned reward model
RLHF: Reinforcement Learning from Human Feedback—fine-tuning models using rewards derived from human preferences
PPO: Proximal Policy Optimization—a policy gradient algorithm that updates the policy in small, constrained steps to ensure stability
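The PPO entry above centers on the clipped surrogate objective, which caps how much a single update can move the policy. A minimal sketch for one action (the function name and default epsilon are illustrative; real implementations batch this over trajectories):

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """PPO clipped surrogate for a single action.

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - epsilon, 1 + epsilon] and taking the min keeps the update
    pessimistic, so large policy shifts earn no extra objective value.
    """
    clipped_ratio = max(min(ratio, 1.0 + epsilon), 1.0 - epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2, so the objective stops rewarding further movement in that direction; with a negative advantage, the min picks the more pessimistic (lower) value.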
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, eliminating the need for a separate value network
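GRPO's group-relative advantage estimate can be sketched in a few lines: score each of the group's sampled outputs, then standardize the rewards within the group (function name and the small epsilon guard are illustrative):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled output relative to its group.

    Standardizes rewards against the group mean and standard deviation,
    replacing the per-state value estimate a critic network would provide.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group with binary verifiable rewards such as [1, 0, 1, 0], correct outputs get advantage ≈ +1 and incorrect ones ≈ −1, with no value network involved.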
MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker
DQN: Deep Q-Network—a value-based RL algorithm that uses a neural network to approximate the Q-value function
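The Q-network in DQN is trained to regress toward the Bellman target for each transition. A minimal sketch of that target, assuming the next state's Q-values have already been computed (names and the default discount are illustrative):

```python
def dqn_target(reward, next_q_values, gamma=0.99, done=False):
    """Bellman target for one transition: r + gamma * max_a' Q(s', a').

    If the episode terminated, there is no bootstrap term and the
    target is just the reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)
```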
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark used to evaluate reasoning capabilities
GSM8K: Grade School Math 8K—a dataset of grade school math word problems
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) before applying RL
REINFORCE: A fundamental Monte Carlo policy gradient method that estimates the gradient of expected return by weighting action log-probabilities with the returns observed in sampled trajectories
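In practice REINFORCE is implemented as a surrogate loss whose gradient matches the policy-gradient estimate: the negative sum of log-probabilities weighted by (optionally baseline-adjusted) returns. A minimal sketch, with illustrative names:

```python
def reinforce_loss(log_probs, returns, baseline=0.0):
    """Monte Carlo policy-gradient surrogate loss.

    log_probs: log pi(a_t | s_t) for each sampled action.
    returns:   the return G_t following each action.
    Minimizing this loss ascends the expected return; subtracting a
    baseline reduces variance without biasing the gradient.
    """
    return -sum(lp * (g - baseline) for lp, g in zip(log_probs, returns))
```

Autodiff frameworks differentiate this loss with respect to the policy parameters that produced `log_probs`, recovering the REINFORCE update.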
TRPO: Trust Region Policy Optimization—a policy gradient method that constrains each update to a KL-divergence trust region around the current policy, preventing performance collapse from overly large steps
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
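For discrete distributions, KL divergence is the expectation under P of the log-ratio log(P/Q). A minimal sketch (illustrative name; note the measure is asymmetric, so KL(P‖Q) generally differs from KL(Q‖P)):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.

    Terms with p_i == 0 contribute nothing by the convention
    0 * log(0/q) = 0; q_i must be positive wherever p_i is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL is zero exactly when the two distributions match, which is why RLHF-style objectives use it as a penalty keeping the fine-tuned policy close to a reference model.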