PPO: Proximal Policy Optimization—an RL algorithm that updates policies in small, constrained steps to ensure stability
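The "small, constrained steps" in PPO come from clipping the ratio between the new and old policy's probabilities. A minimal sketch of that clipped surrogate loss (the function name and list-based inputs are illustrative, not from any particular library):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Illustrative PPO clipped surrogate loss over a batch of actions.

    logp_new / logp_old: per-action log-probabilities under the new and
    old policies; advantages: per-action advantage estimates.
    """
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        # Clip the ratio to [1 - eps, 1 + eps] so one update cannot
        # move the policy too far from the old one.
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) surrogate; negate to minimize.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

With identical policies the ratio is 1 and the loss is just the negative mean advantage; when the new policy drifts, the clip caps how much of the advantage it can claim.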
CoT: Chain of Thought—a prompting strategy where models generate intermediate reasoning steps before the final answer
GAE: Generalized Advantage Estimation—a method to estimate the 'advantage' of an action (how much better it is than average) by balancing bias and variance
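The bias-variance balance in GAE is controlled by a decay parameter λ applied to a backward sum of one-step TD residuals. A minimal sketch under the standard formulation (function name and list-based interface are assumptions for illustration):

```python
def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Illustrative Generalized Advantage Estimation.

    rewards[t]: reward at step t; values[t]: value-model estimate V(s_t).
    lam=0 recovers the one-step TD error (low variance, high bias);
    lam=1 recovers full Monte Carlo returns (high variance, low bias).
    """
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Discounted, lambda-weighted accumulation of future residuals.
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

For example, with gamma = lam = 1, zero value estimates, and rewards [1, 1], the advantages are just the remaining returns [2, 1].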
SFT: Supervised Fine-Tuning—training a model on labeled examples (prompt-response pairs) before applying RL
GRPO: Group Relative Policy Optimization—a PPO variant that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a learned value model
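GRPO's group-relative baseline can be sketched in a few lines: sample several outputs for one prompt, score each, then standardize the rewards within the group so no learned value model is needed (function name is illustrative; the zero-std guard is an assumption for the degenerate all-equal case):

```python
def group_relative_advantages(rewards):
    """Illustrative GRPO-style advantage: z-score each output's reward
    against the mean and std of its own group (same prompt)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0.0:
        # Identical rewards carry no relative signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

So a group of pass/fail rewards like [1, 0, 1, 0] yields advantages [1, -1, 1, -1]: above-average outputs are reinforced, below-average ones are penalized, with the group itself acting as the baseline.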
Value Model: A neural network that predicts the expected future reward from a specific state (sequence of tokens)
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark used to evaluate reasoning capabilities
RLHF: Reinforcement Learning from Human Feedback—fine-tuning LLMs using rewards derived from human or rule-based preferences
Olympiad-level math: Extremely difficult mathematics problems requiring complex, multi-step reasoning, typical of competitions like AIME or IMO