GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that estimates baselines from the average reward of a group of sampled outputs rather than using a separate critic model
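The group-baseline idea can be sketched in a few lines: rewards for a group of outputs sampled from the same prompt are normalized against the group's own mean and standard deviation, so no learned value function is needed. This is an illustrative sketch of the advantage computation only, not the full algorithm.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each reward against the group mean and standard deviation.

    The group average stands in for a critic's value estimate; outputs
    better than their siblings get positive advantages. (Sketch only.)
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Rewards for four sampled outputs to the same prompt (1 = correct, 0 = wrong):
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples end up with advantage +1 and incorrect ones with −1 here; the policy gradient then pushes probability toward the above-average outputs.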
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm that constrains policy updates to prevent instability
Chain-of-Thought: A prompting strategy where the model generates intermediate reasoning steps before the final answer
Program-of-Thought: A reasoning method where the model generates executable code (e.g., Python) to solve the problem
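A Program-of-Thought response looks like ordinary code: rather than carrying out the arithmetic in prose, the model emits a program whose output is the answer. The toy problem below is invented for illustration.

```python
# A Program-of-Thought answer to: "A shop sells pens at $5 each. Buying 4
# pens gets a 10% discount on the total. What do 4 pens cost?"
# The model writes code like this, and the executed result is the answer.
price_per_pen = 5
quantity = 4
subtotal = price_per_pen * quantity
total = subtotal * (1 - 0.10)  # apply the 10% discount
print(total)  # 18.0
```

Offloading the calculation to an interpreter avoids the arithmetic slips that plague purely textual chain-of-thought reasoning.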
Rejection Sampling Fine-Tuning (RFT): A method where the model generates multiple samples, correct ones are kept, and the model is fine-tuned on these correct samples
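The RFT data-collection loop can be sketched as follows. The `sample` and `is_correct` callables are hypothetical stand-ins for the model's sampler and an answer checker, not a real API; the toy demo cycles through canned answers so the filtering step is visible.

```python
import itertools

def rejection_sampling_dataset(problems, sample, is_correct, k=4):
    """For each problem, draw k samples and keep only the correct ones.

    The surviving (problem, answer) pairs form the fine-tuning set.
    """
    kept = []
    for problem in problems:
        for _ in range(k):
            answer = sample(problem)
            if is_correct(problem, answer):
                kept.append((problem, answer))
    return kept

# Toy demo: "sampling" cycles through candidate answers; the checker keeps
# only those matching the true sum.
candidates = itertools.cycle(["3", "4", "5"])
dataset = rejection_sampling_dataset(
    problems=["1+2", "2+2"],
    sample=lambda p: next(candidates),
    is_correct=lambda p, a: str(eval(p)) == a,
    k=3,
)
# The model would then be fine-tuned on `dataset`.
```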
DPO: Direct Preference Optimization—an alignment method optimizing policy based on preference pairs without explicit reward modeling
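The standard DPO objective for a single preference pair can be written directly: it is the negative log-sigmoid of the scaled margin between the policy's and the reference model's log-probability gaps on the chosen versus rejected responses. The scalar log-probabilities and the β value below are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w))
                     - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log sigmoid(margin), computed stably via log1p.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))

# With no preference margin the loss is log 2; widening the margin in
# favor of the chosen response drives the loss down.
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

No reward model appears anywhere: the implicit reward is the log-probability ratio against the reference policy.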
OpenWebMath: A publicly available dataset of high-quality mathematical web text
fastText: A library for efficient text classification and representation learning, used here to filter web pages
KL divergence: Kullback–Leibler divergence—an asymmetric measure of how far one probability distribution is from another, used as a penalty in RL to keep the trained model from deviating too far from the reference model
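For discrete distributions the divergence is a single sum, D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ); a minimal sketch:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability lists.

    Terms with p_i = 0 contribute nothing, so they are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions have zero divergence; it grows as Q drifts from P.
same = kl_divergence([0.5, 0.5], [0.5, 0.5])
drift = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

In the RL setting, P is the policy being trained and Q the frozen reference model, and the per-token divergence is subtracted from (or added as a penalty to) the reward so the policy cannot drift arbitrarily far while chasing reward.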