GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that eliminates the critic model by sampling a group of outputs for the same question and using their average reward as the baseline
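A minimal sketch of the group-relative baseline: each sampled output's reward is centered on the group mean and scaled by the group standard deviation to give a per-output advantage. The function name and the choice of population (rather than sample) standard deviation are illustrative assumptions, not the paper's exact implementation.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: baseline each reward against the
    mean of its group and scale by the group's std (population std
    here, as an illustrative choice; guard against zero std)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mean) / std for r in rewards]

# Four outputs sampled for the same question, scored 0/1 for correctness:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Because the baseline is just the group mean, no separate value network (critic) has to be trained or stored.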
PPO: Proximal Policy Optimization—a standard RL algorithm that uses a value function (critic) to stabilize policy updates
DeepSeekMath Corpus: A 120B token dataset of mathematical web pages mined from Common Crawl using an iterative fastText classifier
RFT: Rejection Sampling Fine-Tuning—sampling multiple outputs per problem from a model, then fine-tuning it on only the outputs verified to be correct
fastText: A library for efficient text classification and representation learning, used here to filter web pages
chain-of-thought: Prompting technique where the model generates intermediate reasoning steps before the final answer
program-of-thought: Prompting technique where the model generates executable code to solve the problem
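To make the contrast with chain-of-thought concrete, here is what a hypothetical program-of-thought completion might look like: instead of prose reasoning steps, the model emits executable code whose result is the answer. The problem and code are illustrative, not taken from the paper.

```python
# Hypothetical program-of-thought completion for the question
# "What is the sum of the first 100 positive integers?"
def solve():
    # The model offloads the arithmetic to the interpreter
    # rather than computing it in natural-language steps.
    return sum(range(1, 101))

print(solve())  # → 5050
```

Running the generated program with an interpreter avoids the arithmetic slips that chain-of-thought text is prone to.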
DPO: Direct Preference Optimization—optimizing a policy directly on preference pairs, without fitting an explicit reward model
Minerva: A large closed-source PaLM-based model fine-tuned on mathematical content
OpenWebMath: An open-source dataset of mathematical web pages, used as a seed for DeepSeekMath Corpus
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from another, used as a penalty in RL to prevent model drift
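For discrete distributions the divergence is KL(P‖Q) = Σ P(x)·log(P(x)/Q(x)): it is zero when the two distributions match and grows as they drift apart, which is what makes it usable as a drift penalty. A minimal sketch over probability vectors (the function name is an assumption for illustration):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Nonnegative; zero iff P and Q are identical.
    Terms with P(x) = 0 contribute nothing by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions incur no penalty; drift incurs a positive one.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))       # → 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0)   # → True
```

In RL fine-tuning, P would be the current policy's token distribution and Q the reference model's, so the penalty discourages the policy from drifting far from the reference.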
critic model: In RL, a model that estimates the value (expected future reward) of a state; GRPO removes this component