CRL: Curriculum Reinforcement Learning—training an agent on a sequence of tasks ordered by increasing difficulty.
API: Approximate Policy Iteration—a theoretical framework for analyzing RL algorithms that alternate between estimating value functions and updating policies.
Gaussian Scheduling: A proposed task-sampling method where the probability of selecting a task difficulty follows a Gaussian distribution whose mean shifts from easy to hard as training progresses.
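The Gaussian scheduling idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the linear mean shift, and the fixed `sigma` width are assumptions made for the example.

```python
import numpy as np

def gaussian_schedule(num_levels, step, total_steps, sigma=1.0):
    """Sampling distribution over difficulty levels 0..num_levels-1.

    The Gaussian mean moves linearly from the easiest level (0) to the
    hardest (num_levels - 1) as training progresses; sigma controls how
    concentrated sampling is around the current target difficulty.
    (Hypothetical sketch: the linear schedule and fixed sigma are
    assumptions, not taken from the source.)
    """
    mean = (step / total_steps) * (num_levels - 1)
    levels = np.arange(num_levels)
    weights = np.exp(-0.5 * ((levels - mean) / sigma) ** 2)
    return weights / weights.sum()

# Sample a task difficulty at a given training step.
rng = np.random.default_rng(0)
probs = gaussian_schedule(num_levels=5, step=50, total_steps=100)
task_difficulty = rng.choice(5, p=probs)
```

Early in training the distribution concentrates on easy levels; by the final step it concentrates on the hardest, with `sigma` trading off focus against exploration of neighboring difficulties.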
SFT: Supervised Fine-Tuning—training a model to imitate fixed input-output examples.
MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
Reward Hacking: When a model learns to exploit flaws in the reward function (e.g., giving short, trivial answers) rather than solving the actual task.
DeepSeek-R1: A recent family of reasoning models trained via reinforcement learning that served as inspiration for this work.