RLVR: Reinforcement Learning with Verifiable Rewards—RL where rewards are binary and determined by a deterministic verifier (e.g., math answer checker)
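A minimal sketch of what such a deterministic verifier might look like, assuming the final answer is the last token of the model's output (the function name and parsing rule here are illustrative, not from any specific RLVR implementation):

```python
from fractions import Fraction

def verify_math_answer(model_output: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 iff the final answer in the model's
    output matches the reference numerically, else 0.0 (hypothetical checker)."""
    try:
        # Illustrative parsing rule: treat the last whitespace-separated
        # chunk as the candidate answer, stripping trailing punctuation.
        candidate = model_output.strip().split()[-1].rstrip(".")
        return 1.0 if Fraction(candidate) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError, IndexError):
        # Unparseable output gets zero reward.
        return 0.0

print(verify_math_answer("The answer is 42", "42"))   # 1.0
print(verify_math_answer("I think it's 41.", "42"))   # 0.0
```

Because the checker returns only 0 or 1, it is deterministic and cheap, but it is also the source of the reward-sparsity problem defined below: a model that never produces a parseable correct answer never sees a positive reward.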
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same prompt, removing the need for a separate value network
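The core of the group-relative advantage estimate can be sketched as follows, assuming binary rewards from a verifier as in RLVR above (this is a simplified illustration of the normalization step, not a full GRPO training loop):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Estimate per-output advantages by normalizing each reward against
    the mean and std of its own group, so no value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # eps guards against division by zero when all rewards are identical.
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions sampled from the same prompt; two verified correct.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # approx [1.0, -1.0, 1.0, -1.0]
```

Note that when every output in the group receives the same reward (e.g., all wrong under a sparse-reward regime), all advantages collapse to zero and the group contributes no learning signal.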
Reward Sparsity: The phenomenon where the agent rarely receives a positive reward (e.g., always gets the answer wrong), making it difficult to learn
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
OOD: Out-of-Distribution—data samples that differ significantly from the training distribution
SFT: Supervised Fine-Tuning—training the model to imitate correct reference outputs
NuminaMath-1.5: A large-scale dataset of competition-level mathematics problems
Imitation Learning: Learning by mimicking a supervisor's demonstrated actions or traces