RLVR: Reinforcement Learning with Verifiable Rewards—training LLMs using binary rewards based on whether the final answer can be automatically checked as correct
Exploration bottleneck: A situation in RL where the agent rarely or never discovers a high-reward action (correct solution), preventing it from learning
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of outputs for the same input to reduce variance
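The group-relative normalization at the heart of GRPO can be sketched in a few lines: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. This is a minimal illustration of the normalization step only (the epsilon value and use of population standard deviation are implementation choices, not fixed by the definition):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of rollouts for the same prompt.

    Advantage = (reward - group mean) / (group std + eps).
    With binary verifiable rewards, correct rollouts get positive
    advantages and incorrect ones get negative advantages.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts for one prompt, binary rewards
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Note that if every rollout in the group gets the same reward (all correct or all incorrect), all advantages are zero—the learning signal vanishes, which is one way the exploration bottleneck above manifests in practice.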
SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
Rollout: A single execution of the current policy (model) to generate a full response from a prompt
Pass@k: A metric measuring the probability that at least one of k generated samples is correct
Curriculum Learning: A training strategy where the model learns from easy examples before progressing to harder ones
Distillation: Training a smaller 'student' model to mimic the behavior or outputs of a larger 'teacher' model
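The classic soft-target distillation objective (Hinton et al., 2015) matches the student's temperature-softened output distribution to the teacher's via KL divergence. A minimal sketch over raw logits (the temperature value here is illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the soft-target
    term of the distillation objective. Zero when the student's
    distribution exactly matches the teacher's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In LLM distillation on outputs only (as in "teacher generates CoT traces, student does SFT on them"), this reduces to ordinary cross-entropy on the teacher's sampled text rather than its full logit distribution.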
Sparse rewards: When the agent receives non-zero rewards very infrequently, making it difficult to determine which actions led to success