RLVR: Reinforcement Learning with Verifiable Rewards—a paradigm where models learn from outcomes (correct/incorrect) rather than human preference labels
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average of multiple samples for the same prompt
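The group-relative advantage can be sketched in a few lines. This is a minimal illustration, not the full GRPO loss; the function name is ours, and variants of GRPO differ on whether they also divide by the group standard deviation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sample's reward minus the mean
    reward of its group, normalized by the group standard deviation.
    All rewards come from samples for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# e.g. with binary verifier rewards for 4 samples of one prompt:
# grpo_advantages([1, 0, 1, 0]) -> [1.0, -1.0, 1.0, -1.0]
```

Note that if every sample in the group gets the same reward (all correct or all incorrect), every advantage is zero and the prompt contributes no learning signal, which is one motivation for the curriculum methods defined above.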
Online Curriculum Learning: A method where the difficulty of training data is assessed and adjusted in real-time based on the model's current performance
Adaptive Problem Restructuring: The process of modifying training problems (simplifying or diversifying) based on their assessed difficulty to improve learning utility
pass@1: The percentage of problems where the model generates the correct answer on its first attempt
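A sketch of the metric, assuming a list of per-problem booleans indicating whether the first sampled answer was verified correct (the function and argument names are illustrative):

```python
def pass_at_1(first_attempt_correct):
    """pass@1 over a problem set: the percentage of problems whose
    first generated answer was verified correct."""
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)

# e.g. 3 of 4 problems solved on the first attempt:
# pass_at_1([True, False, True, True]) -> 75.0
```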
KL regularization: A penalty term in the loss function that prevents the model's policy from drifting too far from a reference policy (usually the initial model)
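In practice the penalty is often computed from per-token log-probabilities under the current policy and the reference policy. The sketch below uses the simple first-order (k1) estimator; other estimators exist, and `beta` is an illustrative name for the penalty coefficient:

```python
def kl_penalty(logp_policy, logp_ref, beta=0.1):
    """KL penalty (k1 estimator) for one sequence: beta times the sum
    over tokens of log pi(token) - log pi_ref(token). A larger value
    means the policy has drifted further from the reference model."""
    return beta * sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))

# e.g. two tokens whose policy log-probs each exceed the reference by 0.5:
# kl_penalty([-1.0, -2.0], [-1.5, -2.5], beta=0.1) -> 0.1
```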
Rollout: The process of the model generating sequences (reasoning paths and answers) for a given set of prompts during RL training
Verifier: A deterministic function or system that checks if the model's final answer matches the ground truth
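A minimal verifier for numeric answers might look like the following; real verifiers (e.g. for math benchmarks) typically add answer extraction and more aggressive normalization, which we omit here:

```python
def verify(model_answer: str, ground_truth: str) -> bool:
    """Deterministic check: compare the model's final answer to the
    ground truth, treating numerically equal strings as a match."""
    a, b = model_answer.strip(), ground_truth.strip()
    try:
        return float(a) == float(b)  # "42" matches "42.0"
    except ValueError:
        return a == b  # fall back to exact string comparison

# verify(" 42 ", "42.0") -> True
# verify("forty-two", "42") -> False
```

The binary output of such a function is exactly the verifiable reward that RLVR trains against.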