RLVR: Reinforcement Learning with Verifiable Rewards—RL where the reward signal comes from a deterministic check (e.g., math answer correctness)
Recall phase: The RL training phase where the model optimizes reasoning paths within its existing capabilities, typically reducing entropy/exploration
Extend phase: The SFT training phase where the model learns new reasoning patterns from external teacher data, increasing the explorable space
GRPO: Group Relative Policy Optimization—a policy gradient method that normalizes advantages within a group of sampled outputs for the same query, removing the need for a separate value function
SFT: Supervised Fine-Tuning—training on labeled (query, response) pairs via maximum likelihood estimation
Entropy: A measure of uncertainty in the model's predictions; high entropy implies diverse exploration, low entropy implies confidence or collapse
Policy Shift: Adjusting the assumed probability that the offline policy assigns to its data (pi_offline) in the importance sampling weight, based on the model's current performance
Importance Sampling: A technique to estimate properties of a target distribution while sampling from a different distribution, using weights to correct for the difference
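The GRPO entry above can be illustrated with a minimal sketch of its group-relative advantage computation: sample several responses to one query, score each with a verifiable reward, and normalize within the group so no separate value function is needed. Function names, the group size, and the epsilon constant here are illustrative choices, not from the source.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize one group's rewards to zero mean / unit std.

    Each reward comes from a sampled response to the SAME query;
    the normalized value serves as that response's advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# e.g. 4 sampled answers to one math query; a verifiable reward
# of 1.0 means the deterministic check passed, 0.0 means it failed
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# correct answers receive positive advantage, incorrect ones negative
```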
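The entropy entry can likewise be made concrete: Shannon entropy over a next-token distribution is high when probability mass is spread out (diverse exploration) and low when it concentrates on one token (confidence or collapse). The example distributions below are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# a peaked (low-entropy, "collapsed") vs. a flat (high-entropy)
# next-token distribution over a 4-token vocabulary
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
# entropy(flat) = log(4) ≈ 1.386 nats, the maximum for 4 outcomes;
# entropy(peaked) is much smaller
```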
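As a sketch of the importance sampling entry: to estimate an expectation under a target distribution p while drawing samples from a different proposal q, weight each sample by w = p(x)/q(x). The Gaussians and sample count below are my own toy choices (they are not the pi_offline setting from the glossary, where the weight would instead be pi_current/pi_offline over response tokens).

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu):
    """Density of a unit-variance Gaussian centered at mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# Target p = N(1, 1), proposal q = N(0, 1); estimate E_p[x] (true value 1.0)
# while sampling only from q, correcting with weights w = p(x) / q(x).
samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]
weights = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in samples]
estimate = sum(w * x for w, x in zip(weights, samples)) / len(samples)
# estimate ≈ 1.0, the mean of the target distribution
```

The same correction underlies the Policy Shift entry: changing the assumed pi_offline directly rescales these weights, so data the current model already handles well can be down-weighted.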