RLVR: Reinforcement Learning with Verifiable Rewards—training models using binary pass/fail feedback on final answers (e.g., math problems)
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a separate value model
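The group-relative advantage estimate at the heart of GRPO can be sketched as follows. This uses the standard z-score normalization over a group's rewards; the exact normalization in any given implementation may differ:

```python
import statistics

def group_relative_advantages(rewards):
    """Estimate per-output advantages by normalizing each reward against
    the mean and std of the group sampled for the same prompt,
    removing the need for a separate value model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four sampled outputs for one prompt with binary pass/fail rewards (RLVR-style):
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Outputs that pass receive a positive advantage and failures a negative one, purely by comparison within the group.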
On-Policy: Learning strictly from data generated by the model's current policy
Off-Policy: Learning from data generated by a different policy (e.g., a stronger teacher model or historical data)
Importance Sampling: A technique to estimate properties of a target distribution using samples from a different proposal distribution by reweighting them
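A minimal illustration of importance sampling, using toy uniform densities chosen for this sketch (not taken from the paper). Each sample drawn from the proposal is reweighted by w = p(x)/q(x) so the estimate targets the desired distribution:

```python
import random

random.seed(0)

def importance_estimate(f, target_pdf, proposal_pdf, proposal_sample, n=100_000):
    """Estimate E_target[f(X)] from proposal samples, reweighting each
    sample x by the density ratio target_pdf(x) / proposal_pdf(x)."""
    total = 0.0
    for _ in range(n):
        x = proposal_sample()
        total += (target_pdf(x) / proposal_pdf(x)) * f(x)
    return total / n

# Toy setup: estimate the mean under Uniform(0, 1) while sampling
# from the broader proposal Uniform(0, 2).
est = importance_estimate(
    f=lambda x: x,
    target_pdf=lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0,
    proposal_pdf=lambda x: 0.5,
    proposal_sample=lambda: random.uniform(0.0, 2.0),
)
print(round(est, 2))  # close to 0.5, the mean under the target
```

In off-policy RL the same ratio appears as a per-token probability ratio between the current policy and the policy that generated the data.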
Mixed-Policy: Combining both on-policy rollouts (exploration) and off-policy demonstrations (guidance) in a single training batch or group
Policy Shaping: A proposed regularization technique that transforms importance sampling weights to prevent entropy collapse and encourage exploration of low-probability actions
Entropy Collapse: A failure mode where the model's output distribution becomes near-deterministic too early in training, reducing exploration and sample diversity
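Entropy collapse can be diagnosed by tracking the Shannon entropy of the policy's output distribution over training; a minimal sketch with hypothetical distributions:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a categorical output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A healthy, exploratory distribution vs. a collapsed, near-deterministic one:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # maximal for 4 outcomes: log(4) ~= 1.386
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17, close to zero
```

A sustained drop of this quantity toward zero signals that the policy has stopped exploring low-probability actions.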
DeepSeek-R1: A strong reasoning model used in this paper as the source of off-policy guidance traces