RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using RL where the reward is a binary correctness check (e.g., math answer is correct).
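A minimal sketch of a verifiable reward in the RLVR sense: a binary correctness check on a model response. The `Answer:` extraction convention here is hypothetical, chosen only for illustration; real pipelines use task-specific parsers or symbolic checkers.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the gold answer, else 0.0."""
    # Hypothetical convention: the final answer follows an "Answer:" marker.
    match = re.search(r"Answer:\s*(\S+)", response)
    predicted = match.group(1) if match else ""
    return 1.0 if predicted == gold_answer else 0.0
```

Because the reward is a hard 0/1 signal rather than a learned score, there is no reward model to hack; the trade-off is that partially correct reasoning earns nothing.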
forking tokens: High-entropy tokens that act as decision points in a reasoning chain, branching into different potential logical paths.
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—an RLVR algorithm that modifies GRPO by removing the KL penalty, decoupling the clip ranges (clip-higher), dynamically resampling to filter out prompt groups whose rewards are all identical, and computing the loss at the token level.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average, removing the need for a critic network.
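The group-relative advantage estimate at the core of GRPO can be sketched as follows. This is a simplified illustration of the normalization step only (reward minus group mean, divided by group standard deviation), not the full training objective.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage for each sample in a group: (r_i - group mean) / group std.

    No critic network is needed; the group of sampled completions for the
    same prompt serves as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards identical: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

With binary RLVR rewards, a group of [1, 0, 1, 0] yields advantages [1, -1, 1, -1]: correct samples are reinforced, incorrect ones suppressed, relative to the group average.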
PPO: Proximal Policy Optimization—a standard RL algorithm that limits policy updates to a small region to ensure stability.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
entropy: A measure of uncertainty in the model's next-token prediction distribution; high entropy means the model considers many possible next tokens.
token entropy: The entropy of the model's next-token distribution at position t, computed from the full softmax over the logits; it is a property of the distribution itself, independent of which token is actually sampled.
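Token entropy as defined above can be computed directly from a logit vector. A minimal sketch, using plain floats rather than a tensor library:

```python
import math

def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary.

    Depends only on the logits at position t, not on the sampled token.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Uniform logits over V tokens give the maximum entropy log(V); a sharply peaked distribution gives entropy near zero, i.e. the model is nearly certain of its next token.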
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning models.