RLVR: Reinforcement Learning with Verifiable Rewards—training models with rewards derived from automatically checkable outcomes (e.g., correct/incorrect answers), which often encourages long reasoning chains
ThinkingFree: An operation that transforms a query by appending a token sequence (like </think>) to explicitly discard the thinking/reasoning generation phase, forcing direct answer generation
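The ThinkingFree operation above can be sketched as a simple string transform; the exact token sequence and chat template are assumptions here, not the paper's implementation:

```python
# Illustrative sketch of the ThinkingFree transform: append a closing
# thinking tag so the model treats the reasoning phase as already over
# and generates the answer directly. The tag below is an assumption.
THINK_CLOSE = "</think>"

def thinking_free(query: str) -> str:
    """Return the query with the thinking phase explicitly closed."""
    return query + THINK_CLOSE

print(thinking_free("What is 2 + 2?"))  # "What is 2 + 2?</think>"
```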
TFPI: Thinking-Free Policy Initialization—a proposed training stage where the model is optimized using RL on ThinkingFree-transformed queries before standard long-CoT RL
DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization—a specific RLVR algorithm (a GRPO variant) used in this paper that adds dynamic sampling and decoupled clipping ranges
CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer
LRM: Large Reasoning Model—LLMs specifically trained (often via RL) to perform complex reasoning tasks
GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on the relative advantage of a group of outputs for the same input, removing the need for a critic model
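A minimal sketch of the group-relative advantage at the heart of GRPO: sample a group of outputs for the same input, score each, and normalize rewards within the group so no critic model is needed. Function names are illustrative, not from the paper:

```python
# Group-relative advantage: each rollout's reward is normalized by the
# mean and standard deviation of its group, replacing a learned critic.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# e.g. 4 rollouts for one query, rewarded 1.0 if correct else 0.0
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# → [1.0, -1.0, -1.0, 1.0]
```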
SFT: Supervised Fine-Tuning—training a model on labeled examples
pass@1: The probability that a single generated solution is correct
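In practice, pass@1 is commonly estimated by sampling several solutions per problem and averaging the correct fraction; this is a hedged sketch, not the paper's evaluation code:

```python
# Estimate pass@1 as the fraction of sampled solutions that are correct;
# with n independent samples this is an unbiased estimate of the
# probability that a single sample is correct.
def estimate_pass_at_1(num_correct: int, num_samples: int) -> float:
    return num_correct / num_samples

print(estimate_pass_at_1(3, 16))  # 3 of 16 rollouts correct → 0.1875
```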
rollout: The process of generating model responses during RL training to estimate rewards and gradients
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark
GPQA: Graduate-Level Google-Proof Q&A—a challenging benchmark of graduate-level science questions designed to resist simple lookup
LiveCodeBench: A benchmark for evaluating code generation capabilities