RLVR: Reinforcement Learning with Verifiable Rewards—using a binary correctness signal from an automated verifier (e.g., exact-match checking of the final answer) to train models via RL
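A minimal sketch of what such a binary reward can look like; `extract_answer` and the exact-match rule here are hypothetical stand-ins for whatever verifier a given setup uses:

```python
# Toy binary verifier reward, assuming exact-match answer checking.
# `extract_answer` is a hypothetical heuristic, not a real library call.
def extract_answer(response: str) -> str:
    """Pull the final token out of a model response as its 'answer'."""
    return response.strip().split()[-1]

def verifier_reward(response: str, reference: str) -> float:
    """Return 1.0 iff the extracted answer matches the reference, else 0.0."""
    return 1.0 if extract_answer(response) == reference else 0.0

print(verifier_reward("The answer is 42", "42"))  # 1.0
print(verifier_reward("The answer is 41", "42"))  # 0.0
```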
LRMs: Large Reasoning Models—LLMs specifically optimized to generate long reasoning trajectories (thoughts) before answering
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled responses to the same prompt, removing the need for a critic model
z-score: A statistical measure describing a value's relationship to the mean of a group, measured in terms of standard deviations
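The two entries above fit together: GRPO's advantage for each sampled response is (essentially) the z-score of its reward within the group. A minimal sketch, noting that real implementations differ in details such as population vs. sample standard deviation and the exact epsilon used:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """z-score each reward against its group's mean and std (GRPO-style).

    `rewards` holds one scalar reward per sampled response to the same
    prompt, so no learned critic is needed. `eps` guards against a
    zero-variance group (all responses rewarded equally).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some impls use sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses, two correct under a binary verifier reward:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# mean = 0.5, population std = 0.5, so advantages are roughly +1, -1, +1, -1
```

Correct responses get positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward the better responses in the group.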
Overthinking: The phenomenon where reasoning models generate excessively long, meandering, or repetitive thought processes that do not contribute to accuracy
SFT: Supervised Fine-Tuning—training a model on labeled input-output examples, typically as the stage preceding reinforcement learning
Test-time Scaling: The observation that allowing models to think longer (generate more tokens) during inference improves performance on complex tasks
REINFORCE: A basic policy gradient RL algorithm that updates model weights in the direction that increases the log-probability of sampled trajectories, weighted by their return (reward)
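The REINFORCE update can be sketched on a toy two-armed bandit with a softmax policy; this is an illustrative example (deterministic rewards, no baseline), not a production implementation:

```python
import math
import random

def reinforce_bandit(true_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: sample an action from a softmax
    policy over logits, then nudge logits along grad log pi(a) scaled
    by the observed reward. Real uses subtract a baseline to cut variance.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    probs = [0.5, 0.5]
    for _ in range(steps):
        # Softmax policy (shifted by max for numerical stability).
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Sample an action and observe its reward.
        a = 0 if rng.random() < probs[0] else 1
        r = true_rewards[a]
        # Policy gradient: grad log pi(a) w.r.t. logits = one_hot(a) - probs.
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return probs

# Arm 0 pays 1.0, arm 1 pays 0.0; the policy should concentrate on arm 0.
probs = reinforce_bandit([1.0, 0.0])
```

Because the reward is 0 for arm 1, only arm-0 samples produce an update, steadily widening the logit gap in arm 0's favor; GRPO can be viewed as REINFORCE with a group-normalized reward in place of the raw return.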