RLVR: Reinforcement Learning with Verifiable Rewards—using objective correctness signals (like math answers or passing code tests) to guide RL training.
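A verifiable reward can be as simple as checking a final numeric answer against the ground truth. A minimal hypothetical sketch (the function name and exact-match policy are illustrative, not from the source):

```python
# Hypothetical verifiable reward: numeric exact-match on the final answer.
# Fraction parses both decimal ("0.5") and ratio ("1/2") strings, so
# equivalent forms of the same number still earn the reward.
from fractions import Fraction

def math_reward(model_answer: str, gold_answer: str) -> float:
    """Return 1.0 if the two answers are numerically equal, else 0.0."""
    try:
        return 1.0 if Fraction(model_answer) == Fraction(gold_answer) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable output gets zero reward rather than an exception.
        return 0.0
```

For code tasks, the analogous reward would be 1.0 if the generated program passes all unit tests, 0.0 otherwise.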
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs for the same prompt, removing the need for a separate value network.
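The group-relative advantage can be sketched in a few lines: sample G outputs for one prompt, score each, and normalize rewards against the group's own statistics, so the group mean replaces a learned value baseline. A minimal sketch (function name and the mean/std normalization variant are assumptions):

```python
# Hypothetical sketch of GRPO-style advantages: for G sampled outputs
# of the same prompt, center each reward on the group mean and scale
# by the group standard deviation. No value network is involved.

def group_relative_advantages(rewards):
    """rewards: scalar rewards for the G outputs of one prompt."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    eps = 1e-8  # guards against division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]

advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary rewards like the example, correct outputs receive a positive advantage and incorrect ones a negative advantage, and the advantages sum to zero across the group.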
SFT: Supervised Fine-Tuning—training a model to mimic provided reference answers (demonstrations).
Reasoning Intensity: A score from 1 to 5, estimated by an LLM, that classifies how much a problem relies on multi-step derivation versus simple knowledge recall.
Model-based Verifier: An LLM used as a reward function to judge the correctness of free-form scientific answers where exact string matching fails.
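In practice this turns an LLM call into a binary reward. A hypothetical sketch, with the judge call stubbed out (the prompt template, function names, and CORRECT/INCORRECT protocol are illustrative assumptions):

```python
# Hypothetical LLM-as-judge reward. judge_fn stands in for a real
# LLM API call that takes a prompt string and returns the model's reply.

JUDGE_PROMPT = (
    "Question: {q}\n"
    "Reference answer: {ref}\n"
    "Model answer: {ans}\n"
    "Is the model answer equivalent to the reference? "
    "Reply with exactly CORRECT or INCORRECT."
)

def model_based_reward(q, ref, ans, judge_fn) -> float:
    """Return 1.0 if the judge LLM deems the answer correct, else 0.0."""
    verdict = judge_fn(JUDGE_PROMPT.format(q=q, ref=ref, ans=ans))
    return 1.0 if verdict.strip().upper() == "CORRECT" else 0.0

# Stubbed judge for illustration only:
r = model_based_reward("What is 2+2?", "4", "four", lambda p: "CORRECT")
```

This handles free-form phrasing ("four" vs. "4") that exact string matching would score as wrong, at the cost of inheriting the judge model's errors.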
Cold-start: The initial phase of training (usually SFT) used to give a model basic capabilities before starting reinforcement learning.
Distillation: Transferring knowledge from a larger/stronger 'teacher' model to a smaller 'student' model, typically via SFT on teacher outputs.
Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct.
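The metric reduces to a fraction over the evaluation set. A minimal sketch (the function name and the pluggable correctness check are assumptions):

```python
# Hypothetical Pass@1 computation: the fraction of problems whose
# first generated answer is judged correct.

def pass_at_1(first_answers, gold_answers, is_correct) -> float:
    """is_correct(answer, gold) -> bool decides each problem."""
    assert len(first_answers) == len(gold_answers)
    hits = sum(is_correct(a, g) for a, g in zip(first_answers, gold_answers))
    return hits / len(gold_answers)

# 1 of 2 first answers matches, so Pass@1 = 0.5 here.
rate = pass_at_1(["4", "9"], ["4", "8"], lambda a, g: a == g)
```

The correctness check can be exact match, a numeric comparison, or a model-based verifier, matching however the benchmark defines correctness.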