RLVR: Reinforcement Learning with Verifiable Rewards—using objective, programmatic feedback (correct/incorrect) to train models, typically for math or code, rather than human preference labels.
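A verifiable reward of this kind can be sketched in a few lines. This is an illustrative example, not any particular system's implementation: the answer-extraction heuristic (take the last whitespace-separated token) is a deliberately simple stand-in for whatever parser a real pipeline would use.

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the ground truth,
    else 0.0 -- an objective, programmatic signal with no human labeling.
    """
    # Simplistic extraction: treat the last token as the candidate answer.
    answer = model_output.strip().split()[-1].rstrip(".")
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("The total is 42.", "42"))  # -> 1.0
print(verifiable_reward("The total is 41.", "42"))  # -> 0.0
```

Because the check is binary and automatic, it scales to millions of training examples, which is what makes RLVR practical for math and code domains.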
Procedural Generation: Algorithmic creation of data where content is generated automatically based on parameters rather than manually authored.
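A minimal sketch of the idea, assuming a toy arithmetic domain: a generator function takes parameters (operand range, random seed) and emits a problem together with its known-correct answer, so the data never needs manual authoring and the ground truth comes for free.

```python
import random

def gen_addition_problem(max_operand, seed=None):
    """Procedurally generate one addition problem from parameters.

    Returns (question, answer); the answer is known by construction,
    so the same machinery can also serve as a verifiable reward source.
    """
    rng = random.Random(seed)  # seeding makes generation reproducible
    a = rng.randint(1, max_operand)
    b = rng.randint(1, max_operand)
    return f"What is {a} + {b}?", str(a + b)

question, answer = gen_addition_problem(max_operand=10, seed=0)
```

Varying the parameters (here, `max_operand`) is also the natural hook for curriculum learning: harder parameter settings yield harder problems.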
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy based on the relative performance of a group of outputs for the same input.
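The core of GRPO's "relative performance of a group" idea is that each sampled output's advantage is its reward normalized against the other outputs for the same input, in place of a learned value baseline. A minimal sketch of that normalization step (not a full trainer):

```python
def group_relative_advantages(rewards):
    """Compute GRPO-style advantages for one group of outputs sampled
    from the same prompt: (r_i - mean(r)) / (std(r) + eps).

    Outputs that beat the group average get positive advantage; outputs
    below it get negative advantage, with no separate value network.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    eps = 1e-8  # avoids division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Four samples for one prompt: two correct (reward 1), two incorrect (reward 0).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary verifiable rewards, as above, the correct samples receive advantage near +1 and the incorrect ones near -1, which is then used to weight the policy-gradient update.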
Zero-shot: Evaluating a model on a task without providing any examples (in-context demonstrations) of that task in the prompt.
Curriculum Learning: Training strategy where the difficulty of tasks increases progressively as the model improves, rather than random sampling.
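One simple way such a schedule can be realized is a threshold rule on recent success rate; the thresholds below are illustrative assumptions, not values from any particular system.

```python
def next_difficulty(level, recent_accuracy,
                    promote_at=0.8, demote_at=0.3, max_level=10):
    """Adjust the difficulty level from the model's recent success rate.

    Promote when the model has mostly mastered the current level,
    demote when it is failing badly, otherwise stay put -- so difficulty
    tracks model capability instead of being sampled at random.
    """
    if recent_accuracy >= promote_at and level < max_level:
        return level + 1
    if recent_accuracy <= demote_at and level > 0:
        return level - 1
    return level

print(next_difficulty(3, 0.9))  # -> 4 (mastered: move up)
print(next_difficulty(3, 0.2))  # -> 2 (struggling: move down)
print(next_difficulty(3, 0.5))  # -> 3 (learning: stay)
```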
ARC: Abstraction and Reasoning Corpus—a benchmark of grid-based visual logic puzzles that test abstract pattern inference, often challenging for text-only models.
Outcome-based feedback: Reward signals based solely on whether the final answer is correct, without evaluating the intermediate reasoning steps.
GSM8K: A benchmark of grade-school math word problems.
MMLU-Pro: A massive multitask benchmark covering diverse academic and professional subjects.