RLVR: Reinforcement Learning from Verifiable Rewards—training models using objective success signals (like passing tests) rather than human preference labels
Pass@k: The probability that at least one of k sampled solutions to a problem is correct
SWE-Bench Verified: A human-validated subset of the SWE-bench benchmark for evaluating software engineering agents, built from real-world GitHub issues paired with unit tests that check whether the issue is resolved
DPO: Direct Preference Optimization—an algorithm that optimizes a language model to prefer winning responses over losing ones without explicitly training a separate reward model
scaffold: The fixed code structure or logic flow that manages the agent's interaction with the environment (e.g., parsing outputs, executing tools)
trajectories: The sequence of actions and observations generated by an agent while attempting to solve a task
instruct-tuning: Supervised fine-tuning (SFT) of a model on instruction-response pairs to teach it to follow directions
Best-of-N: An inference strategy where N solutions are generated and the best one is selected (often by a reward model)
unit tests: Automated tests that verify whether a specific part of the software works as expected; used here as the ground-truth reward signal
SFT: Supervised Fine-Tuning—training a model on labeled examples
KL divergence: Kullback-Leibler divergence, a statistical measure of how one probability distribution differs from a reference distribution; used here as a penalty that keeps the trained model from drifting too far from its initial (reference) policy
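
As an illustration of the Pass@k entry above, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function name `pass_at_k` and its argument names are assumptions made for this example, and the source does not state that this exact estimator was used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Hypothetical helper: computes 1 - C(n - c, k) / C(n, k), the chance
    that a random size-k subset of the n samples contains a correct one.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of which pass the tests
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.4f}")
```

With k = 1 the estimate reduces to the plain success rate c / n, and it rises toward 1 as k grows, which is why Pass@k results are always reported together with the value of k.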