pass@1: The empirical probability that a task is solved in a single attempt, estimated here as the mean resolution rate across multiple runs
pass@k: An optimistic metric estimating the probability that at least one of k attempts succeeds (measures potential)
pass^k: A pessimistic metric estimating the probability that all k attempts succeed (measures robustness/consistency)
scaffold: The software framework wrapping the LLM that handles tool execution, environment interaction, and memory management (e.g., nano-agent, R2E-Gym)
trajectory: The complete linearized sequence of all messages in an agent's run, including user prompts, model reasoning, tool calls, and environment outputs
autoregressive conditioning: The process by which an LLM generates each token conditioned on all preceding tokens; a small change early in the sequence can drastically alter everything generated after it
temperature: A hyperparameter controlling the randomness of LLM output; higher values increase diversity, while a value of 0 corresponds to greedy decoding, which is deterministic in theory
SOTA: State of the art, i.e., the current best-performing models or methods
SWE-Bench-Verified: A benchmark for evaluating LLMs on real-world software engineering issues drawn from GitHub repositories, restricted to issues that human annotators have verified to be solvable
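
As a rough illustration of how the three pass metrics defined above relate for a single task, the sketch below computes them from n recorded runs of which c succeeded. The glossary does not say which estimators are used beyond pass@1 being a mean over runs; the combinatorial forms for pass@k and pass^k are commonly used estimators and are an assumption here, as are the function names.

```python
from math import comb


def pass_at_1(outcomes):
    """Mean resolution rate over repeated runs of one task (True = solved)."""
    return sum(outcomes) / len(outcomes)


def _check(n, c, k):
    # Guard against inputs for which the estimators are undefined.
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n, c, k):
    """Assumed estimator: chance that at least one of k runs, drawn
    without replacement from n recorded runs with c successes, succeeds."""
    _check(n, c, k)
    # comb(n - c, k) counts the draws made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Assumed estimator for pass^k: chance that all k runs, drawn
    without replacement from the same n recorded runs, succeed."""
    _check(n, c, k)
    # comb(c, k) counts the draws made up entirely of successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    runs = [True, False, False, True, False, False, False, True, False, False]
    n, c = len(runs), sum(runs)
    print("pass@1:", pass_at_1(runs))
    print("pass@2:", pass_at_k(n, c, 2))
    print("pass^2:", pass_hat_k(n, c, 2))
```

With k = 1 both estimators reduce to c/n, the pass@1 value; as k grows, pass@k rises toward 1 while pass^k falls toward 0, which is why the first reads as potential and the second as consistency. A benchmark-level score would average these per-task values over all tasks.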