On Randomness in Agentic Evals

📝 Paper Summary

Agentic AI Evaluation Benchmarking methodology

Single-run evaluations of AI agents are statistically unsound due to high variance (2.2–6.0%), requiring multiple runs and new metrics like pessimistic pass^k to distinguish genuine progress from noise.

Core Problem

Standard agentic benchmarks (like SWE-Bench) typically report pass@1 scores from a single run, assuming determinism or negligible variance.

Why it matters:

Reported improvements of 2–3% often fall within the natural noise margin, leading to false claims of algorithmic progress
Deployment decisions affecting millions of users are based on unreliable leaderboards that may reflect lucky seeds rather than model capability
Even at temperature 0, non-determinism in inference engines and environments persists, making single-run scores irreproducible

Concrete Example: In one run, an agent searching for a 'Paginator' class searches a specific file and applies a patch to the wrong location (fail). In a second run of the exact same agent/task, a slight phrasing difference leads it to search the whole directory, find the correct location, and succeed. A single-run eval would randomly report either 0% or 100% for this task.

Key Novelty

Quantification of Evaluation Noise in Agentic Systems

Conducts a large-scale empirical study (60,000 trajectories) to measure the 'noise floor' of agent benchmarks, revealing that single-run scores vary by up to 6 percentage points
Introduces the distinction between optimistic bounds (pass@k) and pessimistic bounds (pass^k) to characterize how much an agent relies on 'luck' (stochastic exploration)
Performs token-level analysis to pinpoint that trajectory divergence happens in the first 1% of tokens, cascading into completely different solution strategies via the butterfly effect

Evaluation Highlights

Single-run pass@1 estimates vary by 2.2 to 6.0 percentage points across runs for the same model-scaffold pair
Even at temperature 0 (greedy decoding), standard deviations exceed 1.5 percentage points due to system-level non-determinism
The gap between optimistic (pass@5) and pessimistic (pass^5) performance reaches up to 24.9 percentage points, showing high dependence on stochasticity

Breakthrough Assessment

7/10

Crucial meta-evaluation paper. While it doesn't propose a new model, it exposes a fundamental flaw in how the entire field measures progress, potentially invalidating many existing 'SOTA' claims.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of autonomous software engineering agents on issue-resolution tasks

Inputs: GitHub issue description and codebase

Outputs: Code patch that passes unit tests

Pipeline Flow

Input Issue -> Agent Loop (Model + Scaffold) -> Environment Interaction -> Trajectory Generation

System Modules

Agent Model

Generates reasoning traces, text responses, and tool calls based on context

Model or implementation: Evaluated: Qwen3-32B, DeepSWE-preview, Devstral-2-123B

Scaffold

Executes tool calls generated by the model and appends results to context

Model or implementation: Evaluated: nano-agent (minimal), R2E-Gym (feature-rich)

Novel Architectural Elements

This paper evaluates existing architectures rather than proposing a new one. The novelty lies in the analysis framework (comparing trajectories across 10 independent runs).

Limitations

Study limited to agentic coding domain (SWE-Bench); findings might differ for other agent tasks (e.g., web browsing)
Does not analyze the effect of context compaction or summarization (uses append-only context)
High cost of proposed evaluation: running 10x evaluations per task significantly increases compute requirements
Analysis focuses on open-weights/available models; closed-source API black boxes (like GPT-4) might have different variance properties

Reproducibility

60,000 trajectories collected. Artifacts (nano-agent scaffold) used are distinct from training scaffolds to ensure independence. Specific versions of models (Qwen3-32B, DeepSWE-preview, Devstral-2) and scaffolds (R2E-Gym) are explicitly listed.

📊 Experiments & Results

Evaluation Setup

Software engineering issue resolution on SWE-Bench-Verified

Benchmarks:

SWE-Bench-Verified (Agentic Coding (Issue Resolution))

Metrics:

Single-run resolution rate (r)
pass@1 (Mean resolution rate)
pass@k (Optimistic bound)
pass^k (Pessimistic bound / Consistency)
First token divergence position
Statistical methodology: Computed mean and standard deviation across 10 independent runs per configuration. Explicitly discusses statistical power analysis.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Standard deviation analysis reveals that single-run scores are highly unreliable, with variance often exceeding the 'improvements' reported in many papers.
SWE-Bench-Verified	pass@1 Range	28.8	32.4	3.6
SWE-Bench-Verified	pass@1 Range	21.4	26.4	5.0
Temperature 0 analysis shows that 'deterministic' settings are a myth in practice due to system-level noise, and variance remains high.
SWE-Bench-Verified	Standard Deviation	0.0	1.0	+1.0
SWE-Bench-Verified	Standard Deviation	0.0	1.8	+1.8
Pass@k vs Pass^k analysis reveals the massive gap between potential performance (luck) and consistent performance (skill).
SWE-Bench-Verified	Score Gap (pass@5 - pass^5)	15.5	52.9	+37.4

Experiment Figures

Comparison of pass@k (optimistic) and pass^k (pessimistic) curves as k increases from 1 to 5.

Distribution of the first token divergence position (where two runs first differ) in absolute tokens and percentage of trajectory.

Main Takeaways

Evaluation noise is non-negligible: A reported 2-3% improvement is statistically indistinguishable from noise when using single-run protocols.
Temperature 0 is not a fix: System-level non-determinism (floating point, parallelization) preserves variance even with greedy decoding.
Butterfly Effect: Trajectories diverge within the first 1% of tokens (often first 10-50 tokens), causing agents to adopt fundamentally different strategies early on.
Recommendation: Researchers must report mean/std over multiple runs (N=10 suggested) and use power analysis to justify sample sizes.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) sampling (temperature)
Familiarity with agentic loops (Observation-Reasoning-Action)
Basic statistics (mean, standard deviation, binomial distribution)

Key Terms

pass@1: The empirical probability that a task is solved in a single attempt, estimated here as the mean resolution rate across multiple runs

pass@k: An optimistic metric estimating the probability that at least one of k attempts succeeds (measures potential)

pass^k: A pessimistic metric estimating the probability that all k attempts succeed (measures robustness/consistency)

scaffold: The software framework wrapping the LLM that handles tool execution, environment interaction, and memory management (e.g., nano-agent, R2E-Gym)

trajectory: The complete linearized sequence of all messages in an agent's run, including user prompts, model reasoning, tool calls, and environment outputs

autoregressive conditioning: The process where an LLM generates the next token based on all previous tokens; small changes early in the sequence can drastically alter future outputs

temperature: A hyperparameter controlling the randomness of LLM output; higher values increase diversity, while 0 is theoretically deterministic (greedy decoding)

SOTA: State-of-the-Art—the current best performing models or methods

SWE-Bench-Verified: A benchmark for evaluating LLMs on real-world software engineering issues derived from GitHub repositories, verified to be solvable