
On Randomness in Agentic Evals

Bjarni Haukur Bjarnason, André Silva, Martin Monperrus
KTH Royal Institute of Technology
arXiv (2026)
Agent Benchmark

📝 Paper Summary

Agentic AI Evaluation · Benchmarking methodology
Single-run evaluations of AI agents are statistically unreliable: pass@1 varies by 2.2–6.0 percentage points across runs of the same setup, so multiple runs and additional metrics such as the pessimistic pass^k are needed to distinguish genuine progress from noise.
Core Problem
Standard agentic benchmarks (like SWE-Bench) typically report pass@1 scores from a single run, assuming determinism or negligible variance.
Why it matters:
  • Reported improvements of 2–3% often fall within the natural noise margin, leading to false claims of algorithmic progress
  • Deployment decisions affecting millions of users are based on unreliable leaderboards that may reflect lucky seeds rather than model capability
  • Even at temperature 0, non-determinism in inference engines and environments persists, making single-run scores irreproducible
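The noise argument above can be made concrete with a small sketch. It summarises repeated pass@1 scores for one agent and applies a deliberately crude check on whether a claimed improvement is larger than the run-to-run spread. The function names, the 2-standard-deviation threshold, and the scores are all illustrative assumptions; none of them come from the paper.

```python
from statistics import mean, stdev

def run_spread(run_scores):
    """Summarise pass@1 scores (in percent) from repeated runs of one agent."""
    return {
        "mean": mean(run_scores),
        "std": stdev(run_scores),                    # sample standard deviation
        "range": max(run_scores) - min(run_scores),  # best run minus worst run
    }

def exceeds_noise(baseline_runs, candidate_runs, n_std=2.0):
    """Crude check (an assumption, not the paper's test): is the mean
    improvement larger than n_std times the larger of the two std devs?"""
    base = run_spread(baseline_runs)
    cand = run_spread(candidate_runs)
    delta = cand["mean"] - base["mean"]
    noise = max(base["std"], cand["std"])
    return delta, delta > n_std * noise

# Made-up scores, purely illustrative: five runs per agent.
baseline = [50.0, 52.0, 48.0, 51.0, 49.0]
candidate = [52.0, 54.0, 50.0, 53.0, 51.0]

delta, clears_noise = exceeds_noise(baseline, candidate)
print(f"improvement = {delta:.1f} points, clears noise floor: {clears_noise}")
```

With these made-up numbers the 2-point improvement does not clear the threshold, which is the situation the bullets above warn about. A proper comparison would use a statistical test over many runs rather than this rule of thumb.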
Concrete Example: In one run, an agent looking for a 'Paginator' class searches a single file and applies its patch to the wrong location (fail). In a second run of the same agent on the same task, a slight difference in phrasing leads it to search the whole directory, find the correct location, and succeed. A single-run evaluation would score this task as either 0% or 100%, depending on which run it happened to observe.
Key Novelty
Quantification of Evaluation Noise in Agentic Systems
  • Conducts a large-scale empirical study (60,000 trajectories) to measure the 'noise floor' of agent benchmarks, revealing that single-run scores vary by up to 6 percentage points
  • Introduces the distinction between optimistic bounds (pass@k: at least one of k runs succeeds) and pessimistic bounds (pass^k: all k runs succeed) to characterize how much an agent relies on 'luck' (stochastic exploration)
  • Performs token-level analysis showing that trajectories diverge within the first 1% of tokens, after which small differences cascade into completely different solution strategies (a butterfly effect)
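As a rough illustration of the token-level analysis, the sketch below finds the first position at which two trajectories of the same agent on the same task differ, and expresses it as a fraction of the shorter trajectory. The helper names, the normalisation by the shorter length, and the toy token lists are assumptions made for illustration; the summary does not state how the paper defines or normalises the divergence point.

```python
def first_divergence(tokens_a, tokens_b):
    """Index of the first position where two token sequences differ.

    If one sequence is a prefix of the other, returns the length of the
    shorter sequence.
    """
    for i, (a, b) in enumerate(zip(tokens_a, tokens_b)):
        if a != b:
            return i
    return min(len(tokens_a), len(tokens_b))

def divergence_fraction(tokens_a, tokens_b):
    """Divergence point as a fraction of the shorter trajectory's length."""
    shortest = min(len(tokens_a), len(tokens_b))
    if shortest == 0:
        return 0.0
    return first_divergence(tokens_a, tokens_b) / shortest

# Toy trajectories (hypothetical): identical opening, then different actions.
run_a = ["I", "will", "open", "paginator.py"]
run_b = ["I", "will", "search", "the", "directory"]

idx = first_divergence(run_a, run_b)
frac = divergence_fraction(run_a, run_b)
print(f"first differing token at index {idx} ({frac:.0%} of the shorter run)")
```

Real trajectories run to many thousands of tokens, so a divergence in the first 1% means the two runs share only a short common prefix before their strategies separate.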
Evaluation Highlights
  • Single-run pass@1 estimates vary by 2.2 to 6.0 percentage points across runs for the same model-scaffold pair
  • Even at temperature 0 (greedy decoding), standard deviations exceed 1.5 percentage points due to system-level non-determinism
  • The gap between optimistic (pass@5) and pessimistic (pass^5) performance reaches up to 24.9 percentage points, showing high dependence on stochasticity
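The optimistic and pessimistic metrics can be sketched with the standard combinatorial estimators: given c successes in n runs of a task, pass@k estimates the chance that at least one of k sampled runs succeeds, and pass^k the chance that all k succeed. Whether the paper uses exactly these estimators is an assumption, and the per-task success counts below are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Optimistic: P(at least one of k sampled runs succeeds), c successes in n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no successful run.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Pessimistic (pass^k): P(all k sampled runs succeed), c successes in n runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) counts the k-subsets made up only of successful runs.
    return comb(c, k) / comb(n, k)

def benchmark_gap(successes_per_task, n, k):
    """Average both metrics over tasks and return (pass@k, pass^k, gap)."""
    optimistic = sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
    pessimistic = sum(pass_hat_k(n, c, k) for c in successes_per_task) / len(successes_per_task)
    return optimistic, pessimistic, optimistic - pessimistic

# Made-up example: 4 tasks, 5 runs each, number of successful runs per task.
successes = [5, 3, 0, 1]
opt, pes, gap = benchmark_gap(successes, n=5, k=5)
print(f"pass@5 = {opt:.2f}, pass^5 = {pes:.2f}, gap = {gap:.2f}")
```

In this toy example only the first task is solved in every run, so pass^5 counts one task while pass@5 counts three. A large gap between the two means many tasks are solved only some of the time.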
Breakthrough Assessment
7/10
Crucial meta-evaluation paper. While it doesn't propose a new model, it exposes a fundamental flaw in how the entire field measures progress, potentially invalidating many existing 'SOTA' claims.