Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

Zairah Mustahsan, Abel Lim, Megna Anand, Saahil Jain, Bryan McCann
You.com
arXiv (2025)
Agent Benchmark Factuality Reasoning

📝 Paper Summary

Agentic Benchmarking Evaluation Reliability Stochasticity in LLMs
The paper proposes using the Intraclass Correlation Coefficient (ICC) to decompose agent evaluation variance into task difficulty versus agent inconsistency, so that reported improvements reflect true capability rather than lucky sampling.
Core Problem
Current agentic benchmarks typically report a single accuracy number from one run, obscuring critical variance caused by model stochasticity, API instability, and prompt ambiguity.
Why it matters:
  • Unreliable sub-agents introduce brittleness into larger downstream systems
  • Without measuring variance, it is impossible to distinguish genuine capability improvements from random noise (lucky sampling)
  • Comparisons between agents risk overstating differences when standard errors and confidence intervals are ignored
Concrete Example: Per-question accuracy estimates on GAIA show substantial trial-to-trial inconsistency. An agent might pass a task in one run due to a lucky sample but fail in the next, yet standard leaderboards report only a single pass/fail rate, hiding this instability.
Key Novelty
Application of Intraclass Correlation (ICC) to Agent Evaluation
  • Decomposes total evaluation variance into two components: between-query variance (how much tasks differ in difficulty) and within-query variance (how inconsistent the agent is on the same task)
  • Uses ICC as a standardized reliability metric: high ICC indicates variance is mostly due to task difficulty (good), while low ICC signals noisy, inconsistent agent behavior
  • Provides empirical guidelines for resampling budgets (N items vs. T trials) to minimize variance under fixed computational costs
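The variance decomposition described above can be sketched in a few lines. This is a minimal illustration, assuming a one-way random-effects ICC estimated from ANOVA mean squares on a tasks × trials score table; the summary does not say which ICC variant the paper uses, and the function name icc_oneway and its return keys are hypothetical.

```python
from typing import Dict, Sequence


def icc_oneway(scores: Sequence[Sequence[float]]) -> Dict[str, float]:
    """One-way random-effects ICC for a table of scores.

    scores[i][j] is the score (e.g. 1 = pass, 0 = fail) of task i on trial j.
    Assumption: this ICC variant is chosen for illustration only.
    """
    n = len(scores)  # number of tasks (queries)
    t = len(scores[0]) if n else 0  # number of trials per task
    if n < 2 or t < 2 or any(len(row) != t for row in scores):
        raise ValueError("need >= 2 tasks, each with the same number (>= 2) of trials")

    task_means = [sum(row) / t for row in scores]
    grand_mean = sum(task_means) / n

    # Mean squares from a one-way ANOVA with tasks as the grouping factor.
    ms_between = t * sum((m - grand_mean) ** 2 for m in task_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, task_means) for x in row
    ) / (n * (t - 1))

    denom = ms_between + (t - 1) * ms_within
    icc = float("nan") if denom == 0 else (ms_between - ms_within) / denom

    return {
        # How much tasks differ in difficulty (clipped at zero).
        "between_query_var": max((ms_between - ms_within) / t, 0.0),
        # How inconsistent the agent is on the same task.
        "within_query_var": ms_within,
        "icc": icc,
    }


if __name__ == "__main__":
    # Four tasks, four trials each: two tasks are stable, two are flaky.
    runs = [[1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
    print(icc_oneway(runs))
```

With this estimator, an agent that always gives the same result on each task scores an ICC of 1, while one whose results vary only within tasks can produce a negative estimate, which is usually read as no reliability.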
Evaluation Highlights
  • Agentic tasks (GAIA) exhibit ICC values ranging from 0.304 to 0.774, a wider spread with a markedly lower floor than reasoning/retrieval tasks (FRAMES), which range from 0.4955 to 0.7118
  • ICC estimates converge with 8–16 trials per task for structured tasks and at least 32 trials per task for complex reasoning tasks
  • Allocating budget to more items (n=100) with fewer trials (T=4) yields 68% lower standard error than fewer items (n=10) with many trials (T=40)
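The items-versus-trials trade-off can be illustrated with a standard two-level variance model. This is a sketch under stated assumptions, not the paper's analysis: it assumes the variance of the mean score is between-query variance divided by n plus within-query variance divided by n × T, and it does not attempt to reproduce the 68% figure. The function name se_of_mean and the ICC value of 0.5 are hypothetical.

```python
import math


def se_of_mean(n_items: int, n_trials: int, icc: float, total_var: float = 1.0) -> float:
    """Standard error of the mean score under an assumed two-level model.

    Assumption: total variance splits into a between-query share (icc)
    and a within-query share (1 - icc), with independent items and trials.
    """
    if n_items < 1 or n_trials < 1 or not 0.0 <= icc <= 1.0:
        raise ValueError("need n_items >= 1, n_trials >= 1 and 0 <= icc <= 1")
    between = icc * total_var
    within = (1.0 - icc) * total_var
    # Between-query noise shrinks only with more items;
    # within-query noise shrinks with items times trials.
    return math.sqrt(between / n_items + within / (n_items * n_trials))


if __name__ == "__main__":
    # Same budget of 400 runs, split two ways (ICC of 0.5 chosen for illustration).
    wide = se_of_mean(n_items=100, n_trials=4, icc=0.5)
    deep = se_of_mean(n_items=10, n_trials=40, icc=0.5)
    print(f"100 items x 4 trials: SE = {wide:.4f}")
    print(f"10 items x 40 trials: SE = {deep:.4f}")
```

Under this model the two splits tie only when the ICC is zero; whenever tasks differ in difficulty, spreading the same budget over more items gives the smaller standard error.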
Breakthrough Assessment
8/10
Crucial methodological contribution. While not a new model architecture, it provides the statistical rigor missing from the agentic field, potentially transforming how leaderboards operate.