Stochasticity in Agentic Evaluations: Quantifying Inconsistency with Intraclass Correlation

📝 Paper Summary

Agentic Benchmarking, Evaluation Reliability, Stochasticity in LLMs

The paper proposes using Intraclass Correlation Coefficient (ICC) to decompose agent evaluation variance into task difficulty versus agent inconsistency, ensuring reported improvements reflect true capability rather than lucky sampling.

Core Problem

Current agentic benchmarks typically report a single accuracy number from one run, obscuring critical variance caused by model stochasticity, API instability, and prompt ambiguity.

Why it matters:

Unreliable sub-agents introduce brittleness into larger downstream systems
Without measuring variance, it is impossible to distinguish genuine capability improvements from random noise (lucky sampling)
Comparisons between agents risk overstating differences when standard errors and confidence intervals are ignored

Concrete Example: Per-question accuracy estimates on GAIA show substantial trial-to-trial inconsistency. An agent might pass a task in one run due to a lucky sample but fail in the next, yet standard leaderboards report only a single pass/fail rate, hiding this instability.

Key Novelty

Application of Intraclass Correlation (ICC) to Agent Evaluation

Decomposes total evaluation variance into two components: between-query variance (how much tasks differ in difficulty) and within-query variance (how inconsistent the agent is on the same task)
Uses ICC as a standardized reliability metric: high ICC indicates variance is mostly due to task difficulty (the agent behaves consistently on a given task), while low ICC signals noisy, inconsistent agent behavior
Provides empirical guidelines for resampling budgets (N items vs. T trials) to minimize variance under fixed computational costs
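
The decomposition above corresponds to the standard one-way random-effects model. A sketch in symbols follows; the notation ($y_{ij}$ for the outcome of trial $j$ on task $i$, $\mu$, $a_i$, $\varepsilon_{ij}$, $\sigma_b^2$, $\sigma_w^2$) is chosen here for illustration and is not necessarily the paper's own:

```latex
y_{ij} = \mu + a_i + \varepsilon_{ij},
\qquad \operatorname{Var}(a_i) = \sigma_b^2,
\qquad \operatorname{Var}(\varepsilon_{ij}) = \sigma_w^2

\operatorname{Var}(y_{ij}) = \sigma_b^2 + \sigma_w^2,
\qquad
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
```

Here $\sigma_b^2$ is the between-query variance and $\sigma_w^2$ the within-query variance. As an arithmetic illustration with made-up values, $\sigma_b^2 = 0.15$ and $\sigma_w^2 = 0.05$ give $\mathrm{ICC} = 0.15 / 0.20 = 0.75$.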

Evaluation Highlights

Agentic tasks (GAIA) exhibit ICC values of 0.304–0.774, a wider range with a lower floor than reasoning/retrieval tasks (FRAMES), which range from 0.4955–0.7118
ICC estimates converge effectively with T=8–16 trials for structured tasks and T≥32 for complex reasoning tasks
Allocating budget to more items (n=100) with fewer trials (T=4) yields 68% lower standard error than fewer items (n=10) with many trials (T=40)
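
The items-versus-trials trade-off can be checked with the textbook standard-error formula for the one-way random-effects model, SE² = σ_b²/n + σ_w²/(n·T). The sketch below is an illustration under that assumption, not the paper's simulation code; the function name and the variance values are hypothetical.

```python
import math

def standard_error(n_items: int, n_trials: int,
                   var_between: float, var_within: float) -> float:
    """SE of mean accuracy when n_items tasks are each run n_trials times."""
    return math.sqrt(var_between / n_items
                     + var_within / (n_items * n_trials))

# Same total budget of 400 runs, split two ways.
# Illustrative limiting case: all variance is between tasks (var_within = 0).
var_b, var_w = 0.20, 0.0
wide = standard_error(100, 4, var_b, var_w)   # many items, few trials
deep = standard_error(10, 40, var_b, var_w)   # few items, many trials
reduction = 1 - wide / deep
print(f"wide={wide:.4f} deep={deep:.4f} reduction={reduction:.1%}")
```

In this limiting case the ratio of standard errors is sqrt(10/100) ≈ 0.316, which is in line with the roughly 68% reduction quoted above. When within-task variance is non-zero the advantage of the wide design shrinks, because both designs share the same σ_w²/(n·T) term.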

Breakthrough Assessment

8/10

Crucial methodological contribution. While not a new model architecture, it provides the statistical rigor missing from the agentic field, potentially transforming how leaderboards operate.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of stochastic agentic systems where outcomes vary across independent trials

Inputs: Set of evaluation tasks (questions/prompts) and an Agent

Outputs: Reliability metrics (ICC), variance decomposition (between-task vs. within-task), and confidence intervals for accuracy

Pipeline Flow

Execute Agent on Benchmark Tasks (n items, T trials)
Collect Pass/Fail Outcomes
Calculate Variance Components (Between-Task vs. Within-Task)
Compute ICC and Confidence Intervals
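
The first two pipeline steps and the accuracy interval can be sketched as follows. This is a minimal illustration, not the paper's code: `run_trials`, `accuracy_with_ci`, and `toy_agent` are hypothetical names, the toy agent is a stand-in with a fixed pass probability per task, and the interval uses a normal approximation that treats tasks as the sampling unit.

```python
import math
import random
from typing import Callable, List, Tuple

def run_trials(agent: Callable[[int], bool],
               n_items: int, n_trials: int) -> List[List[int]]:
    """Run the agent n_trials times on each task; 1 = pass, 0 = fail."""
    return [[int(agent(task_id)) for _ in range(n_trials)]
            for task_id in range(n_items)]

def accuracy_with_ci(outcomes: List[List[int]],
                     z: float = 1.96) -> Tuple[float, Tuple[float, float]]:
    """Mean accuracy with a normal-approximation 95% CI over task means."""
    task_means = [sum(row) / len(row) for row in outcomes]
    n = len(task_means)
    mean = sum(task_means) / n
    var = sum((m - mean) ** 2 for m in task_means) / (n - 1)
    se = math.sqrt(var / n)
    return mean, (mean - z * se, mean + z * se)

# Hypothetical stand-in agent: each task has its own fixed pass probability.
rng = random.Random(0)
pass_prob = [rng.random() for _ in range(50)]

def toy_agent(task_id: int) -> bool:
    return rng.random() < pass_prob[task_id]

outcomes = run_trials(toy_agent, n_items=50, n_trials=8)
mean, (lo, hi) = accuracy_with_ci(outcomes)
print(f"accuracy={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Computing the standard error from task means rather than from all n·T outcomes pooled together avoids treating repeated trials of the same task as independent observations.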

System Modules

Agent Execution Loop

Runs the agent against the environment/tasks multiple times

Model or implementation: Various (e.g., Llama-3-8B-Instruct, GPT-4 based agents)

Statistical Analyzer

Computes ICC, standard errors, and confidence intervals

Model or implementation: Statistical formulas (ICC(1,1), McNemar's test)
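
As an illustration of what the Statistical Analyzer computes, the sketch below implements the standard ICC(1,1) estimator from one-way ANOVA mean squares, (MSB − MSW) / (MSB + (T − 1)·MSW). It assumes a complete table with the same number of trials per task; the function name `icc_1_1` and the example data are hypothetical, and the paper's repository may compute it differently.

```python
from typing import Sequence

def icc_1_1(outcomes: Sequence[Sequence[float]]) -> float:
    """ICC(1,1) from an n_items x n_trials table via one-way ANOVA."""
    n = len(outcomes)
    t = len(outcomes[0])
    if n < 2 or t < 2 or any(len(row) != t for row in outcomes):
        raise ValueError("need >= 2 items and the same number (>= 2) "
                         "of trials per item")
    task_means = [sum(row) / t for row in outcomes]
    grand = sum(task_means) / n
    # Between-task mean square: spread of task means around the grand mean.
    msb = t * sum((m - grand) ** 2 for m in task_means) / (n - 1)
    # Within-task mean square: spread of trials around their own task mean.
    msw = sum((x - m) ** 2
              for row, m in zip(outcomes, task_means)
              for x in row) / (n * (t - 1))
    denom = msb + (t - 1) * msw
    if denom == 0:
        raise ValueError("ICC is undefined when all outcomes are identical")
    return (msb - msw) / denom

# Perfectly repeatable agent: every task either always passes or always fails.
consistent = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
# Coin-flip agent: every task passes exactly half the time.
noisy = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
print(icc_1_1(consistent))
print(icc_1_1(noisy))
```

The consistent table yields an ICC of 1, while the noisy table yields a negative estimate (−1/3 here). Negative values can occur with this estimator when task means differ less than chance alone would predict, and are usually read as reliability near zero.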

Novel Architectural Elements

Integration of Psychometric Reliability Analysis (ICC) into Agentic Evaluation pipelines
Evidence-based resampling budget allocation strategy (optimizing n vs T)

Modeling

Base Model: Not specified for the case studies; the GAIA and FRAMES results refer only generically to 'models'

Compute: Not reported in the paper

Comparison to Prior Work

vs. Pass@k: ICC explicitly measures *reliability* and decomposes variance sources, whereas Pass@k reports only the chance that at least one of k attempts succeeds
vs. Chatbot Arena: Focuses on objective agentic task reliability (within-agent consistency) rather than pairwise preference Elo ratings
vs. Miller et al. (2024): Builds on their statistical treatment by introducing ICC specifically for characterizing agent inconsistency vs. task difficulty
vs. Typical Leaderboards: Replaces single-point estimates with confidence intervals and reliability metrics

Limitations

Low ICC doesn't always mean a bad benchmark; some tasks legitimately require exploration
Increasing trials (T) is computationally expensive
ICC calculation assumes tasks are a random sample from a larger population (One-way random effects model)

Reproducibility

Code: https://github.com/youdotcom-oss/stochastic-agent-evals

📊 Experiments & Results

Evaluation Setup

Multi-trial execution of agents on standard benchmarks to measure variance

Benchmarks:

GAIA: agentic capabilities (reasoning, tool use, multi-step)
FRAMES: retrieval and factuality (RAG scenarios)

Metrics:

Intraclass Correlation Coefficient (ICC)
Accuracy (with 95% Confidence Intervals)
Within-query variance
Between-query variance

Statistical methodology: ICC(1,1) one-way random effects model; Standard Error of the Mean for accuracy; McNemar's test for paired comparisons
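
For the paired comparison, a minimal sketch of the exact (binomial) form of McNemar's test is shown below. It assumes one pass/fail outcome per item for each agent; how multi-trial results are reduced to a single outcome per item (for example by majority vote) is left open here and is not taken from the paper. The function name `mcnemar_exact` is hypothetical.

```python
from math import comb
from typing import Sequence

def mcnemar_exact(passes_a: Sequence[bool],
                  passes_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes."""
    if len(passes_a) != len(passes_b):
        raise ValueError("both agents must be scored on the same items")
    # Only discordant pairs (one agent passes, the other fails) matter.
    a_only = sum(1 for a, b in zip(passes_a, passes_b) if a and not b)
    b_only = sum(1 for a, b in zip(passes_a, passes_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreement, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Agent A passes 10 items that agent B fails; both pass the other 5.
agent_a = [True] * 15
agent_b = [False] * 10 + [True] * 5
p_value = mcnemar_exact(agent_a, agent_b)
print(f"p = {p_value:.6f}")
```

Items on which both agents agree do not affect the p-value, which is why the test suits comparisons on a shared benchmark.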

Key Results

ICC analysis reveals that agentic tasks (GAIA) generally exhibit lower and more variable reliability than retrieval/reasoning tasks (FRAMES), reflecting the higher stochasticity in multi-step agent actions.

Benchmark	Metric	Baseline	This Paper	Δ
FRAMES	ICC Range	N/A	0.4955–0.7118	N/A
GAIA	ICC Range	N/A	0.304–0.774	N/A

Variance reduction analysis demonstrates that spreading compute budget across more items is more efficient than running many trials on fewer items.

Benchmark	Metric	Baseline	This Paper	Δ
Theoretical Simulation	Standard Error Reduction	1.0	0.32	-0.68

Experiment Figures

Per-question accuracy estimates with 95% confidence intervals sorted by difficulty

Standard Error of accuracy estimate as a function of trials (T) vs items (n) for a fixed budget

Main Takeaways

Evaluation stability varies significantly with task complexity: GAIA (agentic) is less stable than FRAMES (RAG).
Single-run accuracy is an unreliable metric for agentic systems; a reported accuracy gain is trustworthy only if ICC (consistency) improves as well.
Optimal evaluation strategy prioritizes maximizing the number of tasks (n) over the number of trials (T) until tasks are exhausted, to minimize standard error.
ICC convergence analysis suggests T=8–16 trials are needed for structured tasks, while T≥32 is needed for complex reasoning to get stable estimates.

📚 Prerequisite Knowledge

Prerequisites

Basic statistics (variance, standard error, confidence intervals)
Understanding of LLM sampling (temperature, top-p)
Familiarity with agentic benchmarks (GAIA, FRAMES)

Key Terms

ICC: Intraclass Correlation Coefficient—a statistic describing how strongly units in the same group resemble each other; here, it measures how consistent an agent's performance is across multiple trials of the same task

Agentic systems: LLM-based systems that use tools, interact with environments, and execute multi-step plans rather than just predicting next tokens

Between-task variance: Variance in scores caused by some tasks being inherently harder or easier than others

Within-task variance: Variance in scores caused by the agent behaving differently on the exact same task across repeated trials (inconsistency)

McNemar's test: A statistical test used on paired nominal data to determine if there is a significant difference between two agents' performance on the same set of items

Bootstrapping: A resampling method used to estimate standard errors and confidence intervals by repeatedly sampling from the observed data with replacement

MCP: Model Context Protocol—a standard for connecting AI assistants to systems and tools

SFT: Supervised Fine-Tuning—training a model on labeled examples

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

ANOVA: Analysis of Variance—a statistical method used to analyze the differences among group means in a sample