ICC: Intraclass Correlation Coefficient—a statistic describing how strongly units in the same group resemble each other; here, it measures how consistent an agent's performance is across multiple trials of the same task
Agentic systems: LLM-based systems that use tools, interact with environments, and execute multi-step plans rather than just predicting next tokens
Between-task variance: Variance in scores caused by some tasks being inherently harder or easier than others
Within-task variance: Variance in scores caused by the agent behaving differently on the exact same task across repeated trials (inconsistency)
McNemar's test: A statistical test for paired binary data (such as pass/fail outcomes) that determines whether two agents' success rates on the same set of items differ significantly, based only on the items where exactly one of the two agents succeeds
Bootstrapping: A resampling method used to estimate standard errors and confidence intervals by repeatedly sampling from the observed data with replacement
MCP: Model Context Protocol—a standard for connecting AI assistants to systems and tools
SFT: Supervised Fine-Tuning, the further training of a pretrained model on labeled input-output examples
RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents
ANOVA: Analysis of Variance—a statistical method used to analyze the differences among group means in a sample
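
The ICC, between-task variance, within-task variance, and ANOVA entries above fit together in one calculation. The following is a minimal sketch, assuming a balanced design (every task run the same number of times) and the one-way random-effects form often written ICC(1); the function name `icc_oneway` and the example scores are illustrative, not taken from any particular benchmark or library.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way ICC(1) for a balanced design.

    scores[i][j] is the score of trial j on task i (hypothetical layout).
    """
    n = len(scores)        # number of tasks
    k = len(scores[0])     # trials per task
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 tasks with the same number (>= 2) of trials each")

    task_means = [mean(row) for row in scores]
    grand_mean = mean(task_means)  # equals the overall mean because the design is balanced

    # Between-task mean square: how much task means differ from the grand mean.
    ms_between = k * sum((m - grand_mean) ** 2 for m in task_means) / (n - 1)

    # Within-task mean square: how much repeated trials differ from their task mean.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, task_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        return float("nan")  # every score identical: the ICC is undefined
    return (ms_between - ms_within) / denom


# Agent that always gets the same result on a given task: all variance is between tasks.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(icc_oneway(consistent))  # 1.0

# Agent whose results flip between trials while task means are equal: all variance is within tasks.
inconsistent = [[1, 0], [0, 1]]
print(icc_oneway(inconsistent))  # -1.0
```

An ICC near 1 means that differences in scores mostly reflect task difficulty, while values near or below 0 mean that trial-to-trial inconsistency dominates.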