MCP: Model Context Protocol, an open standard (2024) that gives LLMs a uniform interface for discovering and connecting to external tools and data sources
Oracle Mode: An evaluation setting where only the minimal set of tools required to solve the specific task is loaded into the model's context
Max-Scale Mode: An evaluation setting where all 65 MCPs (550+ tools) are loaded simultaneously, testing the model's ability to handle massive action spaces
Agentic: Describes AI systems that reason, plan, and execute multi-step actions to achieve a goal, rather than only generating text
SOTA: State of the art, the best performance currently achieved by leading models
Context Window: The maximum amount of text (measured in tokens) an LLM can process at one time; crucial here because the 550+ tool definitions occupy ~147k tokens
Hybrid Outcome-Based Evaluation: A scoring method that checks the final result (an LLM judge for text answers, scripts for file changes) rather than whether the model followed a specific sequence of steps
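
As a rough illustration of the Hybrid Outcome-Based Evaluation entry above, the Python sketch below scores a task by its final state only. Every name in it (Task, check_file, evaluate, stub_judge) is hypothetical and not taken from the benchmark's actual harness, and the LLM judge is replaced by a simple string check so the example runs offline.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Hypothetical sketch: these names are illustrative assumptions,
# not the benchmark's real evaluation code.


@dataclass
class Task:
    kind: str                            # "text" or "file"
    reference: str                       # expected answer or expected file contents
    output_path: Optional[Path] = None   # file to inspect for "file" tasks


def check_file(task: Task) -> bool:
    """Script-style check: compare the file's final contents to the reference."""
    if task.output_path is None or not task.output_path.exists():
        return False
    return task.output_path.read_text().strip() == task.reference.strip()


def evaluate(task: Task, final_answer: str,
             judge: Callable[[str, str], bool]) -> bool:
    """Score only the outcome; the sequence of tool calls is never inspected."""
    if task.kind == "file":
        return check_file(task)
    if task.kind == "text":
        return judge(final_answer, task.reference)
    raise ValueError(f"unknown task kind: {task.kind}")


def stub_judge(answer: str, reference: str) -> bool:
    """Stand-in for an LLM judge: case-insensitive containment check."""
    return reference.lower() in answer.lower()
```

Calling evaluate(Task(kind="text", reference="Paris"), "The capital is Paris.", stub_judge) passes the answer to the judge, while a task with kind="file" ignores the answer text and inspects the file on disk. In a real setup the judge argument would wrap a call to an LLM; it is kept as a plain callable here so the dispatch logic stays independent of any particular model API.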