Stateful Tools: Tools that inspect, depend on, or modify a persistent world state (e.g., a database or system setting) rather than just returning a static value.
Milestones: Critical events (e.g., specific tool calls or state changes) that must occur in a trajectory for a task to be considered successful.
Minefields: Events that must NOT occur in a trajectory (e.g., calling a specific tool when information is insufficient); violation results in a zero score.
User Simulator: An LLM (GPT-4o) prompted to act as the human user, providing inputs and feedback to the agent during evaluation.
Execution Context: The abstraction of the world state (variables, databases, settings) that tools read and modify during execution.
Canonicalization: The process of transforming natural-language arguments into the standardized format required by an API (e.g., 'next Friday' to '2024-05-24', resolved relative to the current date).
On-Policy Evaluation: Evaluating the agent by letting it interact dynamically with the environment/user, rather than grading it against a pre-recorded static transcript.
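
To make the terms Stateful Tools, Execution Context, Milestones, and Minefields concrete, the following is a minimal sketch rather than the benchmark's actual implementation. The names ExecutionContext, set_cellular, send_message, delete_contact, and score are hypothetical, and scoring by the fraction of milestones reached is an assumption of this sketch; the only behavior taken from the definitions above is that a minefield violation yields a score of zero.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionContext:
    """Hypothetical world state shared by all tools."""
    settings: dict = field(default_factory=lambda: {"cellular": False})
    tool_calls: list = field(default_factory=list)  # names of tools called so far


def set_cellular(ctx: ExecutionContext, on: bool) -> None:
    """Stateful tool: modifies the world state."""
    ctx.tool_calls.append("set_cellular")
    ctx.settings["cellular"] = on


def send_message(ctx: ExecutionContext, text: str) -> str:
    """Stateful tool: depends on the world state (cellular must be on)."""
    ctx.tool_calls.append("send_message")
    if not ctx.settings["cellular"]:
        raise RuntimeError("cellular service is off")
    return "sent"


def score(ctx: ExecutionContext, milestones: list, minefields: list) -> float:
    """Return 0.0 if any minefield fired, else the fraction of milestones met.

    The fractional score is an assumption made for illustration.
    """
    if any(check(ctx) for check in minefields):
        return 0.0
    if not milestones:
        return 1.0
    return sum(1 for check in milestones if check(ctx)) / len(milestones)


# Milestones: events that must occur in the trajectory.
milestones = [
    lambda c: c.settings["cellular"],           # a state change happened
    lambda c: "send_message" in c.tool_calls,   # a specific tool was called
]
# Minefields: events that must NOT occur.
minefields = [lambda c: "delete_contact" in c.tool_calls]

ctx = ExecutionContext()
set_cellular(ctx, True)
send_message(ctx, "hi")
print(score(ctx, milestones, minefields))  # 1.0
```

Expressing milestones and minefields as predicates over the context keeps the check independent of the exact order of tool calls, which suits on-policy evaluation where trajectories differ from run to run.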
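
Canonicalization can be illustrated with a small sketch, assuming a hypothetical helper canonicalize_date that handles only phrases of the form "next <weekday>" and interprets them as the first such weekday strictly after the reference date. The reference date 2024-05-20 used below is an assumption chosen so the result matches the glossary's example; the source does not state it.

```python
from datetime import date, timedelta

WEEKDAYS = {
    "monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
    "friday": 4, "saturday": 5, "sunday": 6,
}


def canonicalize_date(phrase: str, today: date) -> str:
    """Turn 'next <weekday>' into an ISO 8601 date string (YYYY-MM-DD).

    Hypothetical helper: anything other than 'next <weekday>' is rejected.
    """
    words = phrase.lower().split()
    if len(words) == 2 and words[0] == "next" and words[1] in WEEKDAYS:
        # Days until the target weekday, always in the range 1..7.
        days_ahead = (WEEKDAYS[words[1]] - today.weekday() - 1) % 7 + 1
        return (today + timedelta(days=days_ahead)).isoformat()
    raise ValueError(f"cannot canonicalize: {phrase!r}")


# Assumed reference date: Monday, 2024-05-20.
print(canonicalize_date("next Friday", date(2024, 5, 20)))  # 2024-05-24
```

The reference date is passed in explicitly because the canonical value depends on it; in a stateful setup it would plausibly be read from the execution context.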