ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

📝 Paper Summary

Benchmark, Tool-use agents, Multi-turn with user interactions

ToolSandbox is a Python-native benchmark for tool-use agents that evaluates stateful, conversational interactions using an LLM-simulated user and a flexible milestone-based scoring system.

Core Problem

Existing tool-use benchmarks rely on stateless web APIs, single-turn prompts, or fixed off-policy trajectories, failing to capture the complexity of real-world scenarios where tools depend on changing world states and users interact conversationally.

Why it matters:

Real-world tasks often involve implicit dependencies (e.g., turning on WiFi before searching) that current stateless benchmarks miss.
Static or single-turn evaluations cannot measure an agent's ability to correct errors or handle follow-up clarifications in a live session.
Fixed-trajectory benchmarks (like API-Bank) penalize valid alternative solutions that achieve the same goal through different steps.

Concrete Example: A user asks to 'send a message'. If cellular service is off, the tool fails. A capable agent must catch the error, turn on service (changing the world state), and retry. Current benchmarks would simply mark the initial failure as a zero or ignore the state dependency entirely.
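
The recovery behavior in this example can be sketched in a few lines of Python. This is a minimal illustration, not ToolSandbox's actual tool code: the `WorldState` class and the functions `set_cellular_service`, `send_message`, and `agent_send_with_recovery` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical world state; the fields are illustrative only."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular_service(state: WorldState, on: bool) -> None:
    """Tool that modifies the world state."""
    state.cellular_on = on


def send_message(state: WorldState, recipient: str, text: str) -> None:
    """Tool whose success depends on the current world state."""
    if not state.cellular_on:
        raise ConnectionError("Cellular service is off")
    state.sent_messages.append((recipient, text))


def agent_send_with_recovery(state: WorldState, recipient: str, text: str) -> None:
    """Stand-in for a capable agent: catch the error, fix the state, retry."""
    try:
        send_message(state, recipient, text)
    except ConnectionError:
        set_cellular_service(state, True)
        send_message(state, recipient, text)


state = WorldState()
agent_send_with_recovery(state, "Alice", "Running late")
```

A benchmark that only inspects the first `send_message` call would record a failure here, even though the final world state contains the sent message.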

Key Novelty

Stateful, Interactive World with Milestone-Based Scoring

Introduces 'Stateful Tools' where execution depends on a persistent world state (e.g., location, settings) that agents must track and modify.
Uses a 'Milestone and Minefield' evaluation system that scores trajectories based on reaching necessary intermediate states (DAG-based) rather than matching a rigid reference sequence.
Deploys a 'User Simulator' enhanced with Knowledge Boundaries and Demonstrations to enable on-policy conversational evaluation where the user reacts to the agent's specific actions.

Architecture

The overall workflow of the ToolSandbox evaluation framework, illustrating the interaction loop between User, Agent, and Execution Environment.

Evaluation Highlights

State Dependency tasks cause a massive performance drop: GPT-4o drops from ~85% on easier tasks to 42.1% on nested state dependency scenarios.
Insufficient Information scenarios reveal high hallucination rates: even top models often fail to recognize unsolvable tasks, and some models have pass rates near 0%.
Proprietary models (GPT-4o) outperform open-weights models (Llama-3-70B-Instruct) by large margins (e.g., +30-40%) on complex stateful tasks.

Breakthrough Assessment

8/10

Significantly advances tool-use evaluation by moving beyond static API calls to stateful, interactive environments. The milestone scoring system solves the 'multiple valid paths' problem in dialog evaluation.

⚙️ Technical Details

Problem Definition

Setting: Task-oriented dialog where an Agent interacts with a User and an Execution Environment to complete a task using defined Tools.

Inputs: Natural language user query and a set of available Python tools.

Outputs: A sequence of tool calls and natural language responses culminating in a final world state.

Pipeline Flow

User Simulator initiates conversation
Agent receives message and decides to call Tool or respond to User
Execution Environment executes Tool in Python Sandbox
Execution Context (World State) is updated
Cycle repeats until User signals completion
Milestone Scorer evaluates entire trajectory
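
The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions: the user and agent are scripted stand-ins rather than LLMs, the world state is a plain dict, and every name (`run_episode`, `scripted_user`, `scripted_agent`, `set_wifi`) is hypothetical rather than part of ToolSandbox. Scoring is left out; it would consume the returned trajectory.

```python
def run_episode(user, agent, tools, world, max_turns=10):
    """Run one conversation and return the recorded trajectory."""
    trajectory = []
    message = user(None)  # the user simulator opens the conversation
    trajectory.append(("user", message))
    for _ in range(max_turns):
        action = agent(message)
        if "tool" in action:
            # Execute the tool against the shared world state.
            result = tools[action["tool"]](world, **action["args"])
            trajectory.append(("tool", action["tool"], result))
            message = f"tool result: {result}"
        else:
            trajectory.append(("agent", action["say"]))
            message = user(action["say"])
            if message is None:  # the user signals completion
                break
            trajectory.append(("user", message))
    return trajectory


def scripted_user(agent_reply):
    """Stand-in user: makes one request, then ends after any agent reply."""
    return "Please turn on WiFi." if agent_reply is None else None


def scripted_agent(message):
    """Stand-in agent: calls the tool first, then reports back to the user."""
    if message.startswith("tool result"):
        return {"say": "WiFi is now on."}
    return {"tool": "set_wifi", "args": {"on": True}}


def set_wifi(world, on):
    world["wifi"] = on
    return "ok"


world = {"wifi": False}
trajectory = run_episode(scripted_user, scripted_agent, {"set_wifi": set_wifi}, world)
```

Because the user only reacts to what the agent actually said, the trajectory is produced on-policy rather than replayed from a fixed transcript.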

System Modules

User Simulator

Simulates human intent and responses; decides when task is done.

Model or implementation: GPT-4o

Agent

The LLM being evaluated; generates tool calls or text responses.

Model or implementation: Various (e.g., GPT-4o, Llama-3)

Execution Environment

Executes Python code for tools, manages World State.

Model or implementation: Python Interpreter (code.InteractiveConsole)

Milestone Scorer

Calculates final score based on Milestones (must happen) and Minefields (must not happen).

Model or implementation: Deterministic Algorithm
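
The Execution Environment above is listed as running on Python's `code.InteractiveConsole`. The sketch below shows only how that standard-library class keeps state between separately submitted lines, which is the property a stateful sandbox needs. The variable `wifi_enabled` and the function `set_wifi` are made up for this illustration and are not taken from ToolSandbox.

```python
import code

# One console instance holds one namespace for the whole session.
console = code.InteractiveConsole(locals={})

# Each push() submits one line, as if typed at an interactive prompt.
console.push("wifi_enabled = False")

# Multi-line definitions are buffered until a blank line closes them.
console.push("def set_wifi(on):")
console.push("    global wifi_enabled")
console.push("    wifi_enabled = on")
console.push("")

# A later "tool call" sees and modifies the state left by earlier lines.
console.push("set_wifi(True)")

current_wifi = console.locals["wifi_enabled"]
```

`push` returns `True` while the console is still waiting for more input (for example, in the middle of the function definition) and `False` once a complete statement has been executed.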

Novel Architectural Elements

Milestone/Minefield DAG-based evaluation logic: Decouples scoring from exact step-by-step matching, allowing flexible ordering of valid actions.
Stateful Execution Context: Explicit modeling of dependencies (e.g., WiFi status) that persist across turns.
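
To make the Milestone/Minefield idea concrete, here is a deliberately simplified scorer. It is a sketch under several assumptions and not the paper's algorithm: milestones are plain boolean predicates over trajectory events, each milestone is greedily matched to the earliest event that comes after all of its parent milestones, the score is the fraction of milestones matched, and any minefield hit forces a score of zero. The function `score_trajectory` and all event fields are hypothetical.

```python
from graphlib import TopologicalSorter


def score_trajectory(events, milestones, parents, minefields):
    """events: list of dicts; milestones: name -> predicate;
    parents: name -> set of milestone names that must be matched earlier;
    minefields: list of predicates that must never match."""
    if any(hit(e) for e in events for hit in minefields):
        return 0.0
    matched_at = {}
    graph = {name: parents.get(name, set()) for name in milestones}
    for name in TopologicalSorter(graph).static_order():
        deps = graph[name]
        if any(d not in matched_at for d in deps):
            continue  # a parent was never reached, so this one cannot count
        start = max((matched_at[d] for d in deps), default=-1) + 1
        for i in range(start, len(events)):
            if milestones[name](events[i]):
                matched_at[name] = i
                break
    return len(matched_at) / len(milestones)


milestones = {
    "cellular_on": lambda e: e["tool"] == "set_cellular" and e.get("on") is True,
    "contact_found": lambda e: e["tool"] == "search_contacts",
    "message_sent": lambda e: e["tool"] == "send_message",
}
# The message must come after both prerequisites; those two are unordered.
parents = {"message_sent": {"cellular_on", "contact_found"}}
minefields = [lambda e: e["tool"] == "delete_contact"]

path_a = [{"tool": "set_cellular", "on": True}, {"tool": "search_contacts"}, {"tool": "send_message"}]
path_b = [{"tool": "search_contacts"}, {"tool": "set_cellular", "on": True}, {"tool": "send_message"}]
too_early = [{"tool": "send_message"}, {"tool": "set_cellular", "on": True}, {"tool": "search_contacts"}]

score_a = score_trajectory(path_a, milestones, parents, minefields)
score_b = score_trajectory(path_b, milestones, parents, minefields)
score_early = score_trajectory(too_early, milestones, parents, minefields)
score_mine = score_trajectory(path_a + [{"tool": "delete_contact"}], milestones, parents, minefields)
```

Both `path_a` and `path_b` satisfy every milestone despite ordering the first two steps differently, which is the flexibility that exact sequence matching lacks. In `too_early`, the message is sent before its prerequisites, so only two of the three milestones are matched.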

Comparison to Prior Work

vs. BFCL: ToolSandbox is multi-turn, stateful, and interactive, whereas BFCL is single-turn and stateless.
vs. ToolEval: ToolSandbox uses deterministic Milestone matching for scoring instead of relying on an LLM judge for pass rates.
vs. API-Bank: ToolSandbox supports on-policy (live) conversational evaluation with a user simulator, whereas API-Bank uses fixed datasets.
vs. Tau-bench: ToolSandbox allows flexible trajectories via DAG milestones, whereas Tau-bench requires exact matching to a predetermined sequence.

Limitations

User Simulator reliance: Evaluation quality depends on GPT-4o's ability to simulate a user accurately (though mitigated by Knowledge Boundaries).
Complexity of annotation: Creating new scenarios requires expert annotation of Milestone DAGs and Python tool definitions.
Limited tool domains: Currently covers 11 specific domains (messaging, settings, etc.), which may not generalize to all tool-use contexts.

Reproducibility

Code: https://github.com/apple/ToolSandbox

The code is publicly available and includes the benchmark scenarios, the execution environment, and the user simulator prompts. Model weights for open-source baselines (Llama-3, Mistral) are hosted externally. Proprietary models (GPT-4, Claude) require API access.

📊 Experiments & Results

Evaluation Setup

Interactive tool-use scenarios across 1032 test cases.

Benchmarks:

ToolSandbox (Stateful, Conversational Tool Use) [New]

Metrics:

Success Rate (SR) based on Milestone completion
Minefield Avoidance Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the full ToolSandbox benchmark shows a significant gap between proprietary state-of-the-art models and open-weights models.
ToolSandbox	Success Rate (SR)	57.7	83.6	+25.9
ToolSandbox	Success Rate (SR)	63.7	83.6	+19.9
ToolSandbox	Success Rate (SR)	78.4	83.6	+5.2
Scenario-specific breakdowns reveal that State Dependency and Insufficient Information are the hardest categories.
ToolSandbox (State Dependency)	Success Rate (SR)	83.6	69.1	-14.5
ToolSandbox (Insufficient Information)	Success Rate (SR)	83.6	31.6	-52.0
ToolSandbox (Insufficient Information)	Success Rate (SR)	0.0	31.6	+31.6

Experiment Figures

A conceptual example of a State Dependency task.

Main Takeaways

Proprietary models (GPT-4o, Claude-3-Opus) significantly outperform open-source models (Llama-3, Mistral) on stateful tool use.
Implicit State Dependency is a major failure mode; models struggle to realize they must alter the world state (e.g., turn on WiFi) to enable other tools.
Insufficient Information scenarios are extremely challenging; most models tend to hallucinate tool calls rather than correctly identifying that the task is unsolvable.
Milestone-based evaluation proves robust for handling multiple valid trajectories compared to rigid sequence matching.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM tool use / function calling
Familiarity with task-oriented dialog systems
Basic knowledge of Directed Acyclic Graphs (DAGs) for dependency tracking

Key Terms

Stateful Tools: Tools that inspect, depend on, or modify a persistent world state (e.g., a database or system setting) rather than just returning a static value.

Milestones: Critical events (e.g., specific tool calls or state changes) that must occur in a trajectory for a task to be considered successful.

Minefields: Events that must NOT occur in a trajectory (e.g., calling a specific tool when information is insufficient); violation results in a zero score.

User Simulator: An LLM (GPT-4o) prompted to act as the human user, providing inputs and feedback to the agent during evaluation.

Execution Context: The abstraction of the 'World State' (variables, databases, settings) that is modified by tool execution.

Canonicalization: The process of transforming natural language arguments into a standardized format required by an API (e.g., 'next Friday' to '2024-05-24').
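
As a small worked example of canonicalization, the sketch below converts a phrase of the form "next <weekday>" into an ISO date. It rests on two assumptions not stated in the source: "next Friday" is read as the first Friday strictly after the reference date, and the reference date is Monday 2024-05-20 so that the result matches the '2024-05-24' example. The function name `canonicalize_next_weekday` is hypothetical.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def canonicalize_next_weekday(phrase: str, today: date) -> str:
    """Turn 'next <weekday>' into an ISO 8601 date string (YYYY-MM-DD)."""
    word, day_name = phrase.lower().split()
    if word != "next" or day_name not in WEEKDAYS:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    target = WEEKDAYS.index(day_name)
    # Number of days to the first matching weekday strictly after `today` (1 to 7).
    days_ahead = (target - today.weekday() - 1) % 7 + 1
    return (today + timedelta(days=days_ahead)).isoformat()


canonical = canonicalize_next_weekday("next Friday", date(2024, 5, 20))
```

An agent that passes the raw phrase "next Friday" to a tool expecting an ISO date would fail the call, which is why canonicalization is treated as a distinct skill.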

On-policy evaluation: Evaluating the agent by letting it interact dynamically with the environment/user, rather than grading it against a pre-recorded static transcript.