PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

📝 Paper Summary

Multi-call tool use with fixed plan, Tool profiling

Play2Prompt automatically generates tool-use examples and refines documentation by letting an LLM agent 'play' with tools through trial-and-error, enabling zero-shot tool use without human labeling.

Core Problem

Task LLMs struggle to use new tools zero-shot when documentation is noisy or incomplete, and existing optimization methods require labeled examples, which are unavailable in true zero-shot settings.

Why it matters:

Real-world users often provide minimal or poor documentation for custom tools, leading to hallucinations or syntax errors
Creating labeled tool-use demonstrations manually is unscalable for non-expert users
Current automatic prompt optimization techniques rely on seed examples, failing when no prior data exists

Concrete Example: If a user provides a tool `get_stock_price` with vague docs lacking required parameter formats, a standard zero-shot LLM might hallucinate parameters or fail. Play2Prompt trial-runs the tool until it works, then reverse-engineers a query like 'What is the price of AAPL?' to create a valid example.

Key Novelty

Play2Prompt (Zero-shot Tool Play Framework)

Systematically 'plays' with tools using trial-and-error to discover valid input parameters and observe outputs without any initial labeled data
Generates synthetic tool-use examples in reverse: first find a valid tool call, then generate a corresponding user query that would trigger it
Uses these synthetic examples as a validation set to refine the tool documentation itself via a beam search optimization process

Architecture

The Play2Prompt framework workflow, divided into Step 1 (Tool-Use Example Generation) and Step 2 (Tool Documentation Optimization).

Evaluation Highlights

+13.3 point accuracy improvement (from 73.33 to 86.67) on Berkeley Function-Calling Leaderboard (BFCL) using Llama-3.1-8B-Instruct compared to zero-shot baseline
Outperforms standard zero-shot prompting on StableToolBench across varying documentation quality levels (80%, 60%, 40% retained info); for example, pass rate rises from 57.8 to 75.0 at 40% retained info
Surpasses concurrent baseline Tool-Be-Honest by 8.4 points (78.27 vs. 86.67) on BFCL with Llama-3.1-8B-Instruct

Breakthrough Assessment

7/10

Clever application of 'tool play' (exploration) to solve the cold-start problem in tool use. Robust gains across open/closed models, but relies on the assumption that tools are safe to execute during exploration.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot tool utilization where the agent must use a set of tools F given only sub-optimal initial documentation D0 and no labeled examples

Inputs: User query x, initial tool documentation D0

Outputs: Refined documentation D and a set of synthesized few-shot examples E to guide the Task LLM

Pipeline Flow

Step 1: Tool-Use Example Generation (Reverse generation via Tool Play)
Step 2: Tool Documentation Optimization (Refinement using Step 1 examples as validation)

System Modules

Invocation Generator (Example Generation)

Explores valid tool parameters via trial-and-error 'play'

Model or implementation: Generator LLM (e.g., GPT-4o)
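
The trial-and-error loop can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: the body of `get_stock_price`, the `play` helper, and the scripted proposer that stands in for the Generator LLM are all hypothetical.

```python
from typing import Callable, Optional


def get_stock_price(symbol: str, currency: str = "USD") -> dict:
    """Hypothetical tool: accepts only an upper-case ticker symbol."""
    if not (symbol.isalpha() and symbol.isupper()):
        raise ValueError("symbol must be an upper-case ticker, e.g. 'AAPL'")
    return {"symbol": symbol, "currency": currency, "price": 123.45}


def play(tool: Callable[..., dict],
         propose: Callable[[list], dict],
         max_trials: int = 5) -> Optional[tuple]:
    """Try proposed arguments until the tool executes; feed errors back."""
    history = []  # (arguments, error message) pairs from failed trials
    for _ in range(max_trials):
        args = propose(history)
        try:
            output = tool(**args)
        except Exception as err:  # the error text is the feedback signal
            history.append((args, str(err)))
            continue
        return args, output  # first invocation that actually ran
    return None  # no valid invocation found within the budget


def scripted_proposer(history: list) -> dict:
    """Stand-in for the Generator LLM: a better guess after each error."""
    guesses = [{"symbol": "apple inc"}, {"symbol": "aapl"}, {"symbol": "AAPL"}]
    return guesses[min(len(history), len(guesses) - 1)]


result = play(get_stock_price, scripted_proposer)
```

In this toy run the first two guesses raise errors and the third executes, so `result` holds the validated arguments together with the observed output. A real system would pass the error history to the Generator LLM in a prompt rather than index into a fixed list.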

Query Generator (Example Generation)

Reverse-engineers a user query that matches the valid tool invocation

Model or implementation: Generator LLM (e.g., GPT-4o)
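
The reverse direction (validated call first, query second) might look like the following sketch. The `ToolUseExample` record, the template-based `write_query` stand-in for the Generator LLM, and the "query must mention every string argument" check are assumptions made for illustration; the paper's actual filtering criteria may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolUseExample:
    query: str       # written last, from the validated call
    tool_name: str
    arguments: dict  # arguments that executed successfully
    output: dict     # observed by actually running the tool


def write_query(tool_name: str, arguments: dict, output: dict) -> str:
    """Stand-in for the Generator LLM; a real system would prompt a model."""
    if tool_name == "get_stock_price":
        return f"What is the price of {arguments['symbol']}?"
    return f"Call {tool_name} with {arguments}"


def build_example(tool_name: str, arguments: dict,
                  output: dict) -> Optional[ToolUseExample]:
    query = write_query(tool_name, arguments, output)
    # Hypothetical sanity check: every string argument value must appear
    # in the query, so the call is recoverable from the query alone.
    missing = [v for v in arguments.values()
               if isinstance(v, str) and v not in query]
    if missing:
        return None
    return ToolUseExample(query, tool_name, arguments, output)


example = build_example("get_stock_price", {"symbol": "AAPL"}, {"price": 123.45})
```

Because the call was executed before the query was written, the resulting pair is correct by construction, which is the point of generating in reverse.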

Documentation Refiner

Rewrites tool documentation to maximize performance on synthetic examples

Model or implementation: Generator LLM (e.g., GPT-4o)
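
A beam search over candidate documentation strings can be sketched as below. This is a generic beam search under stated assumptions, not the paper's exact procedure: `rewrite` stands in for the Generator LLM proposing rewrites, `score` stands in for Task LLM performance on the synthetic examples, and the toy `FACTS` list is invented for the demonstration.

```python
from typing import Callable, List, Tuple


def beam_search_docs(initial_doc: str,
                     rewrite: Callable[[str], List[str]],
                     score: Callable[[str], float],
                     width: int = 2,
                     depth: int = 3) -> Tuple[str, float]:
    """Keep the `width` best-scoring docs at each of `depth` rounds."""
    beam = [(score(initial_doc), initial_doc)]
    best = beam[0]
    for _ in range(depth):
        # Expand every doc in the beam; the set drops duplicate rewrites.
        candidates = {doc for _, parent in beam for doc in rewrite(parent)}
        if not candidates:
            break
        scored = sorted(((score(d), d) for d in candidates), reverse=True)
        beam = scored[:width]
        if beam[0][0] > best[0]:
            best = beam[0]
    return best[1], best[0]


# Toy stand-ins: a doc scores higher the more required facts it states.
FACTS = ["symbol: upper-case ticker", "currency: ISO code",
         "returns: price as float"]


def toy_score(doc: str) -> float:
    return sum(f in doc for f in FACTS) / len(FACTS)


def toy_rewrite(doc: str) -> List[str]:
    return [doc + "\n" + f for f in FACTS if f not in doc]


best_doc, best_score = beam_search_docs(
    "get_stock_price: gets price", toy_rewrite, toy_score)
```

In the real setting each call to `score` means running the Task LLM over the validation examples, which is where the computational cost noted under Limitations comes from.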

Novel Architectural Elements

Reverse-generation pipeline: Validates tool execution *before* generating the corresponding user query
Adversarial optimization loop: Example generation seeks 'difficult' examples (low Task LLM performance), while documentation optimization seeks to maximize performance on those examples
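
One way to read the "difficult examples" half of this loop is as a selection step that keeps the examples the Task LLM currently solves least often. The sketch below assumes exact-match scoring against the validated call and a deterministic toy Task LLM; both are illustrative choices, not details taken from the paper.

```python
from typing import Callable, Dict, List


def solve_rate(example: Dict, task_llm: Callable[[str, int], Dict],
               n_samples: int = 4) -> float:
    """Fraction of sampled Task LLM calls that match the validated call."""
    hits = sum(task_llm(example["query"], seed) == example["call"]
               for seed in range(n_samples))
    return hits / n_samples


def hardest(examples: List[Dict], task_llm: Callable[[str, int], Dict],
            keep: int = 1) -> List[Dict]:
    """Keep the `keep` examples with the lowest current solve rate."""
    return sorted(examples, key=lambda ex: solve_rate(ex, task_llm))[:keep]


def toy_task_llm(query: str, seed: int) -> Dict:
    """Hypothetical Task LLM that usually forgets the currency argument."""
    if "euros" in query:
        if seed == 0:
            return {"symbol": "AAPL", "currency": "EUR"}
        return {"symbol": "AAPL"}
    return {"symbol": "AAPL"}


examples = [
    {"query": "What is the price of AAPL?",
     "call": {"symbol": "AAPL"}},
    {"query": "How much is Apple stock in euros?",
     "call": {"symbol": "AAPL", "currency": "EUR"}},
]
hard = hardest(examples, toy_task_llm)
```

Documentation optimization then pushes in the opposite direction, rewriting the docs so that the solve rate on these retained examples goes up.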

Modeling

Base Model: Evaluated on GPT-4o, GPT-3.5-Turbo, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct, Mistral-Nemo-12B-Instruct

Comparison to Prior Work

vs. Tool-Be-Honest: Play2Prompt explicitly executes tools ('plays') to verify correctness, whereas Tool-Be-Honest relies on hallucinated parameters that may fail execution
vs. APO: APO requires an existing labeled dataset to optimize against; Play2Prompt synthesizes its own validation set from scratch
vs. OPRO [not cited in paper]: OPRO optimizes prompts via black-box search but needs a metric/labeled set; Play2Prompt creates the metric targets (examples) dynamically

Limitations

Requires tools to be executable during the 'play' phase (risky for tools with side effects like 'delete_file')
Computationally expensive due to iterative LLM calls and tool executions during beam search
Performance depends on the Generator LLM's ability to self-reflect and generate coherent queries
Single-tool examples only: does not explicitly generate multi-tool chains, though the results show some transfer to multi-tool use

Reproducibility

Code: https://github.com/wfangtw/play2prompt

Code is publicly available on GitHub. The paper details the beam search parameters (width, depth); the specific prompts used for reflection and generation are implied to be in the codebase.

📊 Experiments & Results

Evaluation Setup

Zero-shot tool use evaluation on public benchmarks with degraded/noisy documentation

Benchmarks:

Berkeley Function-Calling Leaderboard (BFCL) (Function calling / Tool use)
StableToolBench (Complex tool use with real APIs)

Metrics:

Accuracy (Acc)
Pass Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on Berkeley Function-Calling Leaderboard (BFCL) showing improvements over Zero-Shot baseline and Tool-Be-Honest.
Benchmark	Metric	Baseline	This Paper	Δ
BFCL	Accuracy	73.33	86.67	+13.34
BFCL	Accuracy	78.27	86.67	+8.40
BFCL	Accuracy	87.05	90.00	+2.95

Ablation study on documentation quality (Doc Quality) using StableToolBench, showing robustness to missing information.
Benchmark	Metric	Baseline	This Paper	Δ
StableToolBench (Doc-40%)	Pass Rate	57.8	75.0	+17.2
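
The Δ column reports absolute differences in points, not relative percentages. A short check of the arithmetic, using only the numbers in the rows above (the `improvement` helper is just for illustration):

```python
# (benchmark, baseline, this paper) copied from the Key Results rows
rows = [
    ("BFCL", 73.33, 86.67),
    ("BFCL", 78.27, 86.67),
    ("BFCL", 87.05, 90.00),
    ("StableToolBench (Doc-40%)", 57.8, 75.0),
]


def improvement(base: float, ours: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(ours - base, 2)
    relative = round(100 * (ours - base) / base, 1)
    return absolute, relative


gains = [improvement(base, ours) for _, base, ours in rows]
for (name, base, ours), (absolute, relative) in zip(rows, gains):
    print(f"{name}: {base} -> {ours}, +{absolute} points ({relative}% relative)")
```

The absolute gains reproduce the Δ column; the relative gains are larger because every baseline is below 100 (the first BFCL row, for instance, is about 18% relative).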

Experiment Figures

Pass rates on StableToolBench across different documentation quality levels (100% to 0% retention) for various models.

Main Takeaways

Consistently outperforms standard Zero-Shot prompting and Tool-Be-Honest across both open (Llama, Mistral) and closed (GPT) models.
Highly effective at recovering performance when tool documentation is incomplete or noisy (Robustness to Doc Quality).
The 'Play' mechanism (execution-based verification) is critical; purely hallucinating examples without execution (like Tool-Be-Honest) is less effective.
The generated examples not only serve as few-shot prompts but also act as a validation set to refine the documentation itself.

📚 Prerequisite Knowledge

Prerequisites

Zero-shot prompting vs. Few-shot prompting
LLM Tool use / Function calling APIs
Beam search optimization

Key Terms

Zero-shot: The setting where the model handles a task without being given any labeled examples or demonstrations for that task

Beam search: A search algorithm that keeps only a fixed number of the most promising candidates (the beam width) at each step and expands those, rather than exploring every option

Self-reflection: A process where an LLM evaluates its own output or errors to generate feedback for improvement

Rejection sampling: A technique used here to filter generated tool parameters; the system tries parameters until the tool executes successfully

Task LLM: The language model responsible for answering the user's query and invoking tools (distinct from the Generator LLM used for optimization)

Generator LLM: The language model used within Play2Prompt to create synthetic examples and refine documentation