Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark

📝 Paper Summary

Tool-use post-training, Benchmark datasets, Synthetic data generation

Seal-Tools is a large-scale tool learning dataset constructed via a self-instruct method that features nested tool callings and enables precise, format-controlled evaluation of LLM agents.

Core Problem

Existing tool learning datasets suffer from limited scale, simple instances that can be solved without complex reasoning, duplicated content, and inaccurate evaluation methods (such as ChatGPT-based scoring) that result from a lack of strict format control.

Why it matters:

Current LLMs hallucinate when generating tool data, leading to unreliable training sets
Limited context length in generation leads to repetitive tools and simple queries
Existing benchmarks often lack nested tool calling scenarios (where one tool's output feeds another's input), which are critical for real-world agent complexity

Concrete Example: In ToolBench, nearly 34% of tools have no required parameters, making them too easy. A standard LLM generation approach might produce a simple query like 'check weather,' whereas Seal-Tools generates nested instances like 'Find the email of the author of book X,' requiring one tool to find the author and another to find the email using that name.

Key Novelty

Self-Instruct Pipeline for Nested Tool Data Generation

Uses a three-stage generation process (Field → Tool → Instance) to ensure diversity and reduce duplication compared to direct generation
Introduces 'nested instances' where tool calls form a directed acyclic graph (output of tool A becomes input of tool B), simulating complex real-world workflows
Enforces strict JSON output formats to enable deterministic, rule-based evaluation metrics rather than relying on unstable LLM-based judging
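
To make the nested-call idea concrete, here is a minimal sketch of tool calls treated as a dependency graph and executed in topological order. The tool names (`getBookAuthor`, `getAuthorEmail`), the `"API_call_<i>"` reference convention, and the stub implementations are assumptions for illustration and may not match the dataset's exact schema.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed convention: the string "API_call_<i>" means "output of call i".
calls = [
    {"api": "getBookAuthor", "parameters": {"title": "Book X"}},
    {"api": "getAuthorEmail", "parameters": {"author": "API_call_0"}},
]

def is_ref(value):
    """True if a parameter value points at an earlier call's output."""
    return isinstance(value, str) and value.startswith("API_call_")

def dependencies(calls):
    """Map each call index to the set of call indices it depends on."""
    return {
        i: {int(v.split("_")[-1]) for v in call["parameters"].values() if is_ref(v)}
        for i, call in enumerate(calls)
    }

def execution_order(calls):
    """Order call indices so dependencies run first.
    Raises graphlib.CycleError if the calls do not form a DAG."""
    return list(TopologicalSorter(dependencies(calls)).static_order())

def run(calls, tools):
    """Execute calls in dependency order, substituting earlier outputs."""
    outputs = {}
    for i in execution_order(calls):
        args = {
            k: (outputs[int(v.split("_")[-1])] if is_ref(v) else v)
            for k, v in calls[i]["parameters"].items()
        }
        outputs[i] = tools[calls[i]["api"]](**args)
    return outputs

# Stub tools for the demo only; real tools would hit an API.
tools = {
    "getBookAuthor": lambda title: f"Author of {title}",
    "getAuthorEmail": lambda author: f"{author.lower().replace(' ', '.')}@example.com",
}
results = run(calls, tools)
```

Because the second call depends on the first, any valid order must run `getBookAuthor` before `getAuthorEmail`; a cyclic reference would be rejected rather than silently executed.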

Evaluation Highlights

Seal-Tools finetuned model achieves 71.91% Argument F1 on the Test (Hard) split, significantly outperforming Llama-2-7b-chat (0.00%)
In nested tool calling scenarios, the finetuned model reaches 62.44% Argument F1, validating the dataset's effectiveness for complex logic
Standard models like Llama-2-7b-chat fail completely (0.00% across metrics) on this benchmark due to strict format requirements, highlighting the difficulty of the dataset

Breakthrough Assessment

7/10

Strong contribution in synthetic data generation for agents, particularly for nested tool calls. The strict evaluation metrics are a welcome shift from LLM-as-a-judge, though the method relies heavily on standard self-instruct patterns.

⚙️ Technical Details

Problem Definition

Setting: Tool learning for LLM agents, involving tool understanding, selection, and parameter filling

Inputs: User query and a set of available tools (definitions including name, description, parameters)

Outputs: Structured API calls (tool name and parameter values) in JSON format to solve the query
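
A rough sketch of what the inputs and outputs might look like, together with a strict validity check. The field names (`api_name`, `parameters`, `required`) and the `getWeather` tool are hypothetical; the dataset's actual schema may differ.

```python
import json

# Hypothetical tool definition; field names are illustrative only.
tool = {
    "api_name": "getWeather",
    "api_description": "Get the weather forecast for a city",
    "parameters": {
        "city": {"type": "str", "description": "City name", "required": True},
        "unit": {"type": "str", "description": "celsius or fahrenheit", "required": False},
    },
}

# What a well-formed model output could look like (assumed format).
model_output = '[{"api": "getWeather", "parameters": {"city": "Paris"}}]'

def parse_calls(text, tools_by_name):
    """Return the parsed list of calls, or None if the output is malformed,
    names an unknown tool, omits a required parameter, or invents one."""
    try:
        calls = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list):
        return None
    for call in calls:
        if not isinstance(call, dict) or not isinstance(call.get("api"), str):
            return None
        spec = tools_by_name.get(call["api"])
        params = call.get("parameters")
        if spec is None or not isinstance(params, dict):
            return None
        allowed = set(spec["parameters"])
        required = {n for n, p in spec["parameters"].items() if p["required"]}
        given = set(params)
        if not required <= given or not given <= allowed:
            return None
    return calls

tools_by_name = {tool["api_name"]: tool}
parsed = parse_calls(model_output, tools_by_name)
```

Rejecting anything that is not valid JSON against the tool definitions is what makes rule-based scoring possible without an LLM judge.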

Pipeline Flow

Field Generation: Generate diverse domains (fields/subfields)
Tool Generation: Create APIs for each subfield with parameters
Instance Generation: Create user queries and corresponding tool call chains (single, multiple, nested)
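
The three stages above can be sketched as a loop in which each stage is conditioned on the previous stage's output. This is a simplified illustration, not the authors' code: the `llm` callable, the prompt strings, and `stub_llm` are hypothetical, and the field/subfield hierarchy is flattened to a single level.

```python
from typing import Callable

def build_dataset(llm: Callable[[str], list], n_fields: int = 2) -> list:
    """Field -> Tool -> Instance generation with tool-name deduplication."""
    instances = []
    seen_tools = set()
    for field in llm(f"List {n_fields} application fields"):
        for tool in llm(f"Design tools for the field: {field}"):
            if tool["name"] in seen_tools:  # skip tools already generated elsewhere
                continue
            seen_tools.add(tool["name"])
            for inst in llm(f"Write a query and calls using tool: {tool['name']}"):
                instances.append({"field": field, "tool": tool["name"], **inst})
    return instances

def stub_llm(prompt: str) -> list:
    """Deterministic stand-in for the generator model (demo only)."""
    if prompt.startswith("List"):
        return ["weather", "books"]
    subject = prompt.split(": ", 1)[1]
    if prompt.startswith("Design"):
        # "shared_search" is emitted for every field to exercise deduplication.
        return [{"name": f"{subject}_lookup"}, {"name": "shared_search"}]
    return [{"query": f"Use {subject}", "calls": [{"api": subject, "parameters": {}}]}]

dataset = build_dataset(stub_llm)
```

Anchoring each prompt on a single field or tool keeps the context short, which is the stated motivation for staging the generation rather than asking for everything at once.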

System Modules

Field Generator (Data Construction)

Generate a hierarchical list of fields and subfields to ensure domain diversity

Model or implementation: ChatGPT

Tool Generator (Data Construction)

Generate API definitions (tools) for specific subfields

Model or implementation: ChatGPT

Instance Generator (Multi-step) (Data Construction)

Generate queries and tool invocation chains

Model or implementation: ChatGPT

Novel Architectural Elements

Hierarchical anchor-based generation: Uses Fields -> Tools -> Instances structure to enforce diversity and context validity
Two-step 'blank filling' generation for complex instances: Separates tool selection from parameter filling to enable reliable nested call generation
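
One way to picture the two-step "blank filling" idea: first fix the tool chain with empty parameter slots, then fill the slots in a second pass. The sketch below is an assumption about the general shape of such a procedure; in the real pipeline both steps would be performed by the generator model, and the helper names and data layout here are hypothetical.

```python
def make_skeleton(selected_tools):
    """Step 1: fix the tool chain, leaving every parameter value blank (None)."""
    return [
        {"api": t["name"], "parameters": {p: None for p in t["params"]}}
        for t in selected_tools
    ]

def fill_blanks(skeleton, values):
    """Step 2: fill each blank from `values`, keyed by (call index, parameter).
    A missing entry raises KeyError, so no blank can slip through unfilled."""
    return [
        {"api": call["api"],
         "parameters": {p: values[(i, p)] for p in call["parameters"]}}
        for i, call in enumerate(skeleton)
    ]

# Hypothetical tool selection for a nested query.
selected = [
    {"name": "getBookAuthor", "params": ["title"]},
    {"name": "getAuthorEmail", "params": ["author"]},
]
skeleton = make_skeleton(selected)

# Values a second generation pass might supply; "API_call_0" is an assumed
# convention meaning "output of the first call".
filled = fill_blanks(skeleton, {(0, "title"): "Book X", (1, "author"): "API_call_0"})
```

Separating the steps means the dependency structure is decided before any concrete values exist, so a nested reference only has to be placed into a slot that is already known to need one.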

Modeling

Base Model: Llama-2-7b-chat (as the foundation for Seal-Tools finetuning)

Training Method: Supervised Fine-Tuning (SFT)

Adaptation: Full fine-tuning (implied by context of 'finetuned model')

Trainable Parameters: Not reported in the paper

Training Data:

Seal-Tools dataset containing 54,676 training instances
Split into Train (54k), Test Easy (1.5k), Test Hard (1.3k)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128
epochs: 3
max_length: 4096
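
As a quick sanity check on these settings, the number of optimizer steps can be estimated from the reported training-set size. This assumes that `batch_size` is the effective global batch size and that the final partial batch is kept; neither detail is stated in the summary.

```python
import math

train_instances = 54_676  # training instances, as reported above
batch_size = 128          # assumed to be the effective global batch size
epochs = 3

# Assumes the last, partially filled batch of each epoch is still used.
steps_per_epoch = math.ceil(train_instances / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```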

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench: Seal-Tools focuses on single-turn complex calling (nested/parallel) rather than multi-turn chat; uses deterministic JSON metrics instead of ChatGPT-based 'Pass Rate'
vs. API-Bank: Seal-Tools is fully open-source (API-Bank training data is not public per paper) and significantly larger in tool count
vs. Gorilla [not cited in paper]: Gorilla focuses on retrieval-aware fine-tuning; Seal-Tools emphasizes the complexity of the reasoning chain (nested calls) and strict format compliance

Limitations

Dataset scale is constrained by funding, not reaching the theoretical limit of the method
Model-generated parameter entities (e.g., phone numbers) are 'made up' and may raise privacy/accuracy concerns if not filtered
Evaluation metrics are strict exact-match or overlap-based, which might penalize semantically correct but syntactically different valid responses
Relying on ChatGPT for data generation inherits biases and hallucination tendencies of the source model

Reproducibility

Code: https://github.com/fairyshine/Seal-Tools

Code and data are publicly available at https://github.com/fairyshine/Seal-Tools. The repository contains the dataset, generation code, and evaluation scripts. The specific prompt templates used for generation are described in the methodology section.

📊 Experiments & Results

Evaluation Setup

Agent tuning and evaluation on unseen tools/queries

Benchmarks:

Seal-Tools Test Set (Easy) (Single-tool calling) [New]
Seal-Tools Test Set (Hard) (Multiple and nested tool calling) [New]

Metrics:

Format Correctness (Format)
Tool Selection F1 (Action F1)
Parameter Argument F1 (Argument F1)

Statistical methodology: Not explicitly reported in the paper
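
The F1-style metrics listed above can be approximated with a small rule-based scorer. This is a sketch under assumptions: it scores one example at a time, matches tool names exactly for Action F1, and matches (tool, parameter, value) triples exactly for Argument F1. The paper's exact matching and aggregation rules may differ.

```python
from collections import Counter

def f1(pred, gold):
    """F1 over two multisets of hashable items."""
    pred_c, gold_c = Counter(pred), Counter(gold)
    true_pos = sum((pred_c & gold_c).values())  # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred_c.values())
    recall = true_pos / sum(gold_c.values())
    return 2 * precision * recall / (precision + recall)

def action_f1(pred_calls, gold_calls):
    """Did the model pick the right tools?"""
    return f1([c["api"] for c in pred_calls], [c["api"] for c in gold_calls])

def argument_f1(pred_calls, gold_calls):
    """Did the model fill the right values into the right parameters?"""
    def triples(calls):
        return [(c["api"], k, str(v)) for c in calls for k, v in c["parameters"].items()]
    return f1(triples(pred_calls), triples(gold_calls))

gold = [
    {"api": "getBookAuthor", "parameters": {"title": "Book X"}},
    {"api": "getAuthorEmail", "parameters": {"author": "API_call_0"}},
]
# The prediction picks the right tools but hard-codes a name instead of
# referencing the first call's output (hypothetical example).
pred = [
    {"api": "getBookAuthor", "parameters": {"title": "Book X"}},
    {"api": "getAuthorEmail", "parameters": {"author": "Jane Doe"}},
]
scores = (action_f1(pred, gold), argument_f1(pred, gold))
```

In this example the tool selection is perfect while only one of two arguments matches, which illustrates how Argument F1 can trail Action F1 on nested instances.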

Key Results

Comparative analysis of the Seal-Tools Finetuned model against baseline LLMs (Llama-2, ChatGPT, GPT-4) on the Seal-Tools Hard Test set (involving multiple/nested tools).

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Seal-Tools Test (Hard)	Argument F1	59.33	71.91	+12.58
Seal-Tools Test (Hard)	Action F1	78.96	90.23	+11.27
Seal-Tools Test (Hard)	Format	0.00	99.85	+99.85

Breakdown of performance on Nested Tool Calling specifically, showcasing the capability to handle tool dependencies.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (pp)
Seal-Tools Test (Nested subset)	Argument F1	46.33	62.44	+16.11

Main Takeaways

Off-the-shelf Llama-2-7b-chat fails zero-shot strict format compliance for complex tool use entirely (0.00% across metrics).
Finetuning on Seal-Tools allows a 7B model to outperform GPT-4 on in-domain tasks, particularly in strict parameter formatting and tool selection.
Nested tool calling remains a challenging task; while the finetuned model reaches ~62% F1, it is lower than single-tool performance (~92%), indicating room for improvement in logical planning.
The dataset effectively reduces 'easy' instances: only 6% of tools have no required parameters compared to 34% in ToolBench.

📚 Prerequisite Knowledge

Prerequisites

In-context learning (ICL)
Instruction tuning / Agent tuning
API structure (JSON schema)

Key Terms

Self-instruct: A method where a strong LLM generates instruction-response pairs to create a dataset for fine-tuning other models

Nested tool calling: A scenario where the output of one tool execution is required as an input parameter for a subsequent tool call

ICL: In-Context Learning—prompting an LLM with examples (demonstrations) to guide its generation without updating weights

Hallucination: The tendency of LLMs to generate plausible but incorrect or non-existent facts (e.g., inventing fake tool parameters)

Directed Acyclic Graph (DAG): A graph structure with no loops; used here to describe the dependency flow between multiple tool calls in a single query

Argument F1: A metric evaluating whether the predicted parameter values match the ground truth, balancing precision and recall

Rouge-L: A metric measuring text overlap based on the longest common subsequence, used here to evaluate the similarity of generated tool parameters to reference values
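
For readers unfamiliar with Rouge-L, the following sketch computes a token-level score from the longest common subsequence (LCS). It assumes whitespace tokenization and a balanced F1 combination of LCS precision and recall; library implementations may tokenize differently or weight recall more heavily.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            # Extend the diagonal on a match, else carry the best neighbor.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(pred: str, ref: str) -> float:
    """Balanced F1 over LCS precision and recall (whitespace tokens)."""
    pred_tokens, ref_tokens = pred.split(), ref.split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# LCS here is "the brown fox" (3 tokens) out of 4 tokens on each side.
score = rouge_l("the quick brown fox", "the brown fox jumps")
```

Unlike exact match, this gives partial credit when a predicted parameter value shares an ordered subsequence of words with the reference.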