
Seal-Tools: Self-Instruct Tool Learning Dataset for Agent Tuning and Detailed Benchmark

Mengsong Wu, Tong Zhu, Han Han, Chuanyuan Tan, Xiang Zhang, Wenliang Chen
Soochow University, Suzhou, China
Natural Language Processing and Chinese Computing (2024)
Agent Benchmark

📝 Paper Summary

Tool-use post-training · Benchmark datasets · Synthetic data generation
Seal-Tools is a large-scale tool learning dataset, constructed via a self-instruct method, that features nested tool calls and enables precise, format-controlled evaluation of LLM agents.
Core Problem
Existing tool learning datasets suffer from limited scale, simple instances that can be solved without complex reasoning, duplicated entries, and inaccurate evaluation methods (such as ChatGPT-based scoring) that stem from a lack of strict format control.
Why it matters:
  • Current LLMs hallucinate when generating tool data, leading to unreliable training sets
  • Limited context length in generation leads to repetitive tools and simple queries
  • Existing benchmarks often lack nested tool calling scenarios (where one tool's output feeds another's input), which are critical for real-world agent complexity
Concrete Example: In ToolBench, nearly 34% of tools have no required parameters, making them too easy. A standard LLM generation approach might produce a simple query like 'check weather,' whereas Seal-Tools generates nested instances like 'Find the email of the author of book X,' requiring one tool to find the author and another to find the email using that name.
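The nested case in the example above can be sketched in a few lines of Python. This is only an illustration: the tools `find_author` and `find_email`, their return values, and the `$call_0` placeholder convention for referring to an earlier call's output are all assumptions made for this sketch, not the dataset's actual schema.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tools; names and return values are illustrative only.
def find_author(book: str) -> str:
    return {"Book X": "Jane Doe"}[book]

def find_email(name: str) -> str:
    return {"Jane Doe": "jane.doe@example.com"}[name]

TOOLS: Dict[str, Callable[..., Any]] = {
    "find_author": find_author,
    "find_email": find_email,
}

# A nested instance: the second call depends on the first call's output.
# The "$call_<index>" placeholder is an assumed convention for this sketch.
calls: List[Dict[str, Any]] = [
    {"api": "find_author", "parameters": {"book": "Book X"}},
    {"api": "find_email", "parameters": {"name": "$call_0"}},
]

def run_calls(calls: List[Dict[str, Any]]) -> List[Any]:
    """Execute calls in order, substituting placeholders with earlier results."""
    results: List[Any] = []
    for call in calls:
        args = {}
        for key, value in call["parameters"].items():
            if isinstance(value, str) and value.startswith("$call_"):
                # Replace the placeholder with the referenced call's output.
                value = results[int(value[len("$call_"):])]
            args[key] = value
        results.append(TOOLS[call["api"]](**args))
    return results
```

A query like 'check weather' would need only a single entry in `calls`, whereas here the second call cannot be filled in without first resolving the first.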
Key Novelty
Self-Instruct Pipeline for Nested Tool Data Generation
  • Uses a three-stage generation process (Field → Tool → Instance) to ensure diversity and reduce duplication compared to direct generation
  • Introduces 'nested instances' where tool calls form a directed acyclic graph (output of tool A becomes input of tool B), simulating complex real-world workflows
  • Enforces strict JSON output formats to enable deterministic, rule-based evaluation metrics rather than relying on unstable LLM-based judging
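To make the last point concrete, here is a minimal sketch of how strict JSON output allows rule-based scoring. It assumes a prediction is a JSON list of objects with `api` and `parameters` keys and scores a single instance by exact match on (tool, parameter, value) triples; the key names, the `argument_f1` helper, and the per-instance scoring are assumptions for illustration and may differ from the paper's exact metric definition.

```python
import json
from typing import Set, Tuple

Triple = Tuple[str, str, str]

def to_triples(raw: str) -> Set[Triple]:
    """Parse model output into (api, parameter, value) triples.

    Any output that is not a JSON list of call objects yields no triples,
    so it cannot earn credit.
    """
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return set()
    if not isinstance(calls, list):
        return set()
    triples: Set[Triple] = set()
    for call in calls:
        if not isinstance(call, dict):
            return set()
        params = call.get("parameters", {})
        if not isinstance(params, dict):
            return set()
        for name, value in params.items():
            # Serialize values so that nested structures compare by content.
            triples.add((str(call.get("api")), name, json.dumps(value, sort_keys=True)))
    return triples

def argument_f1(pred_raw: str, gold_raw: str) -> float:
    """Exact-match F1 over argument triples for one instance."""
    pred, gold = to_triples(pred_raw), to_triples(gold_raw)
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this scheme, a free-text answer that never produces valid JSON scores zero regardless of how sensible its content is, which is the trade-off a deterministic metric makes in exchange for not depending on an LLM judge.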
Evaluation Highlights
  • The model fine-tuned on Seal-Tools achieves 71.91% Argument F1 on the Test (Hard) split, compared with 0.00% for Llama-2-7b-chat
  • In nested tool calling scenarios, the fine-tuned model reaches 62.44% Argument F1, supporting the dataset's effectiveness for complex logic
  • Standard models like Llama-2-7b-chat fail completely (0.00% across metrics) on this benchmark due to strict format requirements, highlighting the difficulty of the dataset
Breakthrough Assessment
7/10
Strong contribution in synthetic data generation for agents, particularly for nested tool calls. The strict evaluation metrics are a welcome shift from LLM-as-a-judge, though the method relies heavily on standard self-instruct patterns.