
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

Shadi Iskander, Nachshon Cohen, Zohar S. Karnin, Ori Shapira, Sofia Tolmach
Amazon, Technology Innovation Institute, OriginAI, Technion - Israel Institute of Technology
Conference on Empirical Methods in Natural Language Processing (2024)
Agent Benchmark

📝 Paper Summary

Data-Centric AI, Tool Learning, Synthetic Data Evaluation
The paper introduces intrinsic metrics and automated judges to validate synthetic training data for tool-using LLMs, revealing high error rates in popular benchmarks like ToolBench.
Core Problem
Synthetic datasets for tool-using LLMs are generated without quality checks, so models end up trained on erroneous instructions and invalid API calls.
Why it matters:
  • Current evaluation focuses only on extrinsic model outputs (pass rate), ignoring the root cause of failures: poor training data
  • Resources are wasted tuning models on noisy data containing hallucinations and logic errors
  • Leading benchmarks like ToolBench were created with ChatGPT but never explicitly assessed for quality
Concrete Example: A synthetic instruction might request an API call but omit the parameter values that call needs. The ground-truth API sequence then 'hallucinates' those parameters. A model trained on such examples learns to invent arguments rather than extract them from the instruction.
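The failure described in this example can be illustrated with a small check. The sketch below is a deliberately naive string-matching heuristic, not the paper's method (the paper uses ChatGPT as a judge); the function name `ungrounded_parameters` and the weather instruction are hypothetical and chosen only for illustration. It flags any argument in a ground-truth API call whose value never appears in the instruction text.

```python
from __future__ import annotations

import re
from typing import Any


def _normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ungrounded_parameters(instruction: str, arguments: dict[str, Any]) -> list[str]:
    """Return the names of arguments whose values do not occur in the instruction.

    Hypothetical helper: a crude stand-in for a parameter-alignment check.
    It matches whole tokens only, so paraphrased values would be missed.
    """
    haystack = f" {_normalize(instruction)} "
    missing = []
    for name, value in arguments.items():
        needle = _normalize(str(value))
        # An empty value cannot be grounded in the text, so it is flagged too.
        if not needle or f" {needle} " not in haystack:
            missing.append(name)
    return missing


# Made-up training instance: "units" is never mentioned in the instruction.
instruction = "Get the weather forecast for Paris for the next 3 days."
api_arguments = {"city": "Paris", "days": 3, "units": "metric"}
flagged = ungrounded_parameters(instruction, api_arguments)
print(flagged)
```

A literal match like this would wrongly flag values the instruction states in other words (for example "three" versus `3`), which is one reason an LLM-based judge is a more plausible choice for real data.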
Key Novelty
Intrinsic Quality Evaluation Framework for Tool Data
  • Defines six specific quality criteria for tool-use data: three for the natural language instruction (e.g., Specificity) and three for the API sequence (e.g., Parameter Alignment)
  • Implements automated metrics using ChatGPT to judge these criteria, transforming qualitative checks into standard NLP tasks like extraction and next-sentence prediction
Evaluation Highlights
  • Over 33% of instances in both ToolBench and ToolAlpaca training sets contain parameter alignment errors (missing or hallucinated parameters)
  • Automated metrics show high precision and recall against expert human annotations (agreement, reported as F1, validated on 50 samples per dataset)
  • ToolBench is found to have significantly higher error rates than ToolAlpaca due to higher instruction complexity and inconsistent real-world API documentation
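To make the comparison between an automated judge and human annotators concrete, the following sketch computes precision, recall, and F1 for binary "this instance has an error" flags. The label lists are invented for illustration and are not results from the paper; `precision_recall_f1` is a hypothetical helper name.

```python
from __future__ import annotations


def precision_recall_f1(
    judge_flags: list[bool], human_flags: list[bool]
) -> tuple[float, float, float]:
    """Score automated error flags against human error flags.

    True means "this training instance contains an error". Human labels
    are treated as the reference.
    """
    if len(judge_flags) != len(human_flags):
        raise ValueError("both label lists must have the same length")

    pairs = list(zip(judge_flags, human_flags))
    tp = sum(1 for j, h in pairs if j and h)       # judge and human both flag
    fp = sum(1 for j, h in pairs if j and not h)   # judge flags, human does not
    fn = sum(1 for j, h in pairs if not j and h)   # human flags, judge misses

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Invented labels for five instances (not data from the paper).
judge = [True, True, False, True, False]
human = [True, False, False, True, True]
precision, recall, f1 = precision_recall_f1(judge, human)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With only 50 annotated samples per dataset, each disagreement moves these scores noticeably, so such figures are best read as rough agreement estimates.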
Breakthrough Assessment
7/10
Important contribution to Data-Centric AI for agents. Highlights severe quality issues in standard benchmarks, though the provided text lacks the downstream model performance results to fully prove the impact.