Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs

📝 Paper Summary

Data-Centric AI, Tool Learning, Synthetic Data Evaluation

The paper introduces intrinsic metrics and automated judges to validate synthetic training data for tool-using LLMs, revealing high error rates in popular benchmarks like ToolBench.

Core Problem

Synthetic datasets for tool-using LLMs are generated without quality checks, leading to models being trained on erroneous instructions and invalid API calls.

Why it matters:

Current evaluation focuses only on extrinsic model outputs (pass rate), ignoring the root cause of failures: poor training data
Resources are wasted tuning models on noisy data containing hallucinations and logic errors
Leading benchmarks like ToolBench were created with ChatGPT but never explicitly assessed for quality

Concrete Example: A synthetic instruction might request an API call but fail to provide necessary parameter values in the text. Consequently, the ground-truth API sequence 'hallucinates' these parameters. A model trained on this learns to hallucinate arguments rather than extracting them.

Key Novelty

Intrinsic Quality Evaluation Framework for Tool Data

Defines six specific quality criteria for tool-use data: three for the natural language instruction (e.g., Specificity) and three for the API sequence (e.g., Parameter Alignment)
Implements automated metrics using ChatGPT to judge these criteria, transforming qualitative checks into standard NLP tasks like extraction and next-sentence prediction

Evaluation Highlights

Over 33% of instances in both ToolBench and ToolAlpaca training sets contain parameter alignment errors (missing or hallucinated parameters)
Automated metrics show high precision and recall against expert human annotations, validated on 50 annotated samples per dataset
ToolBench shows higher error rates than ToolAlpaca in instruction specificity and coherence, attributed to higher instruction complexity and inconsistent real-world API documentation

Breakthrough Assessment

7/10

Important contribution to Data-Centric AI for agents. Highlights severe quality issues in standard benchmarks, though the provided text lacks the downstream model performance results to fully prove the impact.

⚙️ Technical Details

Problem Definition

Setting: Data Quality Assessment for Tool-Using LLM Datasets

Inputs: A training instance consisting of an instruction query q and a ground-truth API call sequence S

Outputs: Binary validity scores for six intrinsic quality criteria

Pipeline Flow

Input Processing (Instruction + API Sequence)
Validation (Parallel Checks via ChatGPT)
Score Aggregation
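
The three stages above can be sketched as a small harness. This is a minimal illustration, not the paper's implementation: the `Instance` shape, the validator names, and the "all criteria must pass" aggregation rule are assumptions made for the example, and the placeholder validators stand in for the ChatGPT-based judges.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instance:
    """One training example: an instruction q and its ground-truth API sequence S."""
    instruction: str
    api_calls: List[dict]  # hypothetical shape: {"name": str, "params": dict}


# A validator maps an instance to a binary verdict (True = criterion passes).
Validator = Callable[[Instance], bool]


def evaluate(instance: Instance, validators: Dict[str, Validator]) -> Dict[str, bool]:
    """Run every validator on the instance concurrently and collect the verdicts."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(check, instance) for name, check in validators.items()}
        return {name: future.result() for name, future in futures.items()}


def is_clean(scores: Dict[str, bool]) -> bool:
    """Aggregate (assumed rule): an instance is clean only if every criterion passes."""
    return all(scores.values())


# Placeholder validators; a real judge would call an LLM here instead.
def has_instruction(inst: Instance) -> bool:
    return bool(inst.instruction.strip())


def has_api_calls(inst: Instance) -> bool:
    return len(inst.api_calls) > 0
```

Keeping validators as plain callables means an LLM-backed judge can replace a placeholder without changing the aggregation step.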

System Modules

Specificity Validator (Validation)

Check if instruction contains all necessary details

Model or implementation: gpt-3.5-turbo-0613

Coherence Validator (Validation)

Check if instruction sentences follow a logical order

Model or implementation: gpt-3.5-turbo-0613

Parameter Alignment Validator (Validation)

Check if ground truth API parameters match instruction details

Model or implementation: gpt-3.5-turbo-0613
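
To make the parameter alignment criterion concrete, here is a deliberately simplified, non-LLM stand-in. The paper uses gpt-3.5-turbo-0613 as the judge; the sketch below only checks whether each parameter value literally appears in the instruction, so it will also flag values that are merely inferable. The function names and the call format are hypothetical.

```python
from typing import List


def unsupported_params(instruction: str, api_calls: List[dict]) -> List[str]:
    """Return 'call.param' names whose values do not appear in the instruction text.

    Assumption: each call is a dict shaped like {"name": str, "params": dict}.
    Simplification: case-insensitive substring match, so paraphrased or
    default values are flagged even though a human might accept them.
    """
    text = instruction.lower()
    missing = []
    for call in api_calls:
        for key, value in call.get("params", {}).items():
            if str(value).lower() not in text:
                missing.append(f"{call['name']}.{key}")
    return missing


def parameters_aligned(instruction: str, api_calls: List[dict]) -> bool:
    """Binary verdict: True when every parameter value is grounded in the instruction."""
    return not unsupported_params(instruction, api_calls)
```

For example, an instruction that names two cities but no date, paired with a call that includes a `date` argument, would be flagged for that one parameter.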

Novel Architectural Elements

Transformation of qualitative data assessment into standard NLP proxy tasks (Extraction, Next Sentence Prediction) to improve judge reliability

Modeling

Base Model: gpt-3.5-turbo-0613 (used as the automated judge)

Training Method: The paper focuses on data evaluation, which uses pre-trained ChatGPT as the judge. Downstream model training experiments are mentioned, but their details are missing from the provided text.

Adaptation: Prompt-based evaluation (no fine-tuning of the judge)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolBench/ToolAlpaca: This paper evaluates the *quality* of these datasets, whereas the original papers focused on generation and model training
vs. LIMA: Applies the 'less is more' data quality hypothesis specifically to the domain of Tool Usage (API calls)
vs. T-Eval: T-Eval evaluates the *model* intrinsically (sub-tasks), whereas this paper evaluates the *data* intrinsically

Limitations

Evaluation relies on ChatGPT (GPT-3.5), which may have its own biases or errors as a judge
Manual annotation was limited to a small sample (50 pairs per dataset) due to labor intensity
Analysis is limited to English language instructions
Provided text ends at Section 4.3.2; results regarding the impact on downstream model training are missing

Reproducibility

The paper describes the logic of the prompts and the methodology for mapping criteria to NLP tasks; the specific prompts are referenced as being in Appendix A.2. Annotated datasets (50 pairs each) were created, but no repository URL is provided in the text.

📊 Experiments & Results

Evaluation Setup

Intrinsic assessment of dataset quality using automated metrics compared against human annotation

Benchmarks:

ToolBench: tool learning with complex, real-world APIs
ToolAlpaca: tool learning with simpler, synthetic documentation

Metrics:

Agreement (Precision/Recall/F1) of automated metrics vs human labels
Error Rate (Percentage of invalid instances in dataset)
Statistical methodology: Not explicitly reported in the paper
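
Both metrics can be computed from per-instance binary labels. The sketch below assumes that `True` marks an instance judged erroneous (the positive class) and that automated and human labels are aligned by position; these conventions and the function names are choices made for the example, not details taken from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(predicted: Sequence[bool], gold: Sequence[bool]) -> Tuple[float, float, float]:
    """Score automated labels against human labels; True marks an erroneous instance."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    tp = sum(p and g for p, g in zip(predicted, gold))        # flagged by both
    fp = sum(p and not g for p, g in zip(predicted, gold))    # flagged only by the metric
    fn = sum(g and not p for p, g in zip(predicted, gold))    # flagged only by the human
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def error_rate(flags: Sequence[bool]) -> float:
    """Percentage of instances flagged as invalid."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0
```

With only 50 annotated pairs per dataset, a single disagreement moves precision or recall noticeably, so these scores are best read as rough agreement estimates.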

Key Results

The following results quantify the noise levels in standard tool-learning datasets, revealing significant quality issues.
Benchmark	Metric	Baseline	This Paper	Δ
ToolBench (Train)	Parameter Alignment Error Rate (%)	0	33	+33
ToolAlpaca (Train)	Parameter Alignment Error Rate (%)	0	33	+33

Main Takeaways

Significant noise exists in current SOTA tool-learning datasets; >33% of training examples contain parameter hallucinations or missing values.
ToolBench contains a much higher percentage of errors in instruction specificity and coherence compared to ToolAlpaca, likely due to its use of complex real-world APIs vs ToolAlpaca's cleaner synthetic scope.
Automated metrics using ChatGPT (via proxy tasks like extraction and NSP) achieve high alignment with human judgment, offering a scalable way to filter these large datasets.
Quality criteria must cover both the instruction (Input) and the API sequence (Output); errors are prevalent in both.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Tool Learning/Tool-Use in LLMs
Familiarity with synthetic data generation techniques (e.g., Self-Instruct)
Basic knowledge of API structures (parameters, calls)

Key Terms

Tool-using LLM: An LLM trained to generate API calls to external tools to fulfill a user request

Intrinsic Evaluation: Evaluating the quality of the data itself (the input/output pairs) rather than the performance of a model trained on it

ICE: In-Context Evaluation—a proposed metric where a data instance is evaluated by its helpfulness as a few-shot example for a proxy task

Parameter Alignment: A quality criterion checking if API call parameters are actually present in or inferable from the instruction text

Specificity: A quality criterion checking if the instruction contains all necessary details to formulate the API request

Coherence: A quality criterion checking if the requests in an instruction are logically related and ordered correctly

SFT: Supervised Fine-Tuning—training the model on labeled instruction-response pairs