Evaluation Setup
Intrinsic assessment of dataset quality using automated metrics compared against human annotation
Benchmarks:
- ToolBench (tool learning with complex, real-world APIs)
- ToolAlpaca (tool learning with simpler, synthetic documentation)
Metrics:
- Agreement (Precision/Recall/F1) of automated metrics vs human labels
- Error Rate (Percentage of invalid instances in dataset)
- Statistical methodology: Not explicitly reported in the paper
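The two metrics above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes each instance carries a boolean "invalid" flag from the automated metric and from the human annotator, and it treats "invalid" as the positive class. The names `agreement`, `error_rate`, and `Agreement` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Agreement:
    precision: float
    recall: float
    f1: float


def agreement(auto_flags: Sequence[bool], human_flags: Sequence[bool]) -> Agreement:
    """Precision/recall/F1 of automated 'invalid' flags against human labels.

    Assumption: True means the instance was judged invalid (positive class).
    """
    if len(auto_flags) != len(human_flags):
        raise ValueError("auto_flags and human_flags must have the same length")

    pairs = list(zip(auto_flags, human_flags))
    tp = sum(1 for a, h in pairs if a and h)        # both say invalid
    fp = sum(1 for a, h in pairs if a and not h)    # metric flags, human does not
    fn = sum(1 for a, h in pairs if not a and h)    # human flags, metric misses

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Agreement(precision, recall, f1)


def error_rate(invalid_flags: Sequence[bool]) -> float:
    """Percentage of instances flagged invalid (0.0 for an empty dataset)."""
    if not invalid_flags:
        return 0.0
    return 100.0 * sum(1 for f in invalid_flags if f) / len(invalid_flags)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    auto = [True, True, False, False, True]
    human = [True, False, False, True, True]
    print(agreement(auto, human))
    print(error_rate(human))
```

Treating "invalid" as the positive class is a design choice for this sketch; the summary does not say which class the paper uses.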
Key Results
The following results quantify the noise levels in standard tool-learning datasets, revealing significant quality issues.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ToolBench (Train) | Parameter Alignment Error Rate (%) | 0 | 33 | +33 |
| ToolAlpaca (Train) | Parameter Alignment Error Rate (%) | 0 | 33 | +33 |
Main Takeaways
- Current state-of-the-art tool-learning datasets are noisy: more than 33% of training examples contain hallucinated or missing parameter values.
- ToolBench has a much higher rate of instruction specificity and coherence errors than ToolAlpaca, likely because it draws on complex real-world APIs whereas ToolAlpaca uses simpler synthetic documentation.
- Automated metrics using ChatGPT (via proxy tasks like extraction and NSP) achieve high alignment with human judgment, offering a scalable way to filter these large datasets.
- Quality criteria must cover both the instruction (Input) and the API sequence (Output); errors are prevalent in both.
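
To make the filtering idea concrete, the sketch below drops examples whose API parameter values are missing or cannot be found in the instruction. It is a deliberately crude stand-in: the paper's automated metrics rely on ChatGPT, whereas this uses plain substring matching. The example schema (`instruction`, `api_calls`, `name`, `arguments`) and the function names are assumptions made for illustration only.

```python
from typing import Any, Dict, List, Tuple

Example = Dict[str, Any]


def parameter_issues(instruction: str,
                     api_calls: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (api_name, parameter, issue) triples for one example.

    Assumption: a value is 'hallucinated' if its string form does not appear
    in the instruction, and 'missing' if it is None or blank. A real checker
    would need to handle paraphrases, defaults, and derived values.
    """
    text = instruction.lower()
    issues = []
    for call in api_calls:
        for param, value in call.get("arguments", {}).items():
            if value is None or str(value).strip() == "":
                issues.append((call["name"], param, "missing"))
            elif str(value).lower() not in text:
                issues.append((call["name"], param, "hallucinated"))
    return issues


def filter_dataset(examples: List[Example]) -> List[Example]:
    """Keep only examples with no parameter issues."""
    return [
        ex for ex in examples
        if not parameter_issues(ex["instruction"], ex["api_calls"])
    ]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    data = [
        {"instruction": "Get the weather in Paris",
         "api_calls": [{"name": "get_weather", "arguments": {"city": "Paris"}}]},
        {"instruction": "Get the weather",
         "api_calls": [{"name": "get_weather", "arguments": {"city": "Berlin"}}]},
        {"instruction": "Get the weather in Rome",
         "api_calls": [{"name": "get_weather", "arguments": {"city": ""}}]},
    ]
    kept = filter_dataset(data)
    print(len(kept), "of", len(data), "examples kept")
```

This check covers only the API sequence (Output) side. Instruction-side criteria such as specificity and coherence would need a separate check.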