Evaluation Setup
Validation of the tool generation pipeline's scale/diversity and the simulator's reliability
Benchmarks:
- ACEBench (Tool-use simulation; used here to test simulator fidelity against ground truth)
- SynthTools Internal Benchmark (Synthetic tool generation and simulation) [New]
Metrics:
- Simulation Accuracy (Agreement with ground truth/rules)
- Audit Accuracy (Ability to detect errors)
- Diversity (Number of fields/tools)
- Statistical methodology: Manual inspection and LLM-as-a-judge verification
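The two accuracy metrics above can both be read as agreement rates against a reference. The following is a minimal sketch under that assumption, not the paper's scoring code: the exact-match comparison, the function name `agreement_accuracy`, and all sample labels are hypothetical and chosen only for illustration.

```python
def agreement_accuracy(predicted, reference):
    """Return the fraction of positions where predicted equals reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty sample")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Hypothetical simulator outcomes versus ground-truth outcomes (assumed labels).
simulated = ["success", "error", "success", "success"]
ground_truth = ["success", "error", "error", "success"]
simulation_accuracy = agreement_accuracy(simulated, ground_truth)

# Hypothetical audit flags versus whether an error was actually present.
audit_flags = [False, True, True, False]
true_errors = [False, True, False, False]
audit_accuracy = agreement_accuracy(audit_flags, true_errors)

print(f"simulation accuracy: {simulation_accuracy:.2f}")
print(f"audit accuracy: {audit_accuracy:.2f}")
```

Under this reading, Simulation Accuracy compares simulated tool responses to the expected ones, while Audit Accuracy compares the auditor's error flags to the known presence of errors. A real evaluation may need a softer comparison than exact match, for example the LLM-as-a-judge verification listed above.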
Key Results
Scale and diversity analysis showing SynthTools significantly exceeds prior hand-crafted or API-scraped baselines.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| N/A (Dataset statistics) | Number of Fields | 8 | 100 | +92 |
| N/A (Dataset statistics) | Tools per Field | 500 | 1000 | +500 |

Reliability experiments validating the Tool Simulation module against both SynthTools-generated tools and external benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| SynthTools Internal | Accuracy (%) | N/A | 97 | N/A |
| ACEBench | Accuracy (%) | 100 | 94 | -6 |
| SynthTools Internal | Accuracy (%) | N/A | 99 | N/A |
Main Takeaways
- The hierarchical generation pipeline successfully creates diverse tools; embedding-based deduplication removed only 9% of tools, meaning 91% of the generated tools were retained as unique
- The LLM-based simulator is highly reliable (94% accuracy on ACEBench), making it a feasible replacement for hard-coded sandboxes
- Even state-of-the-art models struggle with tasks generated from these tools, confirming they present a meaningful challenge (though exact agent performance numbers are not the primary focus)
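The first takeaway relies on embedding-based deduplication. The sketch below shows one common way such a step can work: greedy filtering by cosine similarity, after which the removed fraction is computed. It is an illustration under stated assumptions, not the paper's implementation. The 0.95 threshold, the toy three-dimensional vectors, and the function names are all hypothetical; a real pipeline would embed tool descriptions with a text-embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95):
    """Greedily keep indices whose vectors are below `threshold` similarity
    to every previously kept vector. The threshold is an assumed value."""
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine_similarity(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy stand-ins for tool-description embeddings (hypothetical data).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.01, 0.0],  # near-duplicate of the first vector
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

kept = deduplicate(toy_embeddings)
removed_fraction = 1 - len(kept) / len(toy_embeddings)
print(f"kept indices: {kept}, removed fraction: {removed_fraction:.2f}")
```

On this reading, the reported 9% corresponds to `removed_fraction` and the 91% uniqueness to its complement. The greedy pass compares each candidate with every kept item, so it scales quadratically; a collection of the size reported here would likely need an approximate nearest-neighbor index instead.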