| Benchmark | Metric | Baseline | ToolCoder | Δ (points) |
|---|---|---|---|---|
| Public Library Results: ToolCoder significantly outperforms baselines on standard libraries, especially when using the 2B-parameter model. | ||||
| NumpyEval | pass@1 | 31.47 | 41.58 | +10.11 |
| TorchDataEval | pass@1 | 6.00 | 11.80 | +5.80 |
| Private Library Results: ToolCoder demonstrates strong generalization to libraries completely unseen during pre-training by leveraging documentation search. | ||||
| MonkeyEval | pass@1 | 1.59 | 3.02 | +1.43 |
| BeatNumEval | pass@1 | 5.94 | 6.93 | +0.99 |
| Ablation Studies: The results confirm the necessity of both the tool and the query generation step. | ||||
| NumpyEval | pass@1 | 33.76 | 35.64 | +1.88 |
| NumpyEval | pass@1 | 14.05 | 35.64 | +21.59 |
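
For readers unfamiliar with the metric, pass@1 is commonly reported as the percentage of benchmark problems solved by a single generated sample. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. It is illustrative only: the table does not state how the scores were computed, and the names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the tests
    k: sample budget being evaluated (k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    results: one (n, c) pair per benchmark problem (assumed input format).
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With one sample per problem (n = k = 1), this reduces to the fraction of problems whose single sample passes, which matches the usual reading of a pass@1 column. Under that reading, the Δ column is the absolute difference in percentage points.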