| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Main comparison on the AskToAct-Eval benchmark, showing gains over base Llama-3 and ToolLLaMA. | ||||
| AskToAct-Eval | Intent Recovery Rate (IRR) | 45.00 | 57.08 | +12.08 |
| AskToAct-Eval | Intent Recovery Rate (IRR) | 35.84 | 57.08 | +21.24 |
| AskToAct-Eval | Intent Recovery Rate (IRR) | 54.79 | 57.08 | +2.29 |
| AskToAct-Eval | Average Turn Number | 4.11 | 3.68 | -0.43 |
| Generalization to unseen APIs (Level-3): the IRR gain over the baseline persists. | ||||
| AskToAct-Eval (Level-3) | Intent Recovery Rate (IRR) | 41.67 | 51.19 | +9.52 |
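
The Δ column appears to follow the convention Δ = This Paper − Baseline. The sketch below recomputes it from the values in the table as a sanity check. The names `ROWS`, `delta`, and `check_rows` are hypothetical helpers introduced here for illustration and are not part of the paper.

```python
# Rows copied from the table above:
# (benchmark, metric, baseline, this_paper, reported_delta)
ROWS = [
    ("AskToAct-Eval", "Intent Recovery Rate (IRR)", 45.00, 57.08, 12.08),
    ("AskToAct-Eval", "Intent Recovery Rate (IRR)", 35.84, 57.08, 21.24),
    ("AskToAct-Eval", "Intent Recovery Rate (IRR)", 54.79, 57.08, 2.29),
    ("AskToAct-Eval", "Average Turn Number", 4.11, 3.68, -0.43),
    ("AskToAct-Eval (Level-3)", "Intent Recovery Rate (IRR)", 41.67, 51.19, 9.52),
]


def delta(baseline, ours):
    """Return ours - baseline, rounded to two decimals as in the table."""
    return round(ours - baseline, 2)


def check_rows(rows, tol=1e-9):
    """Return the rows whose reported delta disagrees with the recomputed one."""
    return [r for r in rows if abs(delta(r[2], r[3]) - r[4]) > tol]


if __name__ == "__main__":
    for benchmark, metric, base, ours, reported in ROWS:
        print(f"{benchmark} | {metric}: {delta(base, ours):+.2f} (reported {reported:+.2f})")
    # An empty list means every reported delta matches the recomputation.
    print("mismatches:", check_rows(ROWS))
```

Under this sign convention, the negative Δ for Average Turn Number means this paper's value is lower than the baseline's, that is, fewer turns on average.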