| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Performance in out-of-distribution (OOD) dynamic environments (P_c in prompt, P_s_OOD on server). This is the most challenging setting, in which the server tools differ significantly from the prompt tools. | ||||
| ToolQA-D (Average) | Accuracy | 14.7 | 44.95 | +30.25 |
| ToolQA-D (Average) | Accuracy | 31.75 | 44.95 | +13.2 |
| Performance in static environments (P_c in prompt, P_c on server). Tests whether dynamic training hurts static performance. | ||||
| ToolQA-D (Average) | Accuracy | 49.85 | 48.5 | -1.35 |
| Ablation study of module components in the OOD dynamic environment. | ||||
| ToolQA-D (Average) | Accuracy | 38.65 | 44.95 | +6.3 |
| ToolQA-D (Average) | Accuracy | 28.6 | 44.95 | +16.35 |
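
The Δ column can be checked with a few lines of arithmetic. The sketch below assumes Δ is simply "This Paper" minus "Baseline", in accuracy points; the names `rows` and `recompute_deltas` are illustrative, and the section tags are shorthand for the table's section rows, since the individual baselines are not named in the table.

```python
# (section, baseline, this_paper, reported_delta), copied from the table above.
# Section tags are shorthand labels, not names used by the paper.
rows = [
    ("OOD dynamic", 14.7, 44.95, 30.25),
    ("OOD dynamic", 31.75, 44.95, 13.2),
    ("static", 49.85, 48.5, -1.35),
    ("ablation", 38.65, 44.95, 6.3),
    ("ablation", 28.6, 44.95, 16.35),
]


def recompute_deltas(rows):
    """Return this_paper - baseline for each row, rounded to two decimals."""
    return [round(this_paper - baseline, 2) for _, baseline, this_paper, _ in rows]


deltas = recompute_deltas(rows)
for (section, baseline, this_paper, reported), delta in zip(rows, deltas):
    # Assumption: a reported value matches if it agrees to within float noise.
    status = "ok" if abs(delta - reported) < 1e-9 else "MISMATCH"
    print(f"{section:12s} {baseline:6.2f} -> {this_paper:6.2f}  delta={delta:+.2f}  {status}")
```

Under this assumption, all five reported Δ values agree with the recomputed differences.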