| Benchmark | Metric | Baseline | This Paper | Δ (This Paper - Baseline) |
|---|---|---|---|---|
| Performance of various LLMs using ReAct in the Tool-Retrieval setting (the harder, more realistic setting). | | | | |
| ETAPP (Tool-Retrieval) | Procedure (PRC) | 3.70 | 3.82 | +0.12 |
| ETAPP (Tool-Retrieval) | Personalization (PSN) | 3.43 | 3.54 | +0.11 |
| ETAPP (Tool-Retrieval) | Proactivity (PTV) | 1.56 | 1.65 | +0.09 |
| Impact of Fine-Tuning (FT) on Qwen2.5-7B-Instruct. 'ID' = In-Domain (seen user/instruction types), 'OOD' = Out-of-Domain. | | | | |
| ETAPP (Subset) | Procedure (PRC) on ID Data | 2.76 | 3.47 | +0.71 |
| ETAPP (Subset) | Proactivity (PTV) on ID Data | 1.35 | 1.99 | +0.64 |
| ETAPP (Subset) | Procedure (PRC) on OOD Data | 2.91 | 3.52 | +0.61 |
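
The Δ column appears to be the simple difference between the "This Paper" and "Baseline" scores. The sketch below recomputes that difference for each row and flags any mismatch with the reported value. The scores are transcribed from the table; the names `ROWS`, `recompute_delta`, and `find_mismatches` are hypothetical helpers introduced only for this illustration, and rounding to two decimals is an assumption based on the precision shown in the table.

```python
import math

# Rows transcribed from the table: (benchmark, metric, baseline, this_paper, reported_delta).
# The tuple layout is an assumption made for this sketch, not a format from the paper.
ROWS = [
    ("ETAPP (Tool-Retrieval)", "Procedure (PRC)", 3.70, 3.82, 0.12),
    ("ETAPP (Tool-Retrieval)", "Personalization (PSN)", 3.43, 3.54, 0.11),
    ("ETAPP (Tool-Retrieval)", "Proactivity (PTV)", 1.56, 1.65, 0.09),
    ("ETAPP (Subset)", "Procedure (PRC) on ID Data", 2.76, 3.47, 0.71),
    ("ETAPP (Subset)", "Proactivity (PTV) on ID Data", 1.35, 1.99, 0.64),
    ("ETAPP (Subset)", "Procedure (PRC) on OOD Data", 2.91, 3.52, 0.61),
]


def recompute_delta(baseline: float, this_paper: float) -> float:
    """Return this_paper - baseline, rounded to the table's two-decimal precision."""
    return round(this_paper - baseline, 2)


def find_mismatches(rows):
    """Return the rows whose reported delta differs from the recomputed one."""
    mismatches = []
    for benchmark, metric, baseline, this_paper, reported in rows:
        delta = recompute_delta(baseline, this_paper)
        # Compare with a small tolerance to avoid floating-point noise.
        if not math.isclose(delta, reported, abs_tol=1e-9):
            mismatches.append((benchmark, metric, reported, delta))
    return mismatches


if __name__ == "__main__":
    for benchmark, metric, baseline, this_paper, _ in ROWS:
        delta = recompute_delta(baseline, this_paper)
        print(f"{benchmark} | {metric}: {delta:+.2f}")
    print("mismatches:", find_mismatches(ROWS))
```

Running the script prints one signed difference per row followed by the list of rows whose reported Δ disagrees with the recomputed value.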