| Benchmark | Metric | Baseline (%) | This Paper (%) | Δ (pp) |
|---|---|---|---|---|
| Main comparison on ToolBench, showing ConAgents' superiority over baselines across difficulty levels (I1, I2, I3). Pass Rate (PR) is the primary metric. | ||||
| ToolBench (I1 - Easy) | Pass Rate | 56.4 | 60.0 | +3.6 |
| ToolBench (I2 - Medium) | Pass Rate | 53.6 | 62.4 | +8.8 |
| ToolBench (I3 - Hard) | Pass Rate | 50.0 | 60.7 | +10.7 |
| Results for open-source models (Llama-2-13B) enhanced with SPAN distillation compared to monolithic baselines. | ||||
| ToolBench (Avg) | Pass Rate | 47.9 | 52.3 | +4.4 |
| Ablation studies on the Review Agent's impact. | ||||
| ToolBench | Pass Rate | 54.6 | 61.0 | +6.4 |
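
The Δ column can be re-derived from the two pass-rate columns. The sketch below assumes Δ is the plain difference in percentage points (This Paper minus Baseline), which matches the signs and magnitudes in the table. The names `ROWS` and `delta_points` are hypothetical and used only for illustration, and the row labels are shortened descriptions, not identifiers from the paper.

```python
# Rows copied from the table: (label, baseline pass rate, reported pass rate).
# Labels are shortened for illustration; the values are taken verbatim.
ROWS = [
    ("I1 - Easy", 56.4, 60.0),
    ("I2 - Medium", 53.6, 62.4),
    ("I3 - Hard", 50.0, 60.7),
    ("Llama-2-13B (Avg)", 47.9, 52.3),
    ("Review Agent ablation", 54.6, 61.0),
]


def delta_points(baseline: float, reported: float) -> float:
    """Return the difference in percentage points, rounded to one decimal.

    Assumption: the table's delta is an absolute difference, not a relative gain.
    """
    return round(reported - baseline, 1)


if __name__ == "__main__":
    for label, baseline, reported in ROWS:
        print(f"{label}: {delta_points(baseline, reported):+.1f} pp")
```

Rounding to one decimal avoids floating-point residue such as `3.6000000000000014` when subtracting the tabulated values.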