Evaluation Setup
Task-based evaluation in which an agent must solve problems using a composed set of tools and sub-agents.
Benchmarks:
- GAIA: General AI Assistants (reasoning, tool use)
- SimpleQA: factuality evaluation (short answers)
- MedQA: clinical knowledge (USMLE style)
- MAC Benchmarking Dataset: multi-agent collaboration (Travel, Mortgage domains)
Metrics:
- Success Rate
- Component Cost ($)
- Pareto Frontier position
- Statistical methodology: Not explicitly reported in the paper
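As a rough illustration of how the "Pareto Frontier position" metric relates to the other two, the sketch below finds which configurations are non-dominated given a (cost, success rate) pair for each. The `Result` type, the method names, and all numbers are hypothetical placeholders, not values from the paper.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    cost: float          # component cost in dollars
    success_rate: float  # fraction of tasks solved, in [0, 1]


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Return results that no other result beats on both cost and success rate."""
    # Sweep in order of ascending cost (ties: higher success rate first).
    ordered = sorted(results, key=lambda r: (r.cost, -r.success_rate))
    frontier: list[Result] = []
    best_success = float("-inf")
    for r in ordered:
        # A result is kept only if it beats every cheaper (or equally cheap)
        # result on success rate; exact duplicates are dropped.
        if r.success_rate > best_success:
            frontier.append(r)
            best_success = r.success_rate
    return frontier


# Hypothetical numbers, for illustration only.
results = [
    Result("identity", 30.0, 0.40),
    Result("retrieval", 5.0, 0.35),
    Result("knapsack", 10.0, 0.55),
    Result("random", 12.0, 0.30),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)  # ['retrieval', 'knapsack']
```

A configuration "lies on the frontier" when no alternative is both cheaper (or equal in cost) and more successful.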
Key Results
Multi-agent experiments demonstrate that the Online Knapsack composer significantly outperforms baselines when selecting from a large inventory of agents. Single-agent experiments show consistent improvements over retrieval baselines across multiple datasets.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GAIA/MedQA/SimpleQA | Cost-adjusted Performance | Not reported in the paper | Not reported in the paper | +80 |
Main Takeaways
- Online Knapsack Composer consistently lies on the Pareto frontier, offering the best trade-off between success rate and cost across all datasets.
- Pure retrieval approaches perform poorly because semantic descriptions often fail to capture the actual executable utility of a tool.
- The method scales well to large inventories (100+ agents), where the simple 'Identity' baseline (using all available agents) fails because delegating among that many agents becomes too complex.
- Combining Online Knapsack selection with prompt optimization (AvaTaR) yields the highest overall performance in the $30 budget setting.
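To make budget-constrained selection concrete, here is a minimal sketch of a classic threshold rule for the online knapsack problem. It is not necessarily the composer used in the paper, whose details this summary does not give. It assumes each candidate arrives with an estimated value and a dollar cost, and that the value-per-dollar ratio is known to lie between `lower` and `upper`. All names and numbers are hypothetical.

```python
import math


def threshold(z: float, lower: float, upper: float) -> float:
    """Minimum value-per-dollar required once a fraction z of the budget is spent."""
    # The bar rises exponentially as the budget fills up.
    return (upper * math.e / lower) ** z * (lower / math.e)


def compose_online(candidates, budget, lower, upper):
    """Accept or reject each candidate in arrival order; decisions are never revisited.

    candidates: iterable of (name, estimated_value, cost) tuples (assumed format).
    """
    selected, spent = [], 0.0
    for name, value, cost in candidates:
        if cost <= 0 or spent + cost > budget:
            continue  # skip invalid costs and anything that would exceed the budget
        if value / cost >= threshold(spent / budget, lower, upper):
            selected.append(name)
            spent += cost
    return selected, spent


# Hypothetical inventory: (name, estimated value, cost in dollars).
inventory = [
    ("search", 8.0, 10.0),
    ("calculator", 0.1, 10.0),
    ("browser", 6.0, 10.0),
    ("coder", 1.0, 10.0),
    ("medical_kb", 9.0, 10.0),
]
chosen, total_cost = compose_online(inventory, budget=30.0, lower=0.01, upper=1.0)
print(chosen, total_cost)
```

Because the acceptance bar rises with spending, low-value candidates are filtered out more aggressively as the budget is used up, which leaves room for high-value candidates that arrive later.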