Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection

📝 Paper Summary

Agentic system design; Tool selection and discovery

The paper formulates agent composition as an online knapsack problem, using a composer agent to dynamically test and select components that maximize utility within a budget.

Core Problem

Selecting optimal tools or sub-agents from large inventories is difficult because static descriptions rarely match real-world performance, and traditional retrieval ignores cost-utility trade-offs.

Why it matters:

Developers face a 'paradox of choice' with combinatorial explosions of possible agent/tool configurations
Static retrieval methods fail when capability descriptions are opaque or when task requirements shift unpredictably
Selecting components without considering budget constraints leads to inefficient, high-cost systems

Concrete Example: When composing an information-seeking agent, a static retriever might select a specialized scientific search tool based on metadata. However, dynamic testing might reveal that a cheaper, generalized web search tool handles the specific queries equally well, or conversely, that the specialized tool is strictly necessary for medical queries despite the cost.

Key Novelty

Composer Agent with Online Knapsack Optimization

Formalizes component selection as a Knapsack problem: maximizing success probability (value) subject to a budget (weight)
Introduces an 'Online Knapsack Composer' that iteratively tests components in a sandbox environment to estimate their true 'value' (utility) in real-time rather than relying on static embeddings
Uses the ZCL algorithm to make dynamic accept/reject decisions for components based on their empirically determined value-to-cost ratio

Architecture

The workflow of the Online Knapsack Composer.

Evaluation Highlights

Increases multi-agent success rate from 37% (baseline) to 87% when selecting from an inventory of 100+ agents
Achieves up to 31.6% success rate improvement in single-agent setups compared to retrieval baselines
Demonstrates up to 80% cost-adjusted performance gains over retrieval-based baselines, consistently lying on the Pareto frontier

Breakthrough Assessment

8/10

Novel application of classical operations research (Online Knapsack) to agentic engineering. The shift from static retrieval to dynamic sandboxing for value estimation addresses a fundamental reliability bottleneck in agent composition.

⚙️ Technical Details

Problem Definition

Setting: Constrained optimization (Knapsack Problem)

Inputs: Target task description x, Budget B, Component Inventory A (with costs c and descriptions d)

Outputs: Optimal subset of components S* that maximizes success probability p(S)
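In symbols, the setting above can be read as a budget-constrained subset selection problem. The per-component cost notation c_a and the assumption that costs add up across components are ours; the summary only states that components carry costs c and that p(S) is the success probability of the composed set.

```latex
S^{*} \;=\; \arg\max_{S \subseteq A} \; p(S)
\qquad \text{subject to} \qquad \sum_{a \in S} c_a \;\le\; B
```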

Pipeline Flow

Task Analysis: Composer parses task into required skills
Candidate Retrieval: Retrieve top-K components per skill
Dynamic Evaluation: Iterative sandboxing (Algorithm 1) tests components
Selection: ZCL Algorithm decides inclusion based on test results and budget

System Modules

Skill Generator

Parses the task description into a list of required core skills

Model or implementation: Claude 3.5 Sonnet / Haiku / 3.7 Sonnet (depending on experiment)

Candidate Retriever

Retrieves potentially relevant components for each skill from the inventory

Model or implementation: BGE-Large-English (embedding model)
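A minimal sketch of per-skill top-K retrieval by cosine similarity. The three-dimensional vectors, the tool names, and the `top_k` helper are hypothetical stand-ins; in the paper the embeddings come from BGE-Large-English, which is not called here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(skill_vec, inventory, k):
    """Return the k component names whose description embeddings
    are most similar to the skill embedding."""
    ranked = sorted(
        inventory,
        key=lambda name: cosine(skill_vec, inventory[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumed values, for illustration only).
inventory = {
    "web_search": [0.9, 0.1, 0.0],
    "pubmed_search": [0.7, 0.7, 0.0],
    "calculator": [0.0, 0.1, 0.9],
}
skill = [1.0, 0.2, 0.0]
candidates = top_k(skill, inventory, k=2)
print(candidates)
```

Retrieval only narrows the inventory; the candidates it returns still have unverified utility, which is why the next module tests them.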

Sandboxing Engine

Generates test queries and executes components to measure real-world utility

Model or implementation: Same as Composer Model
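One way to picture the value estimate is as the fraction of test queries a component answers correctly. The sketch below assumes that reading; `estimate_value`, the toy component, and the substring check are illustrative stand-ins for the paper's LLM-generated queries and whatever grading it applies.

```python
from typing import Callable, List, Tuple

def estimate_value(
    component: Callable[[str], str],
    tests: List[Tuple[str, str]],
) -> float:
    """Empirical utility: share of test queries whose expected answer
    appears in the component's output. Crashes count as failures."""
    if not tests:
        return 0.0
    passed = 0
    for query, expected in tests:
        try:
            answer = component(query)
        except Exception:
            continue  # a failing call is treated as a failed test
        if expected.lower() in answer.lower():
            passed += 1
    return passed / len(tests)

def toy_search(query: str) -> str:
    """Hypothetical component: a lookup that raises on unknown queries."""
    facts = {
        "capital of france": "Paris",
        "boiling point of water in celsius": "100",
    }
    return facts[query.lower()]

tests = [
    ("Capital of France", "paris"),
    ("Boiling point of water in Celsius", "100"),
    ("Tallest mountain", "Everest"),
]
value = estimate_value(toy_search, tests)
print(round(value, 3))
```

Because the estimate depends entirely on the test set, weak or unrepresentative queries translate directly into a misleading value, which matches the limitation noted later in this summary.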

ZCL Selector

Decides whether to add a component to the final set based on value-to-cost ratio and remaining budget

Model or implementation: Algorithmic (Non-LLM)
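The accept/reject rule can be sketched with the exponential threshold usually associated with ZCL: an item is taken only if its value-to-cost ratio clears a bar that rises as the budget fills. This is a sketch under assumptions, not the paper's implementation: the bounds `low` and `high` on the ratio are assumed known in advance, the example values are invented, and the paper's exact parameterization is not given in this summary.

```python
import math

def zcl_threshold(z, low, high):
    """Minimum acceptable value/cost ratio when a fraction z of the
    budget is already spent; low and high bound the ratio (assumed)."""
    return (high * math.e / low) ** z * (low / math.e)

def online_select(stream, budget, low, high):
    """Process (name, value, cost) items one at a time and decide
    irrevocably whether to add each to the selected set."""
    selected, spent = [], 0.0
    for name, value, cost in stream:
        if cost <= 0 or spent + cost > budget:
            continue  # invalid cost or would exceed the budget
        z = spent / budget
        if value / cost >= zcl_threshold(z, low, high):
            selected.append(name)
            spent += cost
    return selected, spent

# Hypothetical sandbox results: (component, estimated value, cost).
stream = [
    ("web_search", 0.6, 5.0),
    ("pubmed_search", 0.3, 4.0),
    ("calculator", 0.5, 1.0),
    ("premium_db", 0.9, 8.0),
]
selected, spent = online_select(stream, budget=10.0, low=0.05, high=1.0)
print(selected, spent)
```

In this toy run the second item is rejected because half the budget is already spent and its ratio falls below the raised threshold, while the last item is skipped because it would exceed the budget.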

Novel Architectural Elements

Online Knapsack Composer: A feedback loop where component 'value' is not static metadata but dynamically computed via active testing (sandboxing)
Integration of ZCL algorithm for budget-aware streaming selection of AI components

Modeling

Base Model: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3.7 Sonnet (used variously for Composer and Agents)

Compute: Not reported in the paper (Inference-only approach, no training)

Comparison to Prior Work

vs. ToolLLM/Retrieval: Proposed method tests tools dynamically (sandboxing) rather than relying solely on semantic similarity of descriptions
vs. DyLAN/AgentPrune: Formulates selection as a Knapsack problem (budget constraints) rather than graph optimization or redundancy pruning
vs. Offline Knapsack: Estimates value via real-time testing instead of static similarity scores
vs. GPT-4-based manual selection [not cited in paper]: Fully automated loop that scales to hundreds of components without human intervention

Limitations

Relies on the quality of generated test questions; poor test questions lead to inaccurate utility estimation
Sandboxing adds computational overhead and latency compared to pure retrieval methods
Requires executable environments for all candidate tools (APIs must be callable during selection)
Experiments limited to specific domains (QA, Medical, Travel) and CodeAct/ReAct frameworks

Reproducibility

The paper does not explicitly provide a link to a code repository. It mentions reusing 'smolagents' versions of benchmarks (GAIA, SimpleQA) and using the 'CodeAct' framework. The inventory construction (120 tools) is described in Appendix A.4. The exact prompts for the composer are in Appendix A.2.

📊 Experiments & Results

Evaluation Setup

Task-based evaluation where an agent must solve problems using a composed set of tools/sub-agents

Benchmarks:

GAIA: General AI Assistants (reasoning, tool use)
SimpleQA: Factuality evaluation (short answers)
MedQA: Clinical knowledge (USMLE style)
MAC Benchmarking Dataset: Multi-agent collaboration (Travel, Mortgage domains)

Metrics:

Success Rate
Component Cost ($)
Pareto Frontier position
Statistical methodology: Not explicitly reported in the paper

Key Results

Multi-agent experiments demonstrate that the Online Knapsack composer significantly outperforms baselines when selecting from a large inventory of agents.
Single-agent experiments show consistent improvements over retrieval baselines across multiple datasets.

Benchmark	Metric	Baseline	This Paper	Δ
GAIA/MedQA/SimpleQA	Cost-adjusted Performance	Not reported in the paper	Not reported in the paper	up to +80%

Experiment Figures

Pareto frontier plots for Single-Agent Experiments (Success Rate vs. Budget Spent) on GAIA, SimpleQA, and MedQA.

Results for Multi-agent experiments (Success Rate vs Budget) on Travel and Mortgage domains.

Main Takeaways

Online Knapsack Composer consistently lies on the Pareto frontier, offering the best trade-off between success rate and cost across all datasets.
Pure retrieval approaches perform poorly because semantic descriptions often fail to capture the actual executable utility of a tool.
The method scales well to large inventories (100+ agents), where simple 'Identity' (using all available agents) fails due to the complexity of delegation.
Combining Online Knapsack selection with prompt optimization (AvaTaR) yields the highest overall performance in the $30 budget setting.

📚 Prerequisite Knowledge

Prerequisites

Knapsack Problem (Combinatorial Optimization)
Retrieval-Augmented Generation (RAG)
Agentic Workflows (Tool use, planning)

Key Terms

Knapsack Problem: An optimization problem where one must select a set of items to maximize total value without exceeding a weight (or budget) limit
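For contrast with the online setting used in the paper, here is a minimal sketch of the classical offline 0/1 knapsack, where every item's value is known up front. Integer costs and the toy items are assumptions made for the example.

```python
def knapsack(items, budget):
    """Offline 0/1 knapsack by dynamic programming.
    items: list of (name, value, integer cost); budget: integer.
    Returns (best total value, tuple of chosen names)."""
    # best[b] holds the best (value, chosen) achievable with cost <= b.
    best = [(0, ()) for _ in range(budget + 1)]
    for name, value, cost in items:
        # Iterate budgets downward so each item is used at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + value
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + (name,))
    return best[budget]

items = [("a", 6, 5), ("b", 5, 4), ("c", 5, 3), ("d", 2, 1)]
best_value, chosen = knapsack(items, budget=8)
print(best_value, chosen)
```

The online variant differs in that items arrive one at a time and each decision is final, so a table like `best` cannot be filled in with hindsight.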

ZCL Algorithm: An online algorithm for the knapsack problem that uses dynamic thresholds based on remaining capacity to decide whether to accept an incoming item

CodeAct: A framework where LLM agents generate executable code (e.g., Python) to perform actions and call tools, rather than outputting structured JSON

Pareto frontier: The set of solutions where no improvement can be made to one objective (e.g., accuracy) without worsening another (e.g., cost)
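A short sketch of how a cost-versus-success Pareto frontier can be extracted from a set of configurations. The labels and numbers are invented for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """points: list of (label, cost, success_rate).
    Returns labels of points not dominated by any cheaper-or-equal
    point with strictly higher success, in order of increasing cost."""
    frontier = []
    best_success = float("-inf")
    # Sort by cost ascending; at equal cost, higher success comes first.
    for label, cost, success in sorted(points, key=lambda p: (p[1], -p[2])):
        if success > best_success:
            frontier.append(label)
            best_success = success
    return frontier

points = [
    ("A", 1.0, 0.3),
    ("B", 2.0, 0.5),
    ("C", 3.0, 0.4),  # dominated by B: costs more, succeeds less
    ("D", 4.0, 0.8),
    ("E", 4.0, 0.6),  # dominated by D: same cost, succeeds less
]
frontier = pareto_frontier(points)
print(frontier)
```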

ADAS: Automated Design of Agentic Systems, the broader field of algorithmically creating or configuring agent architectures

Sandboxing: Testing a software component (or agent tool) in an isolated environment to verify its behavior safely before deployment