AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

📝 Paper Summary

Agent Recommendation, Agentic Information Retrieval

AgentSelect standardizes heterogeneous evaluations into a unified benchmark for recommending deployable agent configurations (model + tools) based on natural language queries.

Core Problem

The ecosystem lacks a principled way to choose among an exploding space of LLMs and tools; existing benchmarks evaluate components in isolation rather than as deployable configurations.

Why it matters:

Practitioners face a 'jungle of configurations' when building agents, needing to select backbone models and toolsets without guidance
Current evaluation artifacts (leaderboards) are fragmented and diagnostic, failing to provide the query-conditioned supervision needed to train recommenders
Popularity-based recommendation methods fail in the agent domain due to the shift from dense reuse to long-tail, near one-off supervision

Concrete Example: A user wants an agent to 'analyze stock trends and plot a graph'. While separate benchmarks rank LLMs on math and tools on API usage, no single source tells the user which specific combination (e.g., GPT-4 + Matplotlib + YahooFinance) is best for that specific narrative query.

Key Novelty

Unified Narrative Query-to-Agent Recommendation Benchmark

Formalizes agent recommendation as ranking capability profiles (M, T) composed of a backbone Model and Toolset
Converts heterogeneous evaluation artifacts (LLM leaderboards, tool benchmarks) into a standardized positive-only interaction dataset
Synthesizes compositional agents for realistic tasks by retrieving and coupling compatible models and tools, creating pseudo-positive supervision where real data is scarce

Architecture

Overview of the AgentSelect framework, illustrating the pipeline from benchmark construction to recommender training and deployment.

Evaluation Highlights

Constructed a large-scale benchmark with 111,179 queries and 107,721 deployable agents from 40+ sources
Unified 251,103 positive-only query-agent interaction records across LLM-only, toolkit-only, and compositional settings
Aggregated tool usage covering 12,099 unique tools, significantly expanding beyond single-source tool benchmarks like ToolHop (622 tools)

Breakthrough Assessment

9/10

Establishes the first unified infrastructure for the 'last mile' of agent deployment: selection. By converting fragmented leaderboards into training data, it enables a new class of meta-agent systems.

⚙️ Technical Details

Problem Definition

Setting: Query-to-Agent Recommendation (Ranking)

Inputs: Natural language query Q

Outputs: Ranked list of top-k agents from catalog A, where each agent is a tuple (M, T)

Pipeline Flow

User Input: Narrative Query Q
Recommender System (Ranker)
Output: List of Deployable Agents (YAMLs)

System Modules

Recommender System

Rank candidate agents based on estimated utility s(Q, A)

Model or implementation: Not explicitly detailed in text (Benchmark paper)
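
The ranking step can be illustrated with a minimal sketch. Since the summary does not specify the recommender, the scorer below is a placeholder (Jaccard token overlap) standing in for s(Q, A); the `Agent` class, `overlap_score`, `recommend`, and the `calculator` and `websearch` tool names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Agent:
    """Capability profile (M, T): a backbone model plus a toolset."""
    model: str
    tools: Tuple[str, ...]

    def text(self) -> str:
        # Flatten the profile into a string that a content-based scorer can read.
        return " ".join((self.model,) + self.tools).lower()


def overlap_score(query: str, agent: Agent) -> float:
    """Placeholder for s(Q, A): Jaccard overlap of query and profile tokens."""
    q = set(query.lower().split())
    a = set(agent.text().split())
    union = q | a
    return len(q & a) / len(union) if union else 0.0


def recommend(query: str, catalog: List[Agent], k: int,
              score: Callable[[str, Agent], float] = overlap_score) -> List[Agent]:
    """Return the k agents with the highest estimated utility for the query."""
    return sorted(catalog, key=lambda a: score(query, a), reverse=True)[:k]


# Illustrative catalog; entries are assumptions, not benchmark data.
catalog = [
    Agent("gpt-4", ("matplotlib", "yahoofinance")),
    Agent("llama-3", ("calculator",)),
    Agent("gpt-4", ("websearch",)),
]
top = recommend("plot stock trends with matplotlib and yahoofinance", catalog, k=2)
```

Any learned scorer with the same `(query, agent) -> float` signature could be passed as `score` without changing the ranking code.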

Novel Architectural Elements

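Representation of all candidates as executable YAML specifications (M, T, C), where C is the configuration component (prompts, temperature), enabling direct deployment of recommended results; the recommendation target itself is the (M, T) profile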

Modeling

Base Model: Varies (Benchmark aggregates many models including Llama-3, GPT-4, etc.)

Training Method: Data Construction Pipeline (not model training)

Training Data:

Part I (LLM-only): 23,073 queries from Open LLM Leaderboard, MMLU, BBH
Part II (Toolkit-only): 76,197 queries from ToolBench, ToolHop, APIBank
Part III (Compositional): 11,909 queries synthesized via retrieval-based composition
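
The three part sizes can be checked against the reported total of 111,179 queries. The short script below sums them and prints each part's share; the variable names are illustrative.

```python
# Query counts per part, as reported in the summary.
part_queries = {
    "Part I (LLM-only)": 23_073,
    "Part II (Toolkit-only)": 76_197,
    "Part III (Compositional)": 11_909,
}

total = sum(part_queries.values())

# Fraction of all queries contributed by each part.
shares = {name: count / total for name, count in part_queries.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```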

Comparison to Prior Work

vs. Open LLM Leaderboard: Recommends full configurations (M, T) rather than just scoring models (M)
vs. ToolBench: Unifies diverse tool benchmarks into a single ranking format; adds backbone selection
vs. RouterBench: Addresses the combinatorial space of tools + models, not just model selection
vs. ToolRet [not cited in paper]: Focuses on end-to-end agent recommendation (M+T) rather than just retrieving relevant tools (T) for a fixed model

Limitations

Relies on synthesized 'pseudo-positive' interactions for the compositional part, which may not perfectly reflect human preference
Excludes the configuration component 'C' (prompts, temperature) from the recommendation target due to lack of standardization
Performance depends on the quality of underlying source benchmarks (MMLU, ToolBench, etc.)

Reproducibility

Code: https://github.com/Ancientshi/AgentMatch

📊 Experiments & Results

Evaluation Setup

Data Construction Analysis (Paper is a Benchmark Release)

Benchmarks:

AgentSelect (Query-to-Agent Recommendation) [New]

Metrics:

Number of Queries
Number of Agents
Number of Interactions
Tool Diversity

Statistical methodology: Not explicitly reported in the paper

Key Results

The following statistics describe the scale and composition of the constructed AgentSelect benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
AgentSelect	Total Queries	0	111,179	111,179
AgentSelect	Total Agents	0	107,721	107,721
AgentSelect	Total Interactions	0	251,103	251,103
AgentSelect (Part I)	Sparsity (Agents)	0	231	231
AgentSelect (Part III)	Sparsity (Agents)	0	59,541	59,541
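
To make the sparsity of these totals concrete, the arithmetic below derives two quantities from the reported counts: the average number of recorded positives per query, and the observed fraction of the full query-by-agent matrix. These derived figures are not reported in the summary; they follow directly from the three totals.

```python
# Totals as reported for the AgentSelect benchmark.
queries, agents, interactions = 111_179, 107_721, 251_103

# Average number of positive agents recorded per query.
per_query = interactions / queries

# Fraction of the full query x agent matrix that is observed.
density = interactions / (queries * agents)

print(f"positives per query: {per_query:.2f}")
print(f"matrix density: {density:.2e}")
```

With these counts, each query has a little over two recorded positives on average, against a catalog of more than one hundred thousand agents.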

Experiment Figures

Statistics and topology of the AgentSelect benchmark

Main Takeaways

Agent ecosystem evaluation exhibits a regime shift: dense head reuse in LLM-only tasks vs. long-tail, near one-off supervision in compositional tasks
Traditional popularity-based Collaborative Filtering (CF) and GNN methods become fragile in this long-tail regime
Content-aware capability matching is essential for effective agent recommendation given the sparsity of agent reuse
Synthesized compositional interactions (Part III) are learnable and improve coverage over realistic (M, T) compositions compared to using only component-level data

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM Agents (Backbone + Tools)
Recommender Systems (Implicit Feedback, Ranking)
Information Retrieval (Query-Document Matching)

Key Terms

Capability Profile (M, T): The abstraction of an agent used for recommendation, consisting of a Backbone Model (M) and a set of Tools (T)

YAML configuration: A human-readable data serialization format used here to store executable agent specifications

Positive-only supervision: Training data where only successful or high-quality interactions are recorded, implying preference without explicit negative labels
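
One common way to train a ranker on positive-only data is to sample unobserved agents as negatives. The sketch below assumes uniform random sampling, which the summary does not attribute to the paper; `sample_negatives` and the agent and query identifiers are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_negatives(positives: Dict[str, Set[str]], catalog: List[str],
                     n_neg: int, seed: int = 0) -> List[Tuple[str, str, str]]:
    """Build (query, positive_agent, negative_agent) training triples.

    Negatives are drawn uniformly from agents not recorded as positive for
    the query. They are unobserved pairs, not confirmed failures.
    """
    rng = random.Random(seed)
    triples = []
    for query, pos_agents in positives.items():
        candidates = [a for a in catalog if a not in pos_agents]
        for pos in sorted(pos_agents):
            # Cap the sample size when few non-positive agents exist.
            for neg in rng.sample(candidates, min(n_neg, len(candidates))):
                triples.append((query, pos, neg))
    return triples


# Illustrative data only.
catalog = ["agent_a", "agent_b", "agent_c", "agent_d"]
positives = {"q1": {"agent_a"}, "q2": {"agent_b", "agent_c"}}
triples = sample_negatives(positives, catalog, n_neg=2)
```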

Implicit feedback: Inferences about user preference drawn from observed actions (like selection or successful execution) rather than explicit ratings

Narrative query: A free-form natural language description of a task or intent (e.g., 'Help me plan a trip to Tokyo') as opposed to a keyword search

Compositional Agents: Agents constructed by explicitly combining a specific LLM backbone with a specific set of tools to solve a complex task

TwoTower model: A neural network architecture for retrieval where query and item are processed by separate encoders, and their similarity is computed via dot product [implied context from abstract]
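
A minimal sketch of the two-tower scoring pattern follows. The encoders here are stand-ins (normalised bag-of-words over a shared vocabulary) rather than trained networks, and every function name is hypothetical; the point is only that agent vectors are computed once and each query is scored against them with a dot product.

```python
import math
from typing import Dict, List, Sequence


def build_vocab(texts: Sequence[str]) -> Dict[str, int]:
    """Map every token seen in the texts to a fixed vector index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def embed(text: str, vocab: Dict[str, int]) -> List[float]:
    """Stand-in encoder: L2-normalised bag-of-words vector."""
    vec = [0.0] * len(vocab)
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def query_tower(query: str, vocab: Dict[str, int]) -> List[float]:
    return embed(query, vocab)


def agent_tower(model: str, tools: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    return embed(" ".join([model, *tools]), vocab)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


# Illustrative agents and query; not benchmark data.
agents = [("gpt-4", ["matplotlib", "yahoofinance"]), ("llama-3", ["calculator"])]
query = "plot stock trends with matplotlib and yahoofinance"
vocab = build_vocab([query] + [" ".join([m, *t]) for m, t in agents])

# Agent vectors are computed once, independently of any query.
agent_vecs = [agent_tower(m, t, vocab) for m, t in agents]
q_vec = query_tower(query, vocab)
scores = [dot(q_vec, v) for v in agent_vecs]
```

Because the two sides never interact before the dot product, the agent vectors can be cached and reused across queries.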

Coreset: A small, weighted subset of a dataset that approximates the properties of the full dataset, used here to select representative queries