AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Researchers from Carnegie Mellon University and the University of Washington have introduced AgentSelect, a novel benchmark designed to solve a critical bottleneck in the AI agent ecosystem: the lack of a systematic, data-driven method for selecting the optimal agent configuration for a given task. This work reframes agent selection as a recommendation problem, providing the first large-scale, unified dataset to train and evaluate models that can match user queries with the most effective combination of a backbone LLM and a set of tools.

Key Takeaways

  • AgentSelect is a new benchmark comprising 111,179 queries, 107,721 deployable agents, and 251,103 interaction records aggregated from over 40 existing sources.
  • It addresses the critical gap in evaluating end-to-end agent configurations (model + toolkit) rather than components in isolation, framing the problem as query-to-agent recommendation.
  • The dataset spans LLM-only, toolkit-only, and compositional agents, systematically converting heterogeneous evaluation data into unified, positive-only interaction records.
  • Analysis reveals a "long-tail" data regime where traditional collaborative filtering methods fail, necessitating content-aware, capability-matching approaches.
  • Models trained on AgentSelect successfully transfer to a real-world agent marketplace (MuleRun), demonstrating practical utility and improved recommendation coverage.

Introducing the AgentSelect Benchmark

The core innovation of AgentSelect is its systematic aggregation and reformatting of disparate evaluation data. The AI research community is flooded with specialized benchmarks—from coding (HumanEval) and reasoning (MMLU) to tool-use (ToolBench)—but each evaluates a narrow slice of capability. AgentSelect's creators have ingested outputs from over 40 such sources, including popular leaderboards and agent frameworks, to build a massive corpus of 251,103 "interactions." Each interaction links a specific user query (e.g., "analyze this CSV file") to a successful agent configuration that solved it, defined by its backbone LLM (like GPT-4 or Claude 3) and the specific tools it employed.
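To make the shape of these records concrete, here is a minimal sketch in Python of what a single interaction might look like. The field names are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass, field

# Hypothetical shape of one AgentSelect interaction record; field names
# are assumptions for illustration, not the paper's actual schema.
@dataclass
class InteractionRecord:
    query: str                    # natural-language task description
    backbone_llm: str             # model powering the successful agent
    tools: list[str] = field(default_factory=list)  # toolkit the agent used
    source: str = ""              # which of the 40+ aggregated sources it came from

record = InteractionRecord(
    query="analyze this CSV file and summarize the trends",
    backbone_llm="gpt-4",
    tools=["code_interpreter", "file_reader"],
    source="ToolBench",
)
```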

This creates a unified, query-conditioned dataset where the "label" is a proven-effective agent, not just a performance score. The benchmark's scale is significant: with over 107,000 unique agent configurations drawn from the compositional space of models and tools, it captures the explosive complexity facing developers. The dataset is structured into three parts: LLM-only interactions (model capabilities alone), toolkit-only interactions (tool efficacy), and a third, novel part of synthesized compositional interactions. This final part is key: it uses logical inference to create plausible, successful agent compositions from base interactions, dramatically expanding coverage of the realistic configuration space.
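As a toy illustration of the synthesis idea, the sketch below pairs a model and a toolkit that each succeeded independently on the same query category and emits a composed agent. The join rule here is an assumed simplification; the paper's actual logical-inference procedure is not reproduced.

```python
# Toy compositional synthesis: join LLM-only and toolkit-only successes on a
# shared query category. The join rule is an assumed simplification of the
# paper's logical-inference step; all records below are invented.
llm_only = [("gpt-4", "data_analysis"), ("llama-3-70b", "web_research")]
toolkit_only = [(["pandas_tool"], "data_analysis"), (["search_api"], "web_research")]

def synthesize(llm_records, tool_records):
    composed = []
    for model, llm_cat in llm_records:
        for tools, tool_cat in tool_records:
            if llm_cat == tool_cat:  # both components proven on this category
                composed.append({"llm": model, "tools": tools, "category": llm_cat})
    return composed

for agent in synthesize(llm_only, toolkit_only):
    print(agent)
```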

Industry Context & Analysis

AgentSelect arrives at a pivotal moment. The market for AI agents is fragmenting rapidly, with platforms like OpenAI's GPTs, LangChain, AutoGPT, and CrewAI offering myriad ways to assemble models and tools. Yet, selection remains an art, not a science. Unlike traditional model selection, which can rely on aggregated leaderboard scores (e.g., GPT-4 consistently topping MMLU at ~86.4%), agent selection is a multi-dimensional optimization problem. An agent's performance is non-linear; a top-tier model like Claude 3 Opus paired with a poorly chosen tool can fail, while a smaller, cheaper model like Llama 3 70B with the perfect specialized tool can succeed brilliantly.

The research uncovers a fundamental shift in the data landscape that invalidates common approaches. Analysis shows a "regime shift from dense head reuse to long-tail, near one-off supervision." In simpler terms, while a few popular agent configurations (the "head") are used frequently, the vast majority of effective configurations are unique or rarely repeated (the "long tail"). This finding has direct commercial implications: popularity-based recommendation systems of the kind Netflix or Amazon uses, built on collaborative filtering (CF) or graph neural networks (GNNs), become fragile and ineffective. You cannot recommend a niche, perfect-fit agent based on what "similar users" chose, because there may be no similar users.
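The long-tail effect is easy to check on any interaction log. The snippet below, on synthetic counts, measures the fraction of agent configurations that appear only once, which is the statistic that starves collaborative filtering of co-occurrence signal.

```python
from collections import Counter

# Synthetic interaction log: two "head" agents used heavily, plus a long
# tail of configurations that each appear exactly once.
log = ["agent_a"] * 50 + ["agent_b"] * 20 + [f"agent_{i}" for i in range(1000)]

counts = Counter(log)
one_off = sum(1 for c in counts.values() if c == 1)
print(f"{one_off / len(counts):.1%} of configurations are one-off")  # ~99.8%
```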

Instead, the path forward is content-aware capability matching. The system must understand the semantic content of the user's query and the functional "capability profile" of an agent (e.g., "can call the Wikipedia API," "excels at mathematical reasoning"). This mirrors the broader industry trend toward retrieval-augmented generation (RAG), but applied to tool retrieval and orchestration rather than knowledge lookup. The demonstrated transfer to MuleRun, a real agent marketplace, validates the approach: models trained on AgentSelect's principled data generalize to unseen, real-world catalogs, moving the work beyond an academic exercise to deployable technology.
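Here is a minimal sketch of the idea, assuming agents carry short textual capability profiles. A real system would rank candidates by embedding similarity from a trained text encoder; this dependency-free toy substitutes token overlap so the ranking logic stays visible.

```python
# Rank agents by how well their capability profile matches the query.
# Real systems would use learned embeddings and cosine similarity; Jaccard
# token overlap stands in here. Agent profiles below are illustrative.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

agents = {
    "wiki-researcher": "calls the wikipedia api and summarizes articles",
    "math-solver": "excels at mathematical reasoning and symbolic algebra",
}

query = "solve this symbolic algebra problem with careful reasoning"
best = max(agents, key=lambda name: similarity(query, agents[name]))
print(best)  # math-solver: its profile shares 'symbolic', 'algebra', 'reasoning'
```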

What This Means Going Forward

The immediate beneficiaries of this research are AI platform builders and enterprise developers. Companies building agent-hosting marketplaces or internal automation platforms can use the AgentSelect methodology to build sophisticated "agent recommender systems," drastically reducing the trial-and-error time for users. This could become a key differentiator in the competitive landscape between platforms like Microsoft's Copilot Studio, Google's Vertex AI Agent Builder, and open-source frameworks.

For the research community, AgentSelect provides the first reproducible foundation for a new subfield: agent-oriented learning to rank. It will enable apples-to-apples comparisons of different recommendation algorithms, moving beyond anecdotal evidence. Future work will likely focus on improving the synthesis of compositional data and incorporating cost and latency profiles into the recommendation, making it economically grounded.
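One plausible way such economic grounding could work (an assumption about future systems, not something AgentSelect currently includes) is to penalize a capability-match score by cost and latency with tunable weights:

```python
# Hypothetical economically grounded ranking: trade match quality against
# dollar cost and latency. All numbers and weights are invented.
def economic_score(match, cost_usd, latency_s, w_cost=0.3, w_lat=0.1):
    return match - w_cost * cost_usd - w_lat * latency_s

candidates = {
    "frontier-agent": (0.92, 0.40, 6.0),  # (match, $ per task, seconds)
    "small-agent":    (0.85, 0.02, 1.5),
}
best = max(candidates, key=lambda a: economic_score(*candidates[a]))
print(best)  # small-agent wins once cost and latency are priced in
```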

The most significant change will be the gradual codification of agent selection. As benchmarks drive progress, we can expect the emergence of "agent selection models" that sit atop model hubs and tool registries. Watch for integrations with popular platforms and whether major cloud providers (AWS, GCP, Azure) begin to offer agent recommendation as a managed service. The long-term vision is a future where describing a task in plain language automatically summons and orchestrates the perfect ensemble of AI models and digital tools to execute it, with AgentSelect providing the critical training data to make that vision reliable and scalable.
