🔍 Retrieval Models Aren’t Tool-Savvy:
Benchmarking Tool Retrieval for Large Language Models

Brief Introduction

Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models on tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval capability of IR models.


In this work, we focus on two research questions: (i) Are existing information retrieval models good at tool retrieval? and (ii) To what extent does tool retrieval quality affect the downstream task pass rate?

The proposed benchmark: ToolRet

To comprehensively evaluate IR models across various tool retrieval scenarios, we introduce ToolRet, the first large-scale tool retrieval benchmark, comprising 7.6k diverse retrieval tasks and a corpus of 43k tools collected from existing dataset resources. We outline the three key steps in building ToolRet below.

  • Data collection We collect query-tool datasets from the following sources: (i) tool-use agent benchmarks from research papers published at AI conferences such as ACL and NeurIPS; (ii) related conference resources such as AppBench (EMNLP) and ToolLens (CIKM); and (iii) other publicly available datasets from the open-source community, e.g., HuggingFace. The collected data is carefully curated to cover a wide range of practical tool requirements, spanning diverse types of tool documentation, domains, and query lengths. We then standardize all collected tasks into a retrieval format similar to MTEB, where each task contains a query and its target tools (i.e., relevance labels), as illustrated in the sketch after this list.
  • Data sampling After collecting the datasets, we observe size imbalances across them. Moreover, some datasets are extremely large and contain substantial redundant content, making comprehensive model evaluation both inefficient and unnecessary. We therefore streamline them through data sampling while preserving evaluation integrity.
  • Instruction construction To support the instructional retrieval setting of our benchmark, we introduce a target-aware strategy that supplements each query with an instruction generated by a powerful LLM (i.e., gpt-4o).
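
The sketch below illustrates what a standardized ToolRet-style retrieval task could look like after these steps. The field names and the example values are assumptions for exposition, not the benchmark's exact schema.

```python
# Illustrative sketch of a standardized retrieval task (field names are
# assumptions for exposition, not ToolRet's exact schema).
from dataclasses import dataclass, field


@dataclass
class Tool:
    tool_id: str
    documentation: str               # name, description, parameters, etc.


@dataclass
class RetrievalTask:
    query: str                       # user request describing the tool need
    instruction: str = ""            # optional gpt-4o-generated retrieval instruction
    target_tool_ids: list[str] = field(default_factory=list)  # relevance labels


# Example task in the spirit of MTEB-style retrieval evaluation.
task = RetrievalTask(
    query="Convert 100 USD to EUR using the latest exchange rate.",
    instruction="Retrieve APIs that perform currency conversion.",
    target_tool_ids=["currency_exchange_api"],
)
```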

Statistics of ToolRet.

Experimental Results

We evaluate a wide range of retrieval models on ToolRet. Our evaluation covers two main settings: `w/ inst.`, where each query is paired with its instruction, and `w/o inst.`, where the query is used alone.
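
A minimal sketch of how a dense retriever could be scored on a ToolRet-style task under the two settings. The model name, tool corpus, and data layout here are illustrative assumptions, not the benchmark's evaluation code.

```python
# Minimal sketch: score a dense retriever on a ToolRet-style task.
# Model choice, corpus, and metric cutoff are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any dense retriever works here

tools = {
    "currency_exchange_api": "Converts an amount between two currencies ...",
    "weather_api": "Returns the current weather for a given city ...",
}
query = "Convert 100 USD to EUR using the latest exchange rate."
instruction = "Retrieve APIs that perform currency conversion."
relevant = {"currency_exchange_api"}


def retrieve(query_text: str, k: int = 10) -> list[str]:
    """Rank tools by cosine similarity between the query and tool documentation."""
    q = model.encode([query_text])[0]
    docs = model.encode(list(tools.values()))
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    ranked = np.argsort(-scores)[:k]
    return [list(tools.keys())[i] for i in ranked]


# `w/o inst.`: query only; `w/ inst.`: instruction prepended to the query.
for setting, text in [("w/o inst.", query), ("w/ inst.", f"{instruction} {query}")]:
    ranking = retrieve(text)
    recall = len(set(ranking) & relevant) / len(relevant)
    print(f"{setting}: Recall@10 = {recall:.2f}")
```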

Evaluation with non-instructional retrieval setting.
Evaluation with instructional retrieval setting.

Impact of retrieval quality on downstream tasks

We also evaluate and analyze the impact of retrieval quality on the downstream task-solving pass rate. We conduct experiments on ToolBench where the tool-use LLM is paired either with toolsets retrieved by various retrieval models or with the toolset pre-annotated by the official dataset.
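
The sketch below shows how such a retrieve-then-act experiment can be wired up: retrieve the top-k tools for each task, hand them to a tool-use LLM agent, and check whether the task passes. The helper callables (`retriever`, `agent`, `checker`) are hypothetical placeholders, not ToolBench's actual interfaces.

```python
# Sketch of the retrieve-then-act pipeline used to measure downstream pass rate.
# The retriever, agent, and checker are hypothetical placeholders standing in
# for a retrieval model, a tool-use LLM agent, and a task-success checker.
from typing import Callable


def pass_rate(
    tasks: list[dict],
    retriever: Callable[[str, int], list[dict]],
    agent: Callable[[str, list[dict]], str],
    checker: Callable[[dict, str], bool],
    k: int = 5,
) -> float:
    """Fraction of tasks solved when the agent only sees the retrieved top-k tools."""
    solved = 0
    for task in tasks:
        candidate_tools = retriever(task["query"], k)   # retrieved toolset
        answer = agent(task["query"], candidate_tools)  # tool-use LLM acts with them
        solved += checker(task, answer)
    return solved / len(tasks)


# Comparing pass_rate(...) across different retrievers (or against the
# pre-annotated toolset) isolates how retrieval quality affects task success.
```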
