API-Bank: A benchmark for tool-augmented LLMs, comprising an evaluation system with 73 APIs and a training set of 1,888 dialogues.
Lynx: A tool-augmented LLM fine-tuned from Alpaca-7B using the API-Bank training dataset.
Multi-agent: A data generation method proposed in this paper, in which five specialized LLM agents generate domains, APIs, queries, and dialogues.
Plan+Retrieve+Call: The most complex evaluation setting, in which the model must plan a sequence of steps, search for APIs it does not yet know, and call them.
ROUGE-L: A metric, based on the longest common subsequence, used to score the final natural-language response against a reference response.
Alpaca: An instruction-tuned version of the LLaMA-7B model.
Hallucination: In this context, an error in which the model generates a call to an API that does not exist or was not retrieved.
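
To make the ROUGE-L entry above concrete, the following is a minimal sketch of a sentence-level ROUGE-L F-score built on the longest common subsequence (LCS). It is an illustration only, not the scoring code used for API-Bank: the function names `lcs_length` and `rouge_l`, the lowercased whitespace tokenization, and the default `beta=1.0` (equal weight on precision and recall) are all assumptions made here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Sentence-level ROUGE-L F-score (illustrative sketch).

    Assumptions: lowercased whitespace tokenization; beta=1.0 gives the
    harmonic mean of LCS precision and LCS recall.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    # Hypothetical model response and reference, for illustration only.
    response = "Your meeting is booked for 3 pm tomorrow"
    reference = "The meeting has been booked for 3 pm tomorrow"
    print(round(rouge_l(response, reference), 3))
```

Because the LCS respects word order but does not require the matched words to be contiguous, a response that paraphrases the reference while keeping its key tokens in sequence still receives partial credit.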