ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models

📝 Paper Summary

Tool-use post-training, Generative tool selection

ToolWeaver replaces unique tool tokens with hierarchical code sequences generated via collaborative-aware quantization, enabling LLMs to scale to thousands of tools while learning collaborative relationships from dense code co-occurrences.

Core Problem

Assigning a unique token to every tool causes vocabulary explosion and creates a semantic bottleneck, as the model must infer relationships from the sparse co-occurrence of isolated IDs.

Why it matters:

Linear vocabulary growth (one token per tool) is unsustainable for massive tool libraries (e.g., 47,000 tools), requiring huge memory and disrupting pre-trained knowledge
Sparse co-occurrence of unique IDs prevents models from learning that certain tools (e.g., Weather and Air Quality) should work together, leading to incomplete reasoning
Existing retrieval methods are complex and disconnected from the LLM's reasoning, while current generative methods fail to generalize to new tools without retraining embeddings

Concrete Example: For the query 'is it a good day to take my kid to the park?', a model needs both 'Realtime Weather' and 'Air Quality'. If these are represented as isolated tokens <Tool_42> and <Tool_99> that rarely appear together, the model may select Weather but miss Air Quality. ToolWeaver groups them under a shared parent code (e.g., <Conditions>) so the model learns the category association.

Key Novelty

Collaborative-Aware Structured Tokenization

Represents each tool not as a single ID, but as a sequence of discrete codes (e.g., <T1_1><T2_1>) from hierarchical codebooks, reducing vocabulary expansion from linear to logarithmic
Injects collaborative signals (tool co-usage patterns) directly into the tokenization process using a Graph Laplacian regularizer, forcing functionally related tools to share code prefixes
Solves the 'index collision' problem in quantization using Optimal Transport (Sinkhorn-Knopp) to ensure every tool has a unique code sequence without breaking semantic structure
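The linear-versus-logarithmic claim above can be checked with a little arithmetic. The sketch below assumes one separate codebook per level, each of size K, so that L levels can address up to K^L tools while adding only K * L new tokens. The codebook size of 256 is a hypothetical value chosen for illustration, not a figure from the paper, and a real system may need more capacity than this minimum to keep every code sequence unique.

```python
def atomic_vocab_growth(num_tools: int) -> int:
    """One new token per tool: growth is linear in the number of tools."""
    return num_tools


def min_levels(num_tools: int, codebook_size: int) -> int:
    """Smallest number of levels L such that codebook_size ** L >= num_tools."""
    levels = 1
    while codebook_size ** levels < num_tools:
        levels += 1
    return levels


def hierarchical_vocab_growth(num_tools: int, codebook_size: int) -> int:
    """Each level has its own codebook, so new tokens = levels * codebook_size."""
    return min_levels(num_tools, codebook_size) * codebook_size


NUM_TOOLS = 47_000    # approximate ToolBench size quoted in this summary
CODEBOOK_SIZE = 256   # hypothetical value, not taken from the paper

print(atomic_vocab_growth(NUM_TOOLS))                       # 47000
print(min_levels(NUM_TOOLS, CODEBOOK_SIZE))                 # 2
print(hierarchical_vocab_growth(NUM_TOOLS, CODEBOOK_SIZE))  # 512
```

With K held fixed, the number of levels grows as the ceiling of log_K(N), which is where the logarithmic vocabulary growth comes from.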

Architecture

The structured tokenization process of ToolWeaver. It visualizes how a tool is mapped to a sequence of codes from multiple codebooks.

Breakthrough Assessment

8/10

Addresses the fundamental scalability limit of generative tool use (vocabulary explosion) while simultaneously solving the semantic sparsity problem. The shift from atomic to compositional tool tokens is a significant architectural advance.

⚙️ Technical Details

Problem Definition

Setting: Generative tool selection and execution where tools are treated as tokens in the language generation process

Inputs: User query q and a large tool corpus D

Outputs: A sequence of tool identifiers (hierarchical codes), parameters, and execution triggers

Pipeline Flow

Pre-computation: Tool Documentation -> Semantic Embedding -> Collaborative-Aware Quantization -> Hierarchical Codes
Inference: Query -> LLM -> Generates Code Sequence (guided by Trie) -> Tool Lookup -> Execution

System Modules

Semantic Encoder (Representation Learning, Pre-computation)

Encode tool name and description into a dense vector

Model or implementation: Pretrained text encoder (e.g., BERT-based)

Collaborative Tokenizer (Representation Learning, Pre-computation)

Quantize embeddings into hierarchical discrete codes using collaborative signals

Model or implementation: RQ-VAE with Graph Laplacian Regularization
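To make the "hierarchical discrete codes" concrete, here is a minimal sketch of the residual quantization step at the heart of an RQ-VAE: each level picks the nearest codebook entry, subtracts it, and passes the residual to the next level. It omits the encoder, decoder, training, and the Graph Laplacian term entirely. The two tiny codebooks and the 2-D vectors are made up for illustration and are not from the paper.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist2(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))


def residual_quantize(vec, codebooks):
    """Return one code index per level, plus the final unexplained residual."""
    codes = []
    residual = list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # The next level only sees what this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Hypothetical codebooks: level 1 is coarse, level 2 refines the residual.
CODEBOOKS = [
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]

codes_a, _ = residual_quantize([10.9, 10.1], CODEBOOKS)
codes_b, _ = residual_quantize([10.2, 11.0], CODEBOOKS)
codes_c, _ = residual_quantize([0.1, 0.9], CODEBOOKS)
print(codes_a, codes_b, codes_c)  # [1, 1] [1, 0] [0, 0]
```

The first two vectors lie close together and therefore share the level-1 code while differing at level 2. This is the prefix-sharing behaviour that the collaborative regularizer is meant to encourage for tools that are used together.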

Generator (Inference)

Generate reasoning, tool code sequences, and parameters

Model or implementation: Llama-3-8B (fine-tuned)

Constrained Decoder (Inference)

Ensure generated tokens form valid tool identifiers

Model or implementation: Prefix Tree (Trie) Constrained Beam Search
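The role of the trie can be sketched as follows: at each step, only tokens that extend the current prefix toward a valid tool code sequence are allowed. For brevity this sketch uses greedy decoding rather than the beam search named above, and a fixed score table stands in for the LLM's next-token scores. The token names and scores are hypothetical.

```python
def build_trie(sequences):
    """Nested-dict prefix tree over valid tool code sequences."""
    root = {}
    for seq in sequences:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
    return root


def allowed_next(trie, prefix):
    """Tokens that may legally follow `prefix`; empty set if prefix is invalid."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return set()
    return set(node)


def constrained_greedy(trie, score, length):
    """Greedy decoding restricted to trie-valid continuations."""
    prefix = []
    for _ in range(length):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        prefix.append(max(sorted(candidates), key=lambda t: score(prefix, t)))
    return prefix


VALID_TOOLS = [
    ("<T1_1>", "<T2_1>"),
    ("<T1_1>", "<T2_2>"),
    ("<T1_2>", "<T2_1>"),
]
TRIE = build_trie(VALID_TOOLS)

# Stand-in for model scores; unconstrained, it would prefer <T1_2><T2_2>,
# which is not a valid tool.
MOCK_SCORES = {"<T1_1>": 0.1, "<T1_2>": 0.9, "<T2_1>": 0.2, "<T2_2>": 0.8}


def mock_score(prefix, token):
    return MOCK_SCORES[token]


print(constrained_greedy(TRIE, mock_score, length=2))  # ['<T1_2>', '<T2_1>']
```

In a beam-search variant, the same `allowed_next` filter would be applied to each hypothesis in the beam before scoring.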

Novel Architectural Elements

Representation of tools as hierarchical code sequences rather than atomic tokens
Integration of Graph Laplacian loss into the RQ-VAE codebook learning process to enforce collaborative clustering

Modeling

Base Model: Llama-3-8B

Training Method: Generative Alignment (Fine-tuning)

Objective Functions:

Purpose: Train the tokenizer to minimize reconstruction error and align with the collaborative graph.

Formally: L = L_recon + L_quant + λ * L_laplacian, where L_laplacian = Trace(Z^T L_graph Z)

Purpose: Enforce uniform usage of codes to prevent collisions.

Formally: Optimal Transport constraint using Sinkhorn-Knopp algorithm

Purpose: Align LLM to generate codes.

Formally: Standard Next Token Prediction loss on code tokens
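The Laplacian term is easier to read through a standard identity: for a symmetric co-usage matrix A with zero diagonal and L_graph = D - A, Trace(Z^T L_graph Z) equals half the sum over pairs of A_ij * ||z_i - z_j||^2. The sketch below checks this numerically. It assumes Z stacks one representation per tool as rows, which this summary does not state explicitly. The three-tool graph and the vectors are invented for illustration.

```python
def laplacian(adj):
    """Unnormalized graph Laplacian D - A for a symmetric, zero-diagonal A."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def trace_zt_l_z(z, lap):
    """Trace(Z^T L Z), computed column by column of Z."""
    n, d = len(z), len(z[0])
    total = 0.0
    for k in range(d):
        for i in range(n):
            for j in range(n):
                total += z[i][k] * lap[i][j] * z[j][k]
    return total


def pairwise_form(z, adj):
    """0.5 * sum_ij A_ij * ||z_i - z_j||^2, the equivalent pairwise penalty."""
    n = len(z)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += adj[i][j] * sum((a - b) ** 2 for a, b in zip(z[i], z[j]))
    return 0.5 * total


# Hypothetical graph: tools 0 and 1 are co-used, tool 2 is unrelated.
ADJ = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
Z_FAR = [[0.0, 0.0], [3.0, 4.0], [9.0, 9.0]]   # co-used tools far apart
Z_NEAR = [[0.0, 0.0], [0.5, 0.0], [9.0, 9.0]]  # co-used tools close together

LAP = laplacian(ADJ)
print(trace_zt_l_z(Z_FAR, LAP), pairwise_form(Z_FAR, ADJ))    # 25.0 25.0
print(trace_zt_l_z(Z_NEAR, LAP), pairwise_form(Z_NEAR, ADJ))  # 0.25 0.25
```

The penalty depends only on distances between connected tools, so tool 2 can sit anywhere at no cost, while moving the co-used pair closer lowers the loss.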

Training Data:

ToolBench dataset (nearly 47,000 tools)

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolGen: ToolWeaver uses compositional codes (logarithmic vocabulary growth) where ToolGen uses one unique token per tool (linear vocabulary growth); ToolWeaver also explicitly models collaborative signals in tokenization
vs. ToolLLM: ToolWeaver performs end-to-end generative selection, whereas ToolLLM relies on a multi-stage retrieval pipeline
vs. LC-Rec: Incorporates Graph Laplacian directly into quantization for tool co-use, whereas LC-Rec focuses on user-item history [not cited in paper as direct baseline, but methodologically similar]

Limitations

Requires pre-computation of the co-occurrence matrix, which may be sparse or unavailable for entirely new toolsets
Hierarchical codes increase the sequence length (L tokens per tool instead of 1), potentially increasing inference latency slightly
Constraint-based decoding (Trie) is required to prevent generating invalid code combinations

Reproducibility

Code: https://github.com/Fwibo/ToolWeaver

The code is publicly available at the repository above, and the data is ToolBench. Specific hyperparameters for the Llama-3 fine-tuning (batch size, learning rate) are not given in the text summarized here.

📊 Experiments & Results

Evaluation Setup

Tool selection and execution on the ToolBench benchmark

Benchmarks:

ToolBench (Instruction tuning for tool use)

Metrics:

Not explicitly reported in the paper
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

ToolWeaver scales to nearly 47,000 tools with logarithmic vocabulary expansion, avoiding the memory overhead of one-token-per-tool methods.
The method is reported to outperform state-of-the-art baselines (ToolLLM, ToolGen) on complex task completion; specific numeric scores are not available in the text summarized here.
Collaborative-aware tokenization allows the model to infer relationships between tools (e.g., weather and air quality) that rarely appear together in the training data, overcoming the sparsity problem.

📚 Prerequisite Knowledge

Prerequisites

Vector Quantization (specifically RQ-VAE)
Large Language Model (LLM) tokenization
Graph Laplacian regularization

Key Terms

RQ-VAE: Residual-Quantized Variational AutoEncoder, a method to compress vectors into a sequence of discrete codes by recursively quantizing the residual error

codebook: A fixed set of learned vectors (centroids) used in quantization; indices into this set form the discrete 'tokens' for tools

Sinkhorn-Knopp: An algorithm used here to enforce a uniform distribution of tools across codebook entries, preventing multiple tools from collapsing into the same code
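As a rough illustration of how Sinkhorn-Knopp balances code usage, the sketch below runs the generic entropic optimal-transport iteration on a small tool-to-code cost matrix. Each tool carries one unit of mass and each code must receive an equal share. The cost values, `epsilon`, and the iteration count are hypothetical, and the summary does not specify the exact costs or marginals ToolWeaver uses.

```python
import math


def sinkhorn(cost, epsilon=0.2, iters=200):
    """Entropic OT plan with row sums 1 and equal column sums n / m."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    col_target = n / m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_target / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def argmax_rows(matrix):
    return [max(range(len(row)), key=lambda j: row[j]) for row in matrix]


# Hypothetical distances from 4 tools to 2 codes: every tool is nearest code 0.
COST = [[0.0, 1.0],
        [0.2, 1.0],
        [0.4, 1.0],
        [0.6, 1.0]]

greedy = argmax_rows([[-c for c in row] for row in COST])
plan = sinkhorn(COST)
balanced = argmax_rows(plan)
print(greedy)    # [0, 0, 0, 0]
print(balanced)  # [0, 0, 1, 1]
```

Nearest-code assignment sends all four tools to code 0, while the balanced plan moves the two tools with the weakest preference for code 0 over to code 1, so both codes are used equally.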

Graph Laplacian regularization: A loss function term that penalizes large distances between the representations of nodes (tools) that are connected (co-occur) in a graph

generative alignment: Fine-tuning the LLM to output the specific new tokens (codes) representing tools, rather than just selecting from existing text

trie: A prefix tree data structure used during inference to constrain the LLM's generation to only valid sequences of tool codes