Bowen Fang, Wen Ye, Yunyue Su, Jinghao Zhang, Qiang Liu, Yesheng Liu, Xin Sun, Shu Wu, Jiabing Yang, Baole Wei, Liang Wang
New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences,
School of Artificial Intelligence, University of Chinese Academy of Sciences,
Zhongguancun Academy,
Zhongguancun Institute of Artificial Intelligence
arXiv
(2026)
Agent, Recommendation
📝 Paper Summary
Tool-use post-training, Generative tool selection
ToolWeaver replaces unique tool tokens with hierarchical code sequences generated via collaborative-aware quantization, enabling LLMs to scale to thousands of tools while learning collaborative relationships from dense code co-occurrences.
Core Problem
Assigning a unique token to every tool causes vocabulary explosion and creates a semantic bottleneck, as the model must infer relationships from the sparse co-occurrence of isolated IDs.
Why it matters:
Linear vocabulary growth (one token per tool) is unsustainable for massive tool libraries (e.g., 47,000 tools), requiring huge memory and disrupting pre-trained knowledge
Sparse co-occurrence of unique IDs prevents models from learning that certain tools (e.g., Weather and Air Quality) should work together, leading to incomplete reasoning
Existing retrieval methods are complex and disconnected from the LLM's reasoning, while current generative methods fail to generalize to new tools without retraining embeddings
Concrete Example: For the query 'is it a good day to take my kid to the park?', a model needs both 'Realtime Weather' and 'Air Quality'. If these are represented as isolated tokens <Tool_42> and <Tool_99> that rarely appear together, the model may select Weather but miss Air Quality. ToolWeaver groups them under a shared parent code (e.g., <Conditions>) so the model learns the category association.
Key Novelty
Collaborative-Aware Structured Tokenization
Represents each tool not as a single ID, but as a sequence of discrete codes (e.g., <T1_1><T2_1>) from hierarchical codebooks, reducing vocabulary expansion from linear to logarithmic
Injects collaborative signals (tool co-usage patterns) directly into the tokenization process using a Graph Laplacian regularizer, forcing functionally related tools to share code prefixes
Solves the 'index collision' problem in quantization using Optimal Transport (Sinkhorn-Knopp) to ensure every tool has a unique code sequence without breaking semantic structure
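The linear-versus-logarithmic claim can be checked with a little arithmetic. The sketch below is illustrative only: the codebook sizes (16 and 256) are assumptions, not values from the paper, and the token naming simply follows the <T1_1><T2_1> example above. With a codebook of size K per level, L levels can address K^L distinct tools while adding only L × K tokens to the vocabulary.

```python
def levels_needed(num_tools: int, codebook_size: int) -> int:
    """Smallest number of levels L such that codebook_size ** L >= num_tools."""
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    levels, capacity = 1, codebook_size
    while capacity < num_tools:
        capacity *= codebook_size
        levels += 1
    return levels


def added_tokens(num_tools: int, codebook_size: int) -> int:
    """New vocabulary entries: one token per (level, code index) pair."""
    return levels_needed(num_tools, codebook_size) * codebook_size


def code_tokens(indices):
    """Render per-level code indices as a token string, e.g. [1, 1] -> <T1_1><T2_1>."""
    return "".join(f"<T{level}_{index}>" for level, index in enumerate(indices, start=1))


NUM_TOOLS = 47_000  # approximate ToolBench tool count quoted in this summary
for k in (16, 256):  # hypothetical codebook sizes
    print(k, levels_needed(NUM_TOOLS, k), added_tokens(NUM_TOOLS, k))
print(code_tokens([1, 1]))
```

Under these assumed sizes, a one-token-per-tool scheme would add about 47,000 tokens, while two levels of 256 codes add 512 and four levels of 16 codes add 64.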
Architecture
The structured tokenization process of ToolWeaver. It visualizes how a tool is mapped to a sequence of codes from multiple codebooks.
Breakthrough Assessment
8/10
Addresses the fundamental scalability limit of generative tool use (vocabulary explosion) while simultaneously solving the semantic sparsity problem. The shift from atomic to compositional tool tokens is a significant architectural advance.
⚙️ Technical Details
Problem Definition
Setting: Generative tool selection and execution where tools are treated as tokens in the language generation process
Inputs: User query q and a large tool corpus D
Outputs: A sequence of tool identifiers (hierarchical codes), parameters, and execution triggers
Purpose: Enforce uniform usage of codes to prevent collisions.
Formally: Optimal Transport constraint solved with the Sinkhorn-Knopp algorithm
Purpose: Align LLM to generate codes.
Formally: Standard Next Token Prediction loss on code tokens
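To make the uniform-usage constraint concrete, here is a minimal sketch of Sinkhorn-Knopp scaling on a toy tool-to-code cost matrix. It is not the paper's formulation: the cost values, the temperature `epsilon`, and the iteration count are assumptions chosen for illustration. The routine finds a soft assignment whose rows (tools) and columns (codes) both carry uniform mass, so no single code can absorb every tool.

```python
import math


def sinkhorn(cost, epsilon=0.1, num_iters=200):
    """Balanced soft assignment: rows sum to 1/n, columns sum to 1/m."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    row_target, col_target = 1.0 / n, 1.0 / m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows and columns toward their target marginals.
        u = [row_target / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_target / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy costs (hypothetical): every tool is nearest to code 0.
cost = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6]]
nearest = [min(range(2), key=lambda j: row[j]) for row in cost]
plan = sinkhorn(cost)
balanced = [max(range(2), key=lambda j: row[j]) for row in plan]
print("nearest-code assignment:", nearest)
print("Sinkhorn assignment:    ", balanced)
```

Plain nearest-code assignment sends all four tools to code 0, whereas the balanced plan splits them evenly between the two codes while still keeping the tools closest to code 0 on code 0.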
Training Data:
ToolBench dataset (nearly 47,000 tools)
Compute: Not reported in the paper
Comparison to Prior Work
vs. ToolGen: Uses compositional codes (logarithmic vocab) vs. unique tokens (linear vocab); ToolWeaver explicitly models collaborative signals in tokenization
vs. ToolLLM: End-to-end generative selection vs. multi-stage retrieval pipeline
vs. LC-Rec: Incorporates a Graph Laplacian directly into quantization to capture tool co-use, whereas LC-Rec focuses on user-item history (LC-Rec is methodologically similar but is not cited in the paper as a direct baseline)
Limitations
Requires pre-computation of the co-occurrence matrix, which may be sparse or unavailable for entirely new toolsets
Hierarchical codes increase the sequence length (L tokens per tool instead of 1), potentially increasing inference latency slightly
Constraint-based decoding (Trie) is required to prevent generating invalid code combinations
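The trie constraint in the last limitation can be sketched as follows. This is a minimal illustration, not the paper's decoder: the tool names, code sequences, and the fixed `scores` dictionary (a stand-in for LLM logits, which in practice depend on the prefix) are all hypothetical.

```python
def build_trie(sequences):
    """Nested-dict prefix tree over valid code sequences."""
    root = {}
    for seq in sequences:
        node = root
        for token in seq:
            node = node.setdefault(token, {})
    return root


def allowed_next(trie, prefix):
    """Tokens that keep the prefix on a path to some valid tool code."""
    node = trie
    for token in prefix:
        if token not in node:
            return []
        node = node[token]
    return sorted(node)


def constrained_greedy(trie, scores, length):
    """Greedy decoding restricted to trie-valid continuations."""
    prefix = []
    for _ in range(length):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        prefix.append(max(candidates, key=lambda t: scores.get(t, float("-inf"))))
    return prefix


# Hypothetical tool-to-code mapping.
tools = {
    "weather": ("<T1_1>", "<T2_1>"),
    "air_quality": ("<T1_1>", "<T2_2>"),
    "stocks": ("<T1_2>", "<T2_1>"),
}
trie = build_trie(tools.values())
# Stand-in scores: <T2_3> scores highest but completes no valid tool code.
scores = {"<T1_1>": 2.0, "<T1_2>": 1.0, "<T2_3>": 5.0, "<T2_2>": 0.5, "<T2_1>": 0.1}
decoded = constrained_greedy(trie, scores, length=2)
print(decoded)
```

Without the constraint, greedy decoding would emit <T2_3> at the second step and produce a sequence that maps to no tool; with it, the decoder is limited to the two children of <T1_1>.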
Code is publicly available at https://github.com/Fwibo/ToolWeaver. The data used is ToolBench. Specific hyperparameters for the Llama-3 fine-tuning (batch size, learning rate) are not available in the text summarized here.
📊 Experiments & Results
Evaluation Setup
Tool selection and execution on the ToolBench benchmark
Benchmarks:
ToolBench (Instruction tuning for tool use)
Metrics:
Not explicitly reported in the paper
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
ToolWeaver scales to nearly 47,000 tools with logarithmic vocabulary expansion, avoiding the memory overhead of one-token-per-tool methods.
The method is reported to outperform the state-of-the-art baselines ToolLLM and ToolGen on complex task completion; specific numeric scores are not available in the text summarized here.
Collaborative-aware tokenization allows the model to infer relationships between tools (e.g., weather and air quality) that rarely appear together in the training data, overcoming the sparsity problem.
📚 Prerequisite Knowledge
Prerequisites
Vector Quantization (specifically RQ-VAE)
Large Language Model (LLM) tokenization
Graph Laplacian regularization
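As a refresher on the first prerequisite, the sketch below shows only the residual quantization step: each level picks the nearest centroid to what the previous levels left unexplained. The codebooks here are fixed toy values (an assumption); in an RQ-VAE they are learned jointly with an encoder and decoder, which this example omits.

```python
def nearest_index(vec, codebook):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def residual_quantize(vec, codebooks):
    """Return one code index per level plus the final unexplained residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest_index(residual, codebook)
        codes.append(idx)
        # Subtract the chosen centroid; the next level quantizes what remains.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Toy codebooks (hypothetical): a coarse level followed by a finer one.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.2, 0.0], [0.0, 0.2]],
]
codes, residual = residual_quantize([1.0, 0.2], codebooks)
print(codes, residual)
```

The resulting list of indices is what becomes a tool's code sequence, with the first index acting as the coarse, shared prefix.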
Key Terms
RQ-VAE: Residual-Quantized Variational AutoEncoder, a method that compresses vectors into a sequence of discrete codes by recursively quantizing the residual error
codebook: A fixed set of learned vectors (centroids) used in quantization; indices into this set form the discrete 'tokens' for tools
Sinkhorn-Knopp: An algorithm used here to enforce a uniform distribution of tools across codebook entries, preventing multiple tools from collapsing into the same code
Graph Laplacian regularization: A loss function term that penalizes large distances between the representations of nodes (tools) that are connected (co-occur) in a graph
generative alignment: Fine-tuning the LLM to output the specific new tokens (codes) representing tools, rather than just selecting from existing text
trie: A prefix tree data structure used during inference to constrain the LLM's generation to only valid sequences of tool codes
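The Graph Laplacian term defined above can be written two equivalent ways: as a weighted sum of squared distances between co-occurring tools, or as the trace of ZᵀLZ with L = D − W. The sketch below computes both on toy data; the embeddings and co-occurrence weights are hypothetical, and the exact weighting used in the paper is not specified in this summary.

```python
def pairwise_penalty(embeddings, weights):
    """Sum over pairs i < j of W[i][j] * ||z_i - z_j||^2."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += weights[i][j] * d2
    return total


def laplacian(weights):
    """L = D - W for a symmetric weight matrix with a zero diagonal."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
            for i in range(n)]


def trace_penalty(embeddings, weights):
    """tr(Z^T L Z), expanded as the sum over i, j of L[i][j] * <z_i, z_j>."""
    lap = laplacian(weights)
    n = len(embeddings)
    return sum(lap[i][j] * sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
               for i in range(n) for j in range(n))


# Hypothetical tools: 0 = weather, 1 = air quality, 2 = stocks.
weights = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # only 0 and 1 co-occur
far = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
near = [[0.0, 0.0], [0.5, 0.0], [0.0, 2.0]]
print(pairwise_penalty(far, weights), trace_penalty(far, weights))
print(pairwise_penalty(near, weights))
```

Moving the air-quality embedding closer to the weather embedding lowers the penalty, while the unconnected stocks tool contributes nothing wherever it sits.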