Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph

📝 Paper Summary

Tool-use post-training Data synthesis for tool learning

KG2Tool generates high-quality tool-use instruction data by mapping Knowledge Graph relations to APIs and extracting First-Order Logic query pathways as ground-truth solution steps.

Core Problem

Current methods for training LLMs to use tools rely on costly human annotation or unstable LLM-generated data, often resulting in errors, low complexity, and hallucinated solution paths.

Why it matters:

LLM-generated tool data often contains irrelevant tool combinations or incorrect logic, requiring expensive manual verification
Simple prompting strategies produce low-complexity queries that fail to challenge the model's reasoning capabilities
High-quality, verifiable execution traces are essential for robust tool learning but are hard to scale with human annotators

Concrete Example: A standard LLM might generate a query asking about a researcher but hallucinate a non-existent API or return incorrect data. In contrast, this method extracts a verifiable fact chain from a KG—e.g., 'Turing Award winners -> work in -> Deep Learning'—and converts it into an executable API sequence ('get_winners', 'get_intersection') where the ground truth is guaranteed by the graph structure.

Key Novelty

KG-to-Tool Instruction Synthesis (KG2Tool)

Treats Knowledge Graph triples (Head, Relation, Tail) as functional API calls (Input, Function, Output), guaranteeing execution correctness without an external interpreter
Uses First-Order Logic (FOL) templates to sample complex, multi-step subgraph structures, ensuring diverse and logic-heavy query patterns
Generates solution paths by traversing these subgraphs, providing accurate intermediate execution steps for instruction tuning without running actual code

Architecture

The overall framework for generating instruction data from Knowledge Graphs. It shows the pipeline from KG subgraph sampling to instruction formatting.

Evaluation Highlights

ToolLM-14B achieves 87.21 overall score on T-Eval, outperforming the much larger Qwen2.5-72B (86.71) and GPT-4 (86.44)
ToolLM-7B improves by 9.0% over its base model Qwen2.5-7B, surpassing GPT-3.5 (84.05 vs 84.72)
Fine-tuning on just 2,000 synthetic samples yields significant gains, demonstrating high data efficiency compared to larger, noisier datasets

Breakthrough Assessment

7/10

Highly effective method for generating verifiable tool-use data without external APIs. While the scope is limited to KG-style lookup tasks, the performance gains on general benchmarks like T-Eval are impressive for such a small dataset.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) for Tool Use

Inputs: Natural language user query q requiring tool execution

Outputs: Multi-turn conversation containing API calls (Thought, Action, Action Input) and final response

Pipeline Flow

KG Subgraph Sampling (via FOL templates)
API & Query Generation (mapping relations to functions)
Solution Path Execution (traversing the graph to get ground truth)
Instruction Formatting (converting to chat format)

System Modules

FOL Sampler (Data Generation)

Extracts subgraphs matching specific logical patterns (e.g., intersection, union) from the KG to ensure query complexity

Model or implementation: Algorithm-based (subgraph matching)

Translator (Data Generation)

Converts logical relations into API names and FOL queries into natural language questions

Model or implementation: LLM (implied, used for linguistic conversion)

Executor (Data Generation)

Simulates tool execution by querying the KG to get ground-truth intermediate results for every step

Model or implementation: KG Query Engine

Novel Architectural Elements

Mapping KG triples (h, r, t) directly to Tool Executions (input, function, output) to synthesize verifiable execution traces without external code

Modeling

Base Model: Qwen2.5-Instruct series (7B, 14B, 32B, 72B)

Training Method: Supervised Fine-Tuning (SFT) with LoRA

Adaptation: LoRA (rank=16, alpha=32)

Trainable Parameters: LoRA adapters

Training Data:

2,000 samples randomly selected from the generated KG2Tool dataset
Source KG: FB15k (Freebase subset)

Key Hyperparameters:

learning_rate: 0.0001
batch_size: 32
warmup_ratio: 0.1
+ 3 more
scheduler: cosine
lora_rank: 16
lora_alpha: 32

Compute: Not reported in the paper

Comparison to Prior Work

vs. ToolAlpaca/ToolFormer: Guarantees correctness of tool outputs via KG facts rather than LLM simulation/hallucination
vs. ToolHop: Does not rely on expensive GPT-4o calls for the entire generation process; uses KG structure for logic
vs. General SFT: Focuses on complex logic (intersection, union, negation) derived from FOL rather than simple single-step tools

Limitations

Domain constraint: Generated tools are limited to factual lookups (KG relations) and set operations; may not generalize to procedural tools (e.g., sending emails, calculating math).
KG Quality dependence: The correctness of the data relies entirely on the quality and completeness of the source Knowledge Graph (FB15k).
Simulated vs. Real: The 'tools' are proxies for database queries, which might differ behaviorally from messy real-world APIs with latency or errors.

Reproducibility

The paper states the KG2Tool data will be publicly available. The method relies on the FB15k dataset and standard open-source LLMs (Qwen2.5). Exact prompt templates for query generation are mentioned as being in the Appendix.

📊 Experiments & Results

Evaluation Setup

Evaluation on the T-Eval benchmark, which assesses tool use across planning, reasoning, retrieval, understanding, instruction following, and reviewing.

Benchmarks:

T-Eval (Step-by-step tool evaluation (multi-turn))

Metrics:

Overall Score
Plan
Reason
Retrieve
Understand
Instruct
Review
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
T-Eval	Overall Score	77.72	84.72	+7.00
T-Eval	Overall Score	84.05	84.72	+0.67
T-Eval	Overall Score	86.71	87.21	+0.50
T-Eval	Plan (Sub-task)	78.20	84.80	+6.60
T-Eval	Reason (Sub-task)	77.00	86.80	+9.80

Main Takeaways

Small-scale fine-tuning (2k samples) with high-quality synthetic data yields massive improvements (+7-9%) in tool use performance.
Models trained on KG-derived data (ToolLM) can outperform significantly larger models (14B vs 72B) and proprietary models (GPT-3.5) on tool benchmarks.
The method is particularly effective at boosting Reasoning and Planning capabilities, likely due to the complex First-Order Logic templates used in data generation.

📚 Prerequisite Knowledge

Prerequisites

Knowledge Graphs (entities, relations, triples)
First-Order Logic (FOL) queries
Instruction Tuning / Supervised Fine-Tuning (SFT)
Tool-use / Function calling in LLMs

Key Terms

Knowledge Graph (KG): A structured database of facts represented as a graph where nodes are entities and edges are relations (e.g., Alice --friend--> Bob)

First-Order Logic (FOL) query: A formal logic statement using quantifiers (exists, for all) and logical operations (AND, OR, NOT) to retrieve data, used here as a template for generating complex questions

KG2Tool: The proposed dataset/method that converts KG subgraphs into tool-use instruction data

Triple: The basic unit of a KG, consisting of (Subject, Predicate, Object), treated here as (Input, Function, Output)

Relation Projection: An operation finding all tail entities connected to a head entity by a specific relation; mapped here to a standard API call

T-Eval: A benchmark for evaluating LLM tool usage capabilities across planning, reasoning, and retrieval sub-tasks

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices