
Small Language Models for Efficient Agentic Tool Calling: Outperforming Large Models with Targeted Fine-tuning

P Jhandi, O Kazi, S Subramanian, N Sendas
arXiv, 12/2025
Agent Reasoning Benchmark

📝 Paper Summary

Tool-use post-training Small Language Models (SLMs)
A 350M-parameter Small Language Model, fine-tuned on high-quality tool-use data, significantly outperforms much larger models such as ChatGPT-CoT on specialized agentic tool-calling tasks.
Core Problem
Running state-of-the-art LLMs for routine tool-calling tasks is cost-prohibitive and computationally inefficient due to their massive size and lack of specialization.
Why it matters:
  • High infrastructure costs and latency of large models prevent widespread adoption of generative AI in mission-critical production systems
  • General-purpose models often overgeneralize or hallucinate when precise, structured API calling formats are required
  • Reliance on closed APIs introduces data privacy risks and operational dependencies
Concrete Example: Large generalist models often generate verbose explanations or attempt creative solutions when a strict Thought-Action-Action Input format is required for an API call, whereas the proposed SLM learns to suppress this irrelevant behavior.
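To make the format constraint concrete, here is a minimal sketch of a strict checker for a Thought-Action-Action Input response. The exact field labels, the assumption that Action Input is a JSON object, and the names `ToolCall` and `parse_tool_call` are illustrative and are not taken from the paper. The checker rejects any output that wraps the three fields in extra prose, which is the behavior the summary says generalist models tend to exhibit.

```python
import json
import re
from typing import NamedTuple, Optional


class ToolCall(NamedTuple):
    thought: str
    action: str
    action_input: dict


# Assumed layout (hypothetical): the three fields appear once each, in this
# order, with nothing before "Thought:" and nothing after the JSON object.
_PATTERN = re.compile(
    r"\AThought:\s*(?P<thought>.*?)\s*\n"
    r"Action:\s*(?P<action>[^\n]+?)\s*\n"
    r"Action Input:\s*(?P<input>\{.*\})\s*\Z",
    re.DOTALL,
)


def parse_tool_call(text: str) -> Optional[ToolCall]:
    """Return a ToolCall if the text follows the strict format, else None."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None  # extra prose or missing fields
    try:
        args = json.loads(match.group("input"))
    except json.JSONDecodeError:
        return None  # Action Input is not valid JSON
    if not isinstance(args, dict):
        return None
    return ToolCall(match.group("thought"), match.group("action"), args)


strict = (
    "Thought: I need the current weather.\n"
    "Action: get_weather\n"
    'Action Input: {"city": "Paris"}'
)
verbose = "Sure! Let me explain my plan first.\n" + strict + "\nHope this helps!"

print(parse_tool_call(strict))   # parsed ToolCall
print(parse_tool_call(verbose))  # None: surrounding prose breaks the format
```

A checker of this kind is deliberately unforgiving: a downstream executor can dispatch `action` with `action_input` only when the whole response matches, so verbose but otherwise sensible answers count as failures.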
Key Novelty
Targeted Supervised Fine-Tuning of Small Language Models (SLMs)
  • Demonstrates that a very small model (350M parameters) can become a domain expert in tool calling through single-epoch fine-tuning on high-quality instruction data
  • Replaces the 'scaling law' approach with a 'behavioral focus' approach, where limited capacity is dedicated entirely to learning structured reasoning and API patterns rather than general knowledge
Evaluation Highlights
  • Achieved a 77.55% pass rate on ToolBench, outperforming ChatGPT-CoT (26.00%) by 51.55 percentage points
  • Surpassed ToolLLaMA-DFS (30.18%) by 47.37 points and ToolLLaMA-CoT (16.27%) by 61.28 points despite having significantly fewer parameters
  • Maintained consistent performance (74% to 80.5%) across all six ToolBench complexity categories, suggesting robust generalization to unseen tools and instructions
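The size of these gaps can be checked with a few lines of arithmetic. The sketch below uses only the pass rates quoted above; the dictionary labels and the helper name `gaps_vs` are illustrative. It reports each baseline's gap in percentage points and as a ratio of pass rates.

```python
# ToolBench pass rates (%) as quoted in the highlights above.
PASS_RATES = {
    "SLM (350M, fine-tuned)": 77.55,
    "ToolLLaMA-DFS": 30.18,
    "ChatGPT-CoT": 26.00,
    "ToolLLaMA-CoT": 16.27,
}


def gaps_vs(reference: str, rates: dict) -> dict:
    """Map each other model to (percentage-point gap, pass-rate ratio)."""
    ref = rates[reference]
    return {
        name: (round(ref - rate, 2), round(ref / rate, 2))
        for name, rate in rates.items()
        if name != reference
    }


for name, (points, ratio) in gaps_vs("SLM (350M, fine-tuned)", PASS_RATES).items():
    print(f"{name}: +{points} points, {ratio}x the pass rate")
```

Reporting both figures avoids ambiguity: a gap in percentage points is an absolute difference, while the ratio shows how many times more often the fine-tuned model passes.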
Breakthrough Assessment
8/10
The size of the performance gap (77.55% vs 26.00%) achieved by such a small model (350M parameters) challenges the prevailing assumption that complex tool-use reasoning requires massive parameter counts, and it offers a viable path toward cheap, local agentic AI.