
Small Language Models are the Future of Agentic AI

Peter Belcák, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Lin, Pavlo Molchanov
NVIDIA Research, Georgia Institute of Technology
arXiv.org (2025)
Agent Reasoning Benchmark

📝 Paper Summary

Agentic System Architecture, Model Efficiency
Agentic AI should shift from monolithic Large Language Models (LLMs) to specialized Small Language Models (SLMs) because SLMs offer sufficient capability for modular sub-tasks with superior speed, cost, and flexibility.
Core Problem
Most AI agents currently rely on monolithic, generalist Large Language Models (LLMs) for all operations, which is operationally excessive and economically inefficient for the repetitive, narrowly scoped sub-tasks that dominate agentic workflows.
Why it matters:
  • Current LLM-centric agent deployments incur high latency and massive operational costs (estimated $57bn infrastructure investment), making scaling difficult
  • Using generalist models for narrow tasks (like formatting JSON) is a misallocation of computational resources and energy
  • Reliance on centralized cloud LLMs limits privacy, edge deployment, and the ability to rapidly iterate on specialized behaviors
Concrete Example: A typical agent might use a 70B+ parameter model just to parse a tool output into JSON, a task a 2B parameter model could handle quickly and cheaply. The paper argues this is like hiring a PhD to do data entry.
Key Novelty
The SLM-First Agentic Paradigm
  • Proposes that agentic systems should be composed primarily of Small Language Models (<10B parameters) serving as specialized experts for distinct sub-tasks
  • Advocates for 'Heterogeneous Agentic Systems' where massive LLMs are only invoked selectively for high-level reasoning or open-ended conversation, while SLMs handle the bulk of operational logic
  • Frames the shift to SLMs not just as optimization, but as a necessary architectural evolution for sustainable and democratized AI agents
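A minimal sketch of the routing idea behind a heterogeneous agentic system, assuming each agent step carries a task label. The task labels, the `make_router` helper, and the stub model functions are all hypothetical illustrations; the paper does not prescribe a specific taxonomy or API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical labels for narrow, repetitive sub-tasks (an assumption,
# not a taxonomy taken from the paper).
NARROW_TASKS = {"format_json", "parse_tool_output", "extract_fields"}


@dataclass
class Route:
    model: str                      # "slm" or "llm"
    handler: Callable[[str], str]   # function that answers a prompt


def make_router(slm: Callable[[str], str], llm: Callable[[str], str]):
    """Return a function that picks the SLM by default for narrow tasks
    and falls back to the LLM for everything else."""
    def route(task_type: str) -> Route:
        if task_type in NARROW_TASKS:
            return Route("slm", slm)
        return Route("llm", llm)
    return route


# Stub models standing in for real inference endpoints.
def stub_slm(prompt: str) -> str:
    return f"[slm] {prompt}"


def stub_llm(prompt: str) -> str:
    return f"[llm] {prompt}"


route = make_router(stub_slm, stub_llm)
```

In this sketch, `route("format_json")` selects the SLM handler, while an unlisted label such as `"open_ended_chat"` falls through to the LLM. A real system would likely replace the static label set with a learned or rule-based classifier.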
Architecture
Figure 1: Illustration of Heterogeneous Agentic Systems showing how SLMs can replace LLMs in sub-tasks.
Evaluation Highlights
  • Serving a 7B-parameter SLM is 10–30x cheaper (in latency, energy, and FLOPs) than serving a 70–175B-parameter LLM, while maintaining real-time responsiveness
  • Microsoft Phi-2 (2.7B) achieves reasoning and code generation scores on par with 30B models while running ~15x faster
  • NVIDIA Hymba-1.5B demonstrates 3.5x greater token throughput than comparably sized transformer models and outperforms larger 13B models on instruction following
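To make the 10–30x serving-cost figure concrete, here is a rough back-of-the-envelope sketch of what a mixed deployment might cost relative to an LLM-only one. The function name and the 80% routing fraction are assumptions chosen for illustration; only the 10–30x range comes from the summary above.

```python
def blended_cost_ratio(slm_fraction: float, slm_cost_advantage: float) -> float:
    """Cost of a mixed SLM/LLM deployment relative to LLM-only (1.0 = no savings).

    slm_fraction:        share of calls routed to the SLM (0.0 to 1.0)
    slm_cost_advantage:  how many times cheaper one SLM call is than one LLM call
    """
    if not 0.0 <= slm_fraction <= 1.0:
        raise ValueError("slm_fraction must be between 0 and 1")
    if slm_cost_advantage <= 0:
        raise ValueError("slm_cost_advantage must be positive")
    # SLM calls cost 1/advantage each; the remaining calls cost 1 each.
    return slm_fraction / slm_cost_advantage + (1.0 - slm_fraction)


# Assumed scenario: 80% of calls are narrow enough for the SLM.
low_end = blended_cost_ratio(0.8, 10)    # SLM is 10x cheaper per call
high_end = blended_cost_ratio(0.8, 30)   # SLM is 30x cheaper per call
```

Under these assumptions the blended cost comes out at roughly 0.28 and 0.23 of the LLM-only cost. The residual LLM calls dominate the total, so the share of work that can be routed to SLMs matters more than the exact per-call advantage.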
Breakthrough Assessment
7/10
Although the paper does not propose a novel model architecture, it strongly articulates a necessary paradigm shift for the industry. It aggregates significant evidence that SLMs are ready to replace LLMs in modular systems, challenging the 'bigger is better' status quo.