Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning

📝 Paper Summary

RL-based tool use, Tool-use post-training

Tool-N1 trains language models to reason about and call tools via rule-based reinforcement learning, with a binary reward for format and functional correctness, eliminating the need for annotated reasoning trajectories.

Core Problem

Current tool-use training relies on supervised fine-tuning (SFT) with distilled trajectories, which often leads to superficial imitation of reasoning rather than genuine decision-making skills.

Why it matters:

SFT models struggle to generalize beyond their training data, merely mimicking surface-level patterns
Existing pipelines require expensive curation of step-by-step reasoning traces from stronger teacher models
Models often fail to internalize the decision-making process, leading to fragile tool-calling behaviors

Concrete Example: In supervised training, a model might learn to output a specific tool call sequence for a query like 'stock price of Apple' because it memorized the training example. If the argument order changes or the query is slightly different, the model fails. Tool-N1's RL approach rewards the functional result (dictionary match), allowing the model to learn that `{"symbol": "AAPL", "market": "US"}` is valid regardless of argument order.
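The gap between string matching and dictionary matching can be illustrated with a minimal Python sketch. The two JSON payloads are hypothetical model outputs built from the example above, not data from the paper.

```python
import json

# Two hypothetical argument payloads that differ only in key order.
reference = '{"symbol": "AAPL", "market": "US"}'
predicted = '{"market": "US", "symbol": "AAPL"}'

# Exact string comparison treats the reordered payload as wrong.
string_match = predicted == reference

# Parsing to dictionaries first makes the comparison order-insensitive.
dict_match = json.loads(predicted) == json.loads(reference)

print(string_match, dict_match)  # False True
```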

Key Novelty

Rule-Based Reinforced Reasoning for Tool Use (Tool-N1)

Apply R1-style reinforcement learning (GRPO) to tool calling, using a binary reward that checks only formatting and functional correctness (arguments/tool name)
Eliminate the need for ground-truth reasoning traces during training; the model self-generates and optimizes its own reasoning process via trial and error
Demonstrate that pure RL on tool outcomes outperforms the standard 'SFT followed by RL' pipeline for tool-use tasks

Architecture

Overview of the training pipeline for Nemotron-Research-Tool-N1. It contrasts the standard SFT approach with the proposed RL approach.

Evaluation Highlights

Tool-N1-14B outperforms GPT-4o on BFCL (85.97% vs 83.97%) and API-Bank (82.19% vs 77.16%) benchmarks
Tool-N1-7B surpasses the closed-source GPT-4o on BFCL overall accuracy (84.82% vs 83.97%)
Pure RL training achieves marginally higher accuracy (83.24%) than the standard SFT-then-RL pipeline (83.17%) on the ToolACE dataset, a gap of 0.07 points

Breakthrough Assessment

8/10

Successfully applies the 'DeepSeek-R1' reasoning paradigm to tool use, showing that SFT on reasoning traces is unnecessary and potentially suboptimal compared to pure RL. Empirical results are strong: both the 7B and 14B models outperform GPT-4o on BFCL.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn tool-use interaction where an LLM selects actions from a set of available tools Z

Inputs: Historical context c_t (past actions/observations) and available tools Z

Outputs: Action a_t (one or more tool calls) and associated reasoning text

Pipeline Flow

Input Processing (User Query + Tool Definitions)
Reasoning Generation (Policy Model)
Tool Call Generation (Policy Model)
Reward Calculation (Environment/Rule-based)

System Modules

Policy Model

Generate reasoning thoughts and specific tool calls

Model or implementation: Qwen2.5-Instruct (7B/14B) or LLaMA-3.1-Instruct

Reward Function

Evaluate the correctness of the generated output for RL updates

Model or implementation: Rule-based Python script

Novel Architectural Elements

Integration of structural verification (XML tag checks) directly into the RL reward loop for tool use
Dictionary-based matching logic in reward function that permits argument reordering (unlike string-matching SFT)
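The two elements above can be combined into one all-or-nothing reward. The following is a rough sketch under stated assumptions, not the paper's script: the `<tool_call>` tag name, the JSON-list payload, the `name`/`arguments` keys, and the `get_stock_price` tool are all illustrative. Only the `<think>` tag and the dictionary-style comparison come from the summary.

```python
import json
import re

# Assumed layout: <think>...</think> followed by <tool_call>...</tool_call>.
# The <tool_call> tag and its JSON-list payload are illustrative assumptions.
PATTERN = re.compile(
    r"^\s*<think>(.+?)</think>\s*<tool_call>(.+?)</tool_call>\s*$",
    re.DOTALL,
)

def binary_reward(output: str, reference_calls: list) -> float:
    """Return 1.0 only if the format check and the tool-call match both pass."""
    match = PATTERN.match(output)
    if match is None:
        return 0.0  # structural check failed
    try:
        predicted = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0  # payload is not valid JSON
    if not isinstance(predicted, list):
        return 0.0
    # Dict equality ignores key order; the order of calls still matters here.
    return 1.0 if predicted == reference_calls else 0.0

# Hypothetical reference call and model output.
reference = [{"name": "get_stock_price",
              "arguments": {"symbol": "AAPL", "market": "US"}}]
output = (
    "<think>The user wants a US stock quote.</think>\n"
    '<tool_call>[{"name": "get_stock_price", '
    '"arguments": {"market": "US", "symbol": "AAPL"}}]</tool_call>'
)
print(binary_reward(output, reference))  # 1.0
```

Because the comparison runs on parsed dictionaries, the reordered arguments in `output` still earn the full reward, while a missing `<think>` block or a wrong argument value yields 0.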

Modeling

Base Model: Qwen2.5-7B-Instruct / Qwen2.5-14B-Instruct

Training Method: GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize expected reward while staying close to reference model.

Formally: L_GRPO(θ) = E[min(ρ_i A_i, clip(ρ_i, 1-ε, 1+ε)A_i) - β * KL(π_θ || π_old)]
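To make the terms A_i and clip(ρ_i, 1-ε, 1+ε) concrete, here is a minimal sketch of group-relative advantages and the clipped surrogate for a single sample. It assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation). The value `clip_eps=0.2` is an illustrative choice, since the summary does not report ε, and the KL penalty is omitted.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for one sample."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Binary rewards for a hypothetical group of four sampled responses.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A ratio above 1 + eps is clipped when the advantage is positive.
print(clipped_surrogate(1.5, 1.0))
```

With binary rewards, a group in which every response succeeds (or every response fails) produces all-zero advantages, so such groups contribute no gradient signal in this sketch.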

Training Data:

xLAM (60,000 single-turn samples)
ToolACE subset (8,183 single-turn, 1,470 multi-turn samples)
Filtered to remove invalid tool calls and format inconsistencies
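The filtering step might look something like the sketch below. The sample schema (a `tools` list plus an `answer` string holding a JSON list of calls) and the specific validity rules are assumptions made for illustration. The summary only states that invalid tool calls and format inconsistencies were removed.

```python
import json

def is_valid_sample(sample: dict) -> bool:
    """Keep a sample only if its reference calls parse and name a declared tool."""
    try:
        tool_names = {tool["name"] for tool in sample["tools"]}
        calls = json.loads(sample["answer"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return False  # missing fields or unparseable reference answer
    if not isinstance(calls, list) or not calls:
        return False
    return all(
        isinstance(call, dict)
        and call.get("name") in tool_names
        and isinstance(call.get("arguments"), dict)
        for call in calls
    )

# Hypothetical samples: one clean, one calling an undeclared tool.
samples = [
    {"tools": [{"name": "get_weather"}],
     "answer": '[{"name": "get_weather", "arguments": {"city": "Paris"}}]'},
    {"tools": [{"name": "get_weather"}],
     "answer": '[{"name": "get_time", "arguments": {}}]'},
]
kept = [s for s in samples if is_valid_sample(s)]
print(len(kept))  # 1
```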

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 1024
temperature: 0.7
kl_coefficient: 1e-3
entropy_coefficient: 0

Compute: Training on 4 nodes, each with 8 NVIDIA H100 80GB GPUs

Comparison to Prior Work

vs. ToolACE/xLAM: Tool-N1 uses RL (GRPO) with rule-based rewards instead of SFT on static traces
vs. DeepSeek-R1: Adapts the R1 paradigm specifically for tool-calling actions rather than math/code solutions
vs. Hammer: Achieves better performance with a general RL framework rather than specialized function masking techniques

Limitations

Experiments primarily focus on single-turn scenarios (BFCL, API-Bank, ACEBench exclusions)
Performance gains on smaller models (0.5B, 1.5B) are limited compared to larger models
Relies on ground truth tool calls for reward calculation, limiting application to open-ended tasks without clear correct answers

Reproducibility

Code: https://github.com/NVlabs/Tool-N1

Training data is derived from open datasets (xLAM, ToolACE), and the base models are open-weight (Qwen, LLaMA).

📊 Experiments & Results

Evaluation Setup

Single-turn and multi-turn tool calling evaluation against ground truth

Benchmarks:

Berkeley Function Call Leaderboard (BFCL): diverse tool calling (Simple, Multiple, Parallel)
API-Bank: tool-augmented LLM evaluation
ACEBench: tool learning benchmark

Metrics:

Accuracy (success rate of tool calls)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Tool-N1 models consistently outperform proprietary and open-source baselines on the primary BFCL benchmark.
BFCL (Overall)	Accuracy	83.97	84.82	+0.85
BFCL (Overall)	Accuracy	83.97	85.97	+2.00
BFCL (Overall)	Accuracy	81.88	84.82	+2.94
Tool-N1 demonstrates strong generalization on additional benchmarks API-Bank and ACEBench.
API-Bank	Accuracy	77.16	82.19	+5.03
ACEBench	Accuracy	82.34	87.00	+4.66
Ablation study on training recipes reveals pure RL outperforms pipelines involving SFT.
ToolACE (Subset)	Average Accuracy	82.71	83.24	+0.53
ToolACE (Subset)	Average Accuracy	83.17	83.24	+0.07

Experiment Figures

Learning curves for Tool-N1-7B and Tool-N1-14B showing KL divergence, response length, and validation accuracy over training steps.

Main Takeaways

Pure RL with rule-based rewards is sufficient for learning tool-calling reasoning and slightly exceeds the SFT-based recipes in the ablation (+0.07 to +0.53 points); annotated reasoning traces are not strictly necessary.
Binary rewards (all-or-nothing) perform better than fine-grained partial rewards, likely by reducing reward hacking.
The method scales effectively: larger models (7B/14B) show significant gains from RL, while smaller models (0.5B/1.5B) show limited improvement.
Enforcing a structured reasoning format (<think> tags) is crucial; removing this constraint significantly degrades performance.
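The contrast between binary and fine-grained rewards in the takeaways above can be sketched as follows. The three-way split into format, tool name, and arguments is an assumption chosen for illustration; the paper's exact partial-reward design is not given in this summary.

```python
def all_or_nothing_reward(checks):
    """Binary reward: 1.0 only when every check passes."""
    return 1.0 if all(checks) else 0.0

def partial_reward(checks):
    """Fine-grained reward: fraction of checks that pass."""
    return sum(checks) / len(checks)

# Hypothetical output: correct format and tool name, wrong arguments.
checks = [True, True, False]  # [format_ok, name_ok, args_ok]

print(all_or_nothing_reward(checks))  # 0.0
print(partial_reward(checks))
```

In this toy case the partial scheme still pays out most of the reward for a call that would fail when executed, which is one way a policy could collect credit without producing correct calls.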

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO or GRPO)
Tool-using Large Language Models (function calling)
Supervised Fine-Tuning (SFT)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group scores rather than a separate value network to reduce training cost

R1-style RL: Reinforcement learning paradigm inspired by DeepSeek-R1, focusing on rule-based rewards for final answers to induce reasoning without intermediate supervision

SFT: Supervised Fine-Tuning—training a model on labeled examples of inputs and desired outputs

BFCL: Berkeley Function Call Leaderboard—a benchmark for evaluating LLM tool-calling capabilities

KL divergence: A measure of how far one probability distribution is from another; used in RL as a penalty that keeps the trained policy from deviating too far from the reference model

Reasoning Trace: Step-by-step textual explanation generated by the model before taking an action (e.g., inside <think> tags)