ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

📝 Paper Summary

Reward Modeling, Agentic AI, Tool Use / Function Calling

ToolRM is a suite of outcome reward models trained on synthetically generated incorrect tool calls to accurately evaluate and improve tool-use performance in large language models.

Core Problem

Existing reward models are designed primarily for natural language chat and struggle to detect the nuances of tool-based reasoning, such as subtle parameter errors or missing arguments.

Why it matters:

Current general-purpose reward models frequently miss key signals of effective tool use, leading to poor alignment in agentic workflows
There is no dedicated benchmark for evaluating reward models specifically on function-calling tasks, making it difficult to quantify improvements
Reliable automated evaluation is critical for scaling training techniques like reinforcement learning and rejection sampling without human labeling

Concrete Example: A model might generate a tool call with an incorrect parameter value or a missing required parameter (e.g., calling 'search' without its required 'query' argument). General reward models often score such a call highly because it has the structure of a valid function call, whereas ToolRM is trained to reject these subtle errors.
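
To make this error class concrete, the sketch below checks a candidate call against a tool schema and reports missing required arguments. The `search` schema, its JSON-Schema-like layout, and the helper name `missing_required_args` are illustrative assumptions and do not come from the paper; ToolRM itself is a learned scorer, not a rule-based validator like this one.

```python
from typing import Any

# Hypothetical tool schema in a JSON-Schema-like layout (assumed for illustration).
SEARCH_TOOL: dict[str, Any] = {
    "name": "search",
    "parameters": {
        "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["query"],
    },
}


def missing_required_args(tool: dict[str, Any], call: dict[str, Any]) -> list[str]:
    """Return the required argument names that the call does not supply."""
    required = tool["parameters"].get("required", [])
    supplied = call.get("arguments", {})
    return [name for name in required if name not in supplied]


# A structurally valid call that still omits the required 'query' argument.
bad_call = {"name": "search", "arguments": {"limit": 5}}
good_call = {"name": "search", "arguments": {"query": "weather in Paris", "limit": 5}}

print(missing_required_args(SEARCH_TOOL, bad_call))   # ['query']
print(missing_required_args(SEARCH_TOOL, good_call))  # []
```

Both calls parse as well-formed function calls, which is why a scorer that attends only to surface structure could rate them similarly.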

Key Novelty

ToolRM (Tool Outcome Reward Model)

Trains a specialized outcome reward model (ORM) specifically for function calling by contrasting correct ground-truth calls against incorrect calls generated by a diverse pool of open-weight models
Introduces FC-RewardBench, a dataset of 1,500 difficult pairwise comparisons derived from the Berkeley Function Calling Leaderboard, to rigorously test reward model sensitivity to tool errors
Demonstrates that an RM trained on this domain-specific synthetic data can significantly boost inference performance via Best-of-N sampling

Evaluation Highlights

+24.9-point accuracy improvement for Qwen3-0.6B on downstream tool benchmarks when using ToolRM for Best-of-32 sampling instead of greedy decoding
ToolRM-1.5B outperforms much larger models (including gpt-oss-120B) on the proposed FC-RewardBench evaluation dataset
Data filtering using ToolRM enables training fine-tuned models that outperform baselines while using only 50% of the training data
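
One plausible reading of the data-filtering result is "score every training sample with the reward model and keep the top half". The sketch below shows that selection rule only. The function name `keep_top_fraction`, the top-k rule itself, and the toy scores are assumptions; the summary does not state how the paper's filter is implemented.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def keep_top_fraction(
    samples: Sequence[T], score: Callable[[T], float], fraction: float = 0.5
) -> list[T]:
    """Keep the highest-scoring `fraction` of samples (assumed selection rule)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []
    keep = max(1, int(len(samples) * fraction))
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[:keep]


# Toy (sample_id, reward) pairs standing in for reward-model scores.
scored = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(keep_top_fraction(scored, score=lambda s: s[1]))  # [('b', 0.9), ('d', 0.7)]
```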

Breakthrough Assessment

8/10

Addresses a critical gap in agentic AI (reward modeling for tools) with a comprehensive solution: a new benchmark, a scalable synthetic data method, and strong empirical results showing significant gains on top of strong base models.

⚙️ Technical Details

Problem Definition

Setting: Pairwise preference modeling for function calling

Inputs: User query x, tool catalog, conversation history, and a candidate tool call y

Outputs: Scalar reward score r(x,y) indicating the quality/correctness of the tool call
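
The inputs and output above can be written down as a small interface. This is a sketch under stated assumptions: the class names `ToolCallExample`, `RewardModel`, and `ExactMatchStub`, the field layout, and the stub scorer are all hypothetical and only mirror the quantities named in the problem definition.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ToolCallExample:
    """One scoring instance: the context x plus a candidate tool call y."""
    query: str
    tool_catalog: list[dict[str, Any]]
    candidate_call: dict[str, Any]
    history: list[dict[str, str]] = field(default_factory=list)


class RewardModel(Protocol):
    def score(self, example: ToolCallExample) -> float:
        """Return the scalar reward r(x, y)."""
        ...


class ExactMatchStub:
    """Toy stand-in for a learned model: 1.0 if the call equals a known reference."""

    def __init__(self, reference_call: dict[str, Any]) -> None:
        self.reference_call = reference_call

    def score(self, example: ToolCallExample) -> float:
        return 1.0 if example.candidate_call == self.reference_call else 0.0


def prefers(rm: RewardModel, chosen: ToolCallExample, rejected: ToolCallExample) -> bool:
    """Pairwise preference: True when the chosen call receives the higher reward."""
    return rm.score(chosen) > rm.score(rejected)


catalog = [{"name": "search", "parameters": {"required": ["query"]}}]
good = ToolCallExample("find cat videos", catalog, {"name": "search", "arguments": {"query": "cat videos"}})
bad = ToolCallExample("find cat videos", catalog, {"name": "search", "arguments": {}})
rm = ExactMatchStub(good.candidate_call)
print(prefers(rm, good, bad))  # True
```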

Pipeline Flow

Group: Data Generation (Offline)
Prompt diverse LLMs (0.5B-32B) to generate tool calls for open datasets
Filter for incorrect calls (Outcome != Ground Truth)
Construct pairs (Ground Truth vs. Incorrect)
Group: Inference (Best-of-N)
Generator produces N candidate tool calls
ToolRM scores all N candidates
Selector picks candidate with max reward
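
The inference group of the pipeline reduces to "sample N candidates, score each, take the argmax". A minimal sketch follows; `generate` and `reward_fn` are placeholder callables standing in for the generator LLM and ToolRM, and the toy pool and toy reward are assumptions for illustration.

```python
from itertools import cycle
from typing import Callable


def best_of_n(
    generate: Callable[[], str], reward_fn: Callable[[str], float], n: int = 32
) -> tuple[str, float]:
    """Sample n candidates, score each one, and return the best with its reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate() for _ in range(n)]
    scores = [reward_fn(candidate) for candidate in candidates]
    best_index = max(range(n), key=scores.__getitem__)  # first index wins ties
    return candidates[best_index], scores[best_index]


# Toy generator: cycles through a fixed pool instead of sampling from an LLM.
pool = ["search()", 'search(query="cats")', 'lookup(q="cats")']
toy_generate = cycle(pool).__next__


def toy_reward(call: str) -> float:
    """Toy scorer: rewards calls that pass a 'query' argument."""
    return 1.0 if "query=" in call else 0.0


print(best_of_n(toy_generate, toy_reward, n=3))  # ('search(query="cats")', 1.0)
```

Selection cannot produce a correct call that the generator never sampled, so the gain depends on at least one of the N candidates being right.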

System Modules

Generator Pool (Data Gen)

Generate representative incorrect tool calls for training

Model or implementation: Ensemble of 11 models (Qwen2.5, Granite, Mistral, etc.)

ToolRM

Evaluate the correctness of a generated tool call given the context

Model or implementation: Qwen-2.5-Instruct (1.5B, 7B, or 14B variants) with linear head

Novel Architectural Elements

Specialized ORM input format incorporating tool schemas and tool calls explicitly for reward computation

Modeling

Base Model: Qwen-2.5-Instruct (1.5B, 7B, 14B)

Training Method: Reward Modeling with Bradley-Terry objective

Objective Functions:

Purpose: Maximize the likelihood of the preferred (correct) tool call having a higher score than the incorrect one.

Formally: L = -log(sigmoid(r(x, y_plus) - r(x, y_minus)))

Purpose: Keep rewards zero-centered for stability.

Formally: L_reg = gamma * (r(x, y_plus)^2 + r(x, y_minus)^2), a regularization term added to L
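
As a worked instance of the two terms, the sketch below computes the pairwise loss plus the centering penalty for a single pair of scalar rewards, using only the standard library. The function name is hypothetical, and the default `gamma=0.01` assumes that gamma corresponds to the reward_centering_coefficient listed under the hyperparameters.

```python
import math


def bradley_terry_loss(r_plus: float, r_minus: float, gamma: float = 0.01) -> float:
    """Pairwise Bradley-Terry loss plus a squared-reward centering penalty."""
    margin = r_plus - r_minus
    # -log(sigmoid(m)) equals softplus(-m); branch so exp() never overflows.
    if margin >= 0:
        pairwise = math.log1p(math.exp(-margin))
    else:
        pairwise = -margin + math.log1p(math.exp(margin))
    centering = gamma * (r_plus ** 2 + r_minus ** 2)
    return pairwise + centering


# Correct ranking (preferred call scored higher) gives the smaller loss.
print(bradley_terry_loss(2.0, -2.0) < bradley_terry_loss(-2.0, 2.0))  # True
```

With equal rewards and `gamma=0` the loss is log(2), the value for a model that cannot tell the two calls apart.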

Training Data:

180K total samples
85K single-turn (API-Gen)
85K multi-turn (SGD)
10K irrelevance (xlam-irrelevance)
Incorrect samples generated by 11 open-weight models (0.5B to 32B parameters)
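
The pairing step (keep model outputs that differ from the ground truth, then pair each with the ground truth) can be sketched as follows. The exact-match comparison on canonicalized JSON, the de-duplication of repeated wrong calls, and the names `canonical` and `build_preference_pairs` are assumptions for illustration; the paper's actual correctness check may differ.

```python
import json
from typing import Any


def canonical(call: dict[str, Any]) -> str:
    """Serialize a call with sorted keys so argument order does not matter."""
    return json.dumps(call, sort_keys=True)


def build_preference_pairs(
    ground_truth: dict[str, Any], generated_calls: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Pair the ground-truth call with each distinct incorrect generated call."""
    gold = canonical(ground_truth)
    seen: set[str] = set()
    pairs = []
    for call in generated_calls:
        key = canonical(call)
        if key == gold or key in seen:  # skip correct calls and duplicates
            continue
        seen.add(key)
        pairs.append({"chosen": ground_truth, "rejected": call})
    return pairs


truth = {"name": "search", "arguments": {"query": "cats", "limit": 5}}
outputs = [
    {"name": "search", "arguments": {"limit": 5, "query": "cats"}},  # correct, reordered
    {"name": "search", "arguments": {"limit": 5}},                   # missing 'query'
    {"name": "search", "arguments": {"limit": 5}},                   # duplicate error
    {"name": "lookup", "arguments": {"query": "cats", "limit": 5}},  # wrong tool name
]
print(len(build_preference_pairs(truth, outputs)))  # 2
```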

Key Hyperparameters:

learning_rate: 1e-6
epochs: 1
scheduler: cosine with 3% warmup
reward_centering_coefficient: 0.01
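
For reference, a cosine schedule with 3% warmup at a peak learning rate of 1e-6 could look like the sketch below. The linear warmup from zero, the decay to exactly zero, and the function name `lr_at` are assumptions; the summary gives only the scheduler type, the warmup fraction, and the learning rate.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 1e-6, warmup_frac: float = 0.03) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, end of warmup, midpoint, and final step.
for step in (0, 30, 515, 1000):
    print(step, lr_at(step, total_steps=1000))
```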

Compute: Not reported in the paper

Comparison to Prior Work

vs. General RMs: ToolRM is trained specifically on tool-call errors (schema violations, wrong params), whereas general RMs miss these.
vs. LLM-as-a-Judge: ToolRM (7B) outperforms much larger judges (70B+) and is more computationally efficient.
vs. Themis: ToolRM focuses on the correctness of the tool call itself as an outcome, rather than using tools to verify a natural language claim.

Limitations

Diminishing returns for very large generator models (improvements for 32B+ models are modest compared to small models)
Reliance on synthetic data generation may bias the reward model towards errors produced by specific open-weight models used in the pool
Focuses on outcome rewards (ORMs); does not explicitly model process rewards (PRMs) for the reasoning steps leading to the tool call

Reproducibility

The paper describes the data generation process in detail (datasets used, models used for generation, obfuscation strategy). The exact prompts are referenced in Appendix A.3. Code and trained weights are not explicitly linked in the main text.

📊 Experiments & Results

Evaluation Setup

Reward Model accuracy evaluation and Downstream Tool-Use enhancement via Rejection Sampling

Benchmarks:

FC-RewardBench (Reward Model Evaluation, Pairwise Comparison) [New]
Berkeley Function Calling Leaderboard (BFCL) v3 (Tool Use / Function Calling)
API-Bank (Tool Use / Dialogue)
ToolAlpaca (Tool Use)
NexusRaven API Evaluation (Tool Use)

Metrics:

Accuracy (pairwise preference)
Full Sequence Matching (correct tool name + args)
AST-based Accuracy (Abstract Syntax Tree matching for BFCL)
Statistical methodology: Not explicitly reported in the paper
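
The first two metrics are simple to state in code. The sketch below is an illustration under assumptions: ties count as incorrect in the pairwise metric, full sequence matching is order-sensitive and compares tool name plus arguments exactly, and both function names are hypothetical. AST-based matching for BFCL is not covered here.

```python
from typing import Any, Sequence


def pairwise_accuracy(score_pairs: Sequence[tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    if not score_pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)


def full_sequence_match(
    predicted: list[dict[str, Any]], reference: list[dict[str, Any]]
) -> bool:
    """True only if every call matches, in order, on tool name and arguments."""
    if len(predicted) != len(reference):
        return False
    return all(
        p.get("name") == r.get("name") and p.get("arguments") == r.get("arguments")
        for p, r in zip(predicted, reference)
    )


# Two of four pairs are ranked correctly; the tie counts as a miss.
print(pairwise_accuracy([(0.9, 0.1), (0.2, 0.8), (0.5, 0.5), (1.0, -1.0)]))  # 0.5
```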

Key Results

Best-of-N (N=32) experiments show massive gains for smaller models using ToolRM selection, with diminishing returns for larger models.
Benchmark	Metric	Baseline	This Paper	Δ
Average across 5 benchmarks (API-Bank, ToolAlpaca, etc.)	Accuracy	39.5	64.4	+24.9
Average across 5 benchmarks (API-Bank, ToolAlpaca, etc.)	Accuracy	64.9	70.5	+5.6
BFCL v3	Overall Accuracy	59.20	64.50	+5.30

FC-RewardBench evaluation demonstrates ToolRM's superior ability to identify correct tool calls compared to general-purpose baselines.
Benchmark	Metric	Baseline	This Paper	Δ
FC-RewardBench	Accuracy	45.0	88.0	+43.0

Experiment Figures

Accuracy comparison of various Reward Models and LLMs-as-Judges on the FC-RewardBench dataset.

Best-of-N performance gains across 5 benchmarks for different generator models (xLAM-2 and Qwen3 series).

Main Takeaways

Small Language Models (SLMs) benefit most from ToolRM-guided sampling: with Best-of-N selection, Qwen3-0.6B matches or surpasses 70B-parameter models that use greedy decoding.
General-purpose reward models and even tool-augmented RMs (like Themis) perform poorly on function-calling verification, often failing to detect subtle parameter errors.
FC-RewardBench accuracy correlates strongly with downstream Best-of-N performance (correlation of 0.84), validating it as a reliable proxy for RM evaluation.
Improvements diminish for very large generator models (32B+), suggesting they make fewer errors that are detectable/correctable by the reward model.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and function calling
Familiarity with Reward Modeling (RM) and Reinforcement Learning from Human Feedback (RLHF)
Knowledge of Bradley-Terry preference models

Key Terms

ORM: Outcome Reward Model—a model that scores the final output of a system rather than intermediate steps

PRM: Process Reward Model—a model that evaluates intermediate reasoning steps

Tool-calling: The capability of an LLM to generate structured outputs (like JSON) to invoke external functions or APIs

Best-of-N: An inference strategy where N solutions are generated, scored by a reward model, and the highest-scoring one is selected

BFCL: Berkeley Function Calling Leaderboard—a standard benchmark for evaluating LLM tool-use capabilities

Bradley-Terry model: A statistical model used to predict the probability that one item is preferred over another in a pairwise comparison

SFT: Supervised Fine-Tuning—training a model on a dataset of labeled input-output pairs