
ToolRM: Outcome Reward Models for Tool-Calling Large Language Models

Mayank Agarwal, Ibrahim Abdelaziz, Kinjal Basu, Merve Unuvar, Luis A. Lastras, Yara Rizk, P. Kapanipathi
IBM Research, USA
arXiv.org (2025)
Agent RL Benchmark

📝 Paper Summary

Reward Modeling, Agentic AI, Tool Use / Function Calling
ToolRM is a suite of outcome reward models trained on synthetically generated incorrect tool calls to accurately evaluate and improve tool-use performance in large language models.
Core Problem
Existing reward models are designed primarily for natural language chat and struggle to detect the nuances of tool-based reasoning, such as subtle parameter errors or missing arguments.
Why it matters:
  • Current general-purpose reward models frequently miss key signals of effective tool use, leading to poor alignment in agentic workflows
  • There is no dedicated benchmark for evaluating reward models specifically on function-calling tasks, making it difficult to quantify improvements
  • Reliable automated evaluation is critical for scaling training techniques like reinforcement learning and rejection sampling without human labeling
Concrete Example: A model might generate a tool call with an incorrect parameter value or a missing required parameter (e.g., calling 'search' without its required 'query' argument). General-purpose reward models often score such a call highly because it has the structure of a valid function call, whereas ToolRM is trained to reject these subtle errors.
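To make the failure mode concrete, here is a minimal Python sketch of a call that parses as valid JSON yet omits a required argument. The schema layout, the `SEARCH_TOOL` definition, and the `missing_required_args` helper are all hypothetical and are not taken from the paper. A rule-based check like this only catches missing arguments; a learned reward model such as ToolRM is meant to also catch errors that no schema check can see, such as a plausible but wrong parameter value.

```python
import json

# Hypothetical tool schema, for illustration only (not from the paper).
SEARCH_TOOL = {
    "name": "search",
    "parameters": {
        "query": {"type": "string", "required": True},
        "max_results": {"type": "integer", "required": False},
    },
}


def missing_required_args(call_json: str, tool: dict) -> list[str]:
    """Return the required parameter names that a tool call does not supply."""
    call = json.loads(call_json)
    supplied = call.get("arguments", {})
    return [
        name
        for name, spec in tool["parameters"].items()
        if spec["required"] and name not in supplied
    ]


# Both strings are well-formed JSON function calls; only the second is usable.
bad_call = '{"name": "search", "arguments": {"max_results": 5}}'
good_call = '{"name": "search", "arguments": {"query": "reward models"}}'

print(missing_required_args(bad_call, SEARCH_TOOL))
print(missing_required_args(good_call, SEARCH_TOOL))
```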
Key Novelty
ToolRM (Tool Outcome Reward Model)
  • Trains a specialized outcome reward model (ORM) specifically for function calling by contrasting correct ground-truth calls against incorrect calls generated by a diverse pool of open-weight models
  • Introduces FC-RewardBench, a dataset of 1,500 difficult pairwise comparisons derived from the Berkeley Function Calling Leaderboard, to test how sensitive reward models are to tool-call errors
  • Demonstrates that an RM trained on this domain-specific synthetic data can significantly boost inference performance via Best-of-N sampling
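The contrast between correct and incorrect calls can be sketched as a pairwise objective. This summary does not state the exact training loss, so the Bradley-Terry style loss below is an assumption (it is a common choice for reward models), and the function names are hypothetical. The second helper shows how a pairwise benchmark in the style of FC-RewardBench could be scored: the reward model is counted as right on a pair when it ranks the correct call above the incorrect one.

```python
import math


def pairwise_loss(score_correct: float, score_incorrect: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_correct - score_incorrect)).

    Assumed objective, not confirmed by the summary. The loss shrinks as the
    correct call is scored further above the incorrect one.
    """
    margin = score_correct - score_incorrect
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (correct, incorrect) score pairs ranked in the right order."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    wins = sum(1 for correct, incorrect in pairs if correct > incorrect)
    return wins / len(pairs)


# Toy scores, invented for illustration.
toy_pairs = [(1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (0.5, 0.1)]
print(pairwise_loss(3.0, 0.0), pairwise_loss(0.0, 3.0))
print(pairwise_accuracy(toy_pairs))
```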
Evaluation Highlights
  • +24.9% improvement in accuracy for Qwen3-0.6B on downstream tool benchmarks using ToolRM for Best-of-32 sampling compared to greedy decoding
  • ToolRM-1.5B outperforms much larger models (including gpt-oss-120B) on the proposed FC-RewardBench evaluation dataset
  • Data filtering using ToolRM enables training fine-tuned models that outperform baselines while using only 50% of the training data
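The two downstream uses in this list, Best-of-N sampling and reward-based data filtering, both reduce to ranking candidates by reward score. The sketch below assumes a generic `score` callable standing in for a reward model; the function names are hypothetical. The summary does not say how the 50% subset was selected, so keeping the top-scoring half is an assumption.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward score (N = 32 for Best-of-32)."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=score)


def keep_top_fraction(
    samples: Sequence[str], score: Callable[[str], float], fraction: float = 0.5
) -> list[str]:
    """Keep the highest-scoring fraction of samples (assumed filtering rule)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(samples, key=score, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy scores standing in for reward-model outputs; values are invented.
toy_scores = {"call_a": 0.1, "call_b": 0.9, "call_c": 0.4, "call_d": 0.7}
candidates = list(toy_scores)

print(best_of_n(candidates, toy_scores.get))
print(keep_top_fraction(candidates, toy_scores.get, fraction=0.5))
```

Greedy decoding corresponds to taking a single candidate with no reranking, which is why the quality of the scorer determines how much Best-of-N can help.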
Breakthrough Assessment
8/10
Addresses a critical gap in agentic AI (reward modeling for tools) with a comprehensive solution: a new benchmark, a scalable synthetic data method, and strong empirical results showing significant gains on top of strong base models.