
FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun
Shanghai AI Laboratory, Hunan University, Xiamen University, Tencent, University of Chinese Academy of Sciences, Tongji University
arXiv (2026)
Agent Benchmark Factuality

📝 Paper Summary

Financial Tool Learning Agent Evaluation
FinToolBench is a benchmark for financial agents built on 760 executable tools. It evaluates not only task success but also finance-specific compliance: timeliness, intent restraint, and regulatory domain alignment.
Core Problem
Existing financial benchmarks rely on static textual analysis without executable tools, while general tool benchmarks lack the domain-specific rigor (timeliness, strict compliance) required for high-stakes finance.
Why it matters:
  • A syntactically correct tool call can be damaging if it retrieves stale data or accesses a mismatched market domain (e.g., equity vs. crypto)
  • Agents must distinguish between informational queries and transactional actions to avoid unauthorized execution
  • Current metrics fail to catch 'hallucinations of domain' or timeliness violations, which are critical recurring failure modes in finance
Concrete Example: If a user asks about cryptocurrency, calling equity market tools is a 'hallucination of domain.' Similarly, answering a request for 'current' exchange rates with a daily snapshot is a timeliness failure, even if the API call itself is valid.
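The two failures in this example can be sketched as a small audit over tool metadata. This is a minimal illustration, not the benchmark's implementation: the names `ToolProfile`, `QueryNeed`, and `audit_call`, the label strings, and the two example tools are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


# Hypothetical labels; the benchmark's actual annotation vocabulary may differ.
@dataclass(frozen=True)
class ToolProfile:
    name: str
    domain: str      # e.g. "equity", "crypto", "forex"
    freshness: str   # "realtime" or "daily"


@dataclass(frozen=True)
class QueryNeed:
    domain: str
    freshness: str


def audit_call(tool: ToolProfile, need: QueryNeed) -> list[str]:
    """Return the violations for a call that executed without errors."""
    violations = []
    if tool.domain != need.domain:
        violations.append("domain_mismatch")
    # Assumption: a real-time need cannot be met by a daily snapshot,
    # while a fresher tool than required is acceptable.
    if need.freshness == "realtime" and tool.freshness != "realtime":
        violations.append("timeliness_mismatch")
    return violations


# Illustrative tools and queries (not taken from the benchmark).
equity_quote = ToolProfile("equity_quote", domain="equity", freshness="realtime")
fx_daily = ToolProfile("fx_daily_snapshot", domain="forex", freshness="daily")

crypto_need = QueryNeed(domain="crypto", freshness="realtime")
fx_need = QueryNeed(domain="forex", freshness="realtime")

print(audit_call(equity_quote, crypto_need))
print(audit_call(fx_daily, fx_need))
```

In this sketch the first call is flagged for a domain mismatch and the second for a timeliness mismatch, although both would count as successful executions under a purely syntactic check.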
Key Novelty
Auditable Financial Compliance Evaluation
  • Establishes a realistic ecosystem of 760 executable free-tier tools (RapidAPI, AkShare) paired with 295 tool-required queries
  • Annotates every tool with finance-specific attributes (timeliness, intent type, regulatory domain) to enable automated compliance auditing
  • Decouples 'capability' (successful execution) from 'compliance' (adherence to finance constraints), introducing dedicated mismatch-rate metrics for the latter
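One way to picture the split between capability and compliance is to score them from separate fields of each logged tool call. The sketch below is an assumption-laden illustration: `CallRecord`, `score`, the metric names, and the choice to compute mismatch rates only over successfully executed calls are all hypothetical, and the paper may define its metrics and denominators differently.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    executed_ok: bool     # capability: the call ran and returned a result
    timeliness_ok: bool   # compliance: data freshness matched the query
    intent_ok: bool       # compliance: no transactional tool for an informational query
    domain_ok: bool       # compliance: tool's market domain matched the query


def rate(flags):
    """Fraction of True values; 0.0 for an empty sequence."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def score(records):
    # Assumption: compliance is judged only on calls that actually executed.
    executed = [r for r in records if r.executed_ok]
    return {
        "execution_success_rate": rate(r.executed_ok for r in records),
        "timeliness_mismatch_rate": rate(not r.timeliness_ok for r in executed),
        "intent_mismatch_rate": rate(not r.intent_ok for r in executed),
        "domain_mismatch_rate": rate(not r.domain_ok for r in executed),
    }


# Illustrative log of four tool calls (made-up values).
records = [
    CallRecord(True, True, True, True),
    CallRecord(True, False, True, False),
    CallRecord(False, True, True, True),
    CallRecord(True, True, False, False),
]
result = score(records)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Keeping the two families of numbers separate means an agent with a high execution success rate can still show high mismatch rates, which a single pass/fail score would hide.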
Breakthrough Assessment
8/10
Significant advance in evaluating agent trustworthiness by moving beyond binary execution success to measuring finance-specific constraints (timeliness, domain) in a fully runnable environment.