FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

📝 Paper Summary

Financial Tool Learning, Agent Evaluation

FinToolBench is a benchmark for financial agents built on 760 executable tools. It evaluates not only task success but also finance-specific compliance: timeliness, intent restraint, and regulatory domain alignment.

Core Problem

Existing financial benchmarks rely on static textual analysis without executable tools, while general tool benchmarks lack the domain-specific rigor (timeliness, strict compliance) required for high-stakes finance.

Why it matters:

A syntactically correct tool call can be damaging if it retrieves stale data or accesses a mismatched market domain (e.g., equity vs. crypto)
Agents must distinguish between informational queries and transactional actions to avoid unauthorized execution
Current metrics fail to catch 'hallucinations of domain' or timeliness violations, which are critical recurring failure modes in finance

Concrete Example: If a user asks about cryptocurrency, calling equity market tools is a 'hallucination of domain.' Similarly, answering a request for 'current' exchange rates with a daily snapshot is a failure, even if the API call itself is valid.
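The two failures in this example can be written as simple per-call checks. The sketch below is illustrative only: the label vocabulary (`realtime`, `daily`, `historical`), the freshness ranking, and both function names are assumptions made here, not the benchmark's actual schema.

```python
# Hypothetical freshness ranking: a higher value means fresher data.
FRESHNESS_RANK = {"historical": 0, "daily": 1, "realtime": 2}


def domain_mismatch(query_domain: str, tool_domain: str) -> bool:
    """True when the tool serves a different market domain than the query."""
    return query_domain != tool_domain


def timeliness_mismatch(required: str, provided: str) -> bool:
    """True when the tool's data is staler than the query requires."""
    return FRESHNESS_RANK[provided] < FRESHNESS_RANK[required]


# A crypto question answered with an equity tool: domain mismatch.
print(domain_mismatch("crypto", "equity"))       # True
# A request for 'current' FX rates served by a daily snapshot: timeliness mismatch.
print(timeliness_mismatch("realtime", "daily"))  # True
# A tool fresher than required is not flagged under this ranking.
print(timeliness_mismatch("daily", "realtime"))  # False
```

Both checks look only at metadata, so a call can be flagged even when the API request succeeds, which is the point the example makes.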

Key Novelty

Auditable Financial Compliance Evaluation

Establishes a realistic ecosystem of 760 executable free-tier tools (RapidAPI, AkShare) paired with 295 tool-required queries
Annotates every tool with finance-specific attributes (timeliness, intent type, regulatory domain) to enable automated compliance auditing
Decouples 'capability' (successful execution) from 'compliance' (adherence to finance constraints), introducing specific mismatch rate metrics for the latter
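One way to picture the per-tool annotation is as a small typed record. The following is a minimal sketch under stated assumptions: the class names, field names, enum values, and the example tool name are hypothetical, chosen only to mirror the three attributes the summary lists (timeliness, intent type, regulatory domain).

```python
from dataclasses import dataclass
from enum import Enum


class Timeliness(Enum):
    REALTIME = "realtime"
    DAILY = "daily"
    HISTORICAL = "historical"


class Intent(Enum):
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"


@dataclass(frozen=True)
class ToolAnnotation:
    """Hypothetical record holding the finance attributes of one tool."""
    name: str
    source: str              # e.g. "RapidAPI" or "AkShare"
    timeliness: Timeliness
    intent: Intent
    domain: str              # e.g. "equity", "crypto", "fx"


tool = ToolAnnotation(
    name="fx_daily_snapshot",  # made-up tool name for illustration
    source="RapidAPI",
    timeliness=Timeliness.DAILY,
    intent=Intent.INFORMATIONAL,
    domain="fx",
)
print(tool.timeliness.value, tool.intent.value, tool.domain)  # daily informational fx
```

Keeping the attributes as closed enums rather than free text is a design choice of this sketch: it makes an automated audit a plain comparison instead of a string-matching problem.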

Breakthrough Assessment

8/10

Significant advance in evaluating agent trustworthiness by moving beyond binary execution success to measuring finance-specific constraints (timeliness, domain) in a fully runnable environment.

⚙️ Technical Details

Problem Definition

Setting: Agentic tool use in a financial environment with strict compliance constraints

Inputs: Natural language financial query q requiring external data retrieval

Outputs: Executable tool trace τ and final answer derived from tool outputs

Pipeline Flow

User Query → Tool Retrieval (FATR Baseline)
Tool Selection & Argument Generation
Execution Environment (Real APIs)
Response Generation & Trace Logging
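The four stages can be sketched end to end with stand-ins for each component. Everything below is an assumption for illustration: the two stub tools return fixed placeholder values instead of calling real APIs, keyword overlap replaces FATR retrieval, and a hard-coded rule replaces the LLM's tool selection and argument generation.

```python
def get_fx_rate(pair: str) -> dict:
    return {"pair": pair, "rate": 1.0}  # placeholder value, not market data


def get_stock_quote(symbol: str) -> dict:
    return {"symbol": symbol, "price": 100.0}  # placeholder value


# Hypothetical in-memory registry standing in for the real tool library.
TOOLS = {
    "get_fx_rate": {
        "fn": get_fx_rate,
        "description": "current fx exchange rate for a currency pair",
    },
    "get_stock_quote": {
        "fn": get_stock_quote,
        "description": "latest stock quote for an equity symbol",
    },
}


def retrieve(query: str, k: int = 1) -> list[str]:
    """Stage 1: rank tools by word overlap with the query (stand-in for FATR)."""
    words = set(query.lower().split())
    ranked = sorted(
        TOOLS,
        key=lambda name: len(words & set(TOOLS[name]["description"].split())),
        reverse=True,
    )
    return ranked[:k]


def select_and_fill(query: str, candidates: list[str]) -> tuple[str, dict]:
    """Stage 2: an LLM would choose a tool and its arguments; hard-coded here."""
    name = candidates[0]
    args = {"pair": "EURUSD"} if name == "get_fx_rate" else {"symbol": "AAPL"}
    return name, args


def execute(name: str, args: dict) -> dict:
    """Stage 3: run the call and record the outcome as one trace entry."""
    try:
        output = TOOLS[name]["fn"](**args)
        return {"tool": name, "args": args, "ok": True, "output": output}
    except Exception as exc:
        return {"tool": name, "args": args, "ok": False, "error": str(exc)}


def run(query: str) -> tuple[str, list[dict]]:
    """Stage 4: produce an answer from the tool output and return the trace."""
    name, args = select_and_fill(query, retrieve(query))
    step = execute(name, args)
    answer = str(step.get("output", "tool call failed"))
    return answer, [step]


answer, trace = run("what is the current exchange rate for EURUSD")
print(trace[0]["tool"], trace[0]["ok"])  # get_fx_rate True
```

The trace is returned alongside the answer so that a later audit can inspect which tool was called, independently of whether the answer text looks correct.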

System Modules

Tool Retrieval (FATR)

Retrieve candidate tools and inject finance attributes (timeliness, domain) into tool cards

Model or implementation: Not specified in snippet
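Since the implementation is not specified, the following is only a guess at what attribute injection could look like: the finance attributes are rendered into the same text card as the tool description, so that whatever consumes the card sees them together. The card layout, the dictionary keys, and the example tool are assumptions.

```python
def render_tool_card(tool: dict) -> str:
    """Build a text card that places finance attributes next to the description."""
    lines = [
        f"Tool: {tool['name']}",
        f"Description: {tool['description']}",
        f"Timeliness: {tool['timeliness']}",
        f"Intent: {tool['intent']}",
        f"Domain: {tool['domain']}",
    ]
    return "\n".join(lines)


card = render_tool_card(
    {
        "name": "crypto_spot_price",  # made-up tool for illustration
        "description": "Spot price for a cryptocurrency pair",
        "timeliness": "realtime",
        "intent": "informational",
        "domain": "crypto",
    }
)
print(card)
```

With the attributes in the card text, a query that mentions "current" prices or a specific market has explicit fields to match against, rather than relying on the description alone.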

Execution Environment

Execute API calls and log structured traces for auditing

Model or implementation: Python Execution Engine
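A minimal sketch of structured trace logging, assuming a JSON Lines format with one record per call. The record fields (`tool`, `args`, `ok`, `output`, `error`, `started_at`) and the two stub tools are hypothetical; the summary does not describe the actual trace schema.

```python
import io
import json
import time


def logged_call(fn, args: dict, log) -> dict:
    """Call fn(**args) and append one JSON line describing the outcome to log."""
    record = {"tool": fn.__name__, "args": args, "started_at": time.time()}
    try:
        record["output"] = fn(**args)
        record["ok"] = True
    except Exception as exc:  # record the failure instead of aborting the run
        record["ok"] = False
        record["error"] = f"{type(exc).__name__}: {exc}"
    log.write(json.dumps(record, default=str) + "\n")
    return record


def quote_tool(symbol: str) -> dict:
    return {"symbol": symbol, "price": 100.0}  # placeholder value


def failing_tool(symbol: str) -> dict:
    raise TimeoutError("upstream API did not respond")  # simulated failure


log = io.StringIO()  # an open file would work the same way
logged_call(quote_tool, {"symbol": "AAPL"}, log)
logged_call(failing_tool, {"symbol": "AAPL"}, log)

records = [json.loads(line) for line in log.getvalue().splitlines()]
print([r["ok"] for r in records])  # [True, False]
```

Logging failures as records rather than raising keeps every attempted call visible to the audit, including calls that hit rate limits or timeouts on free-tier APIs.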

Evaluation Judge

Assess answer correctness and trace compliance

Model or implementation: GPT-5.1

Comparison to Prior Work

vs. FinanceBench: FinToolBench involves executable tools and dynamic data vs. static document QA
vs. StableToolBench: FinToolBench evaluates domain alignment and timeliness (compliance) vs. just execution success
vs. Finance Agent Benchmark: FinToolBench provides a large-scale library (760 tools) and call-level attribute auditing vs. limited mock interfaces

Limitations

Relies on free-tier APIs which may have rate limits or stability issues over time
Evaluation relies on an LLM judge (GPT-5.1), whose scores can vary between runs; three-repeat averaging is used to mitigate this
Requires automated filtering and majority-vote labeling for scaling, which may introduce noise compared to fully human-curated datasets

Reproducibility

Tool manifest, execution environment, and evaluation code will be open-sourced. The benchmark relies on free-tier tools from RapidAPI and AkShare to ensure accessibility without proprietary contracts. 760 tools were curated from an initial pool of 5,470 interfaces.

📊 Experiments & Results

Evaluation Setup

Agents execute real API calls to answer financial queries. Traces are logged and audited against metadata.

Benchmarks:

FinToolBench: Agentic Financial Tool Use (QA + Action) [New]

Metrics:

TIR (Tool Invocation Rate)
TESR (Tool Execution Success Rate)
Soft Score (Answer Correctness)
TMR (Timeliness Mismatch Rate)
IMR (Intent Mismatch Rate)
DMR (Domain Mismatch Rate)
Statistical methodology: Three-repeat averaging for LLM-based judging to reduce variance.
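The metrics above can be aggregated from per-question audit results roughly as follows. This is a sketch under assumptions: the result fields are hypothetical, every rate is taken over all questions (the summary does not state the exact denominators, for example whether TESR is per call or per question), and the toy data is invented to make the arithmetic easy to follow.

```python
from statistics import mean


def rate(flags) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(results: list[dict]) -> dict:
    """Aggregate per-question flags and judge scores into benchmark-style metrics."""
    return {
        "TIR": rate(r["invoked"] for r in results),
        "TESR": rate(r["executed_ok"] for r in results),
        "TMR": rate(r["timeliness_mismatch"] for r in results),
        "IMR": rate(r["intent_mismatch"] for r in results),
        "DMR": rate(r["domain_mismatch"] for r in results),
        # Average the three judge repeats per question, then across questions.
        "SoftScore": mean(mean(r["judge_scores"]) for r in results),
    }


def row(invoked, ok, tm, im, dm, scores):
    return {
        "invoked": invoked, "executed_ok": ok, "timeliness_mismatch": tm,
        "intent_mismatch": im, "domain_mismatch": dm, "judge_scores": scores,
    }


# Invented toy data: four questions, three judge repeats each.
results = [
    row(True, True, False, False, False, [1.0, 1.0, 1.0]),
    row(True, True, True, False, False, [0.5, 0.5, 0.5]),
    row(True, False, False, False, True, [0.0, 0.0, 0.0]),
    row(False, False, False, False, False, [0.0, 0.0, 0.0]),
]
print(summarize(results))
# {'TIR': 0.75, 'TESR': 0.5, 'TMR': 0.25, 'IMR': 0.0, 'DMR': 0.25, 'SoftScore': 0.375}
```

Note that the second question executes successfully and scores 0.5 on correctness yet still counts toward TMR, which is how capability and compliance stay decoupled.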

Main Takeaways

Current agent metrics are blind to critical financial failure modes: timeliness, intent restraint, and domain alignment.
A wrong tool call in finance can be more damaging than a wrong text answer because it appears grounded while using invalid data (e.g., stale prices).
FinToolBench provides the first testbed for 'auditable' agentic execution, where every step can be checked against regulatory and market constraints.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM agents and tool use (function calling)
Basic financial concepts (tickers, market domains)
Knowledge of API structures (parameters, endpoints)

Key Terms

RapidAPI: A large marketplace of third-party APIs used here for diverse, real-time financial services

AkShare: An open-source Python library providing reliable, research-oriented financial data interfaces

TMR: Timeliness Mismatch Rate, the fraction of questions where the agent used tools whose data was not fresh enough (e.g., a daily snapshot where real-time data was required)

IMR: Intent Mismatch Rate, the fraction of questions where the agent violated intent constraints (e.g., executing a transaction when only information was requested)

DMR: Domain Mismatch Rate, the fraction of questions where the agent used tools from the wrong regulatory domain (e.g., accessing crypto tools for a stock query)

FATR: Finance-Aware Tool Retrieval, a baseline method proposed in the paper that injects finance attributes into tool cards and stabilizes execution

Soft Score: A correctness metric for structured/free-text answers evaluated by an LLM judge

CSS: Compliance-aware Soft Score, the mean Soft Score over samples with successful tool execution
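Taking the CSS definition literally, it is a conditional mean: Soft Score averaged only over samples whose tool execution succeeded. A minimal sketch, assuming hypothetical field names `soft_score` and `executed_ok` and invented sample values:

```python
from typing import Optional


def css(samples: list[dict]) -> Optional[float]:
    """Mean Soft Score over samples with successful tool execution.

    Returns None when no sample executed successfully, since the mean
    is undefined in that case.
    """
    scores = [s["soft_score"] for s in samples if s["executed_ok"]]
    return sum(scores) / len(scores) if scores else None


samples = [
    {"soft_score": 0.5, "executed_ok": True},
    {"soft_score": 1.0, "executed_ok": True},
    {"soft_score": 0.0, "executed_ok": False},  # excluded from CSS
]
overall = sum(s["soft_score"] for s in samples) / len(samples)
print(css(samples), overall)  # 0.75 0.5
```

The gap between the two numbers shows what the conditioning does: CSS reports answer quality given that tools ran, while the unconditional mean also absorbs execution failures.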