Standard Benchmarks Fail -- Auditing LLM Agents in Finance Must Prioritize Risk

📝 Paper Summary

LLM Agent Evaluation · AI Safety in Finance

Financial LLM agents should be evaluated on their risk profile (safety, hallucination, adversarial robustness) rather than just accuracy, using a three-level auditing framework called SAEA.

Core Problem

Standard financial benchmarks measure task performance (accuracy, F1) but overlook critical safety risks like hallucinations, stale data, and adversarial vulnerabilities, creating an illusion of reliability.

Why it matters:

Financial systems are adversarial and tightly coupled; a minor error (e.g., an incorrect exchange rate) can cascade into multi-million-dollar losses
Current benchmarks are static and accuracy-focused, failing to capture dynamic failure modes like prompt injection or tool misuse in high-stakes environments
High-performing agents on leaderboards can still exhibit dangerous behaviors, exposing institutions to systemic and regulatory risks

Concrete Example: A user asks an agent to withdraw Bitcoin. A high-accuracy agent might hallucinate a complete address from a partial input ('bc1q...') and execute a transaction to a non-existent or wrong wallet, causing irreversible loss, whereas a safe agent would stop and request clarification.
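The safe behavior in this example can be sketched as a guard that runs before any transfer is executed. This is a minimal illustration and not part of SAEA: the function name `resolve_withdrawal_address`, the return shape, and the simplified pattern are assumptions, and a production check would also have to verify the bech32 checksum.

```python
import re

# Assumption: a simplified shape check for mainnet SegWit (bech32) addresses.
# It only tests prefix, character set, and a plausible length window
# (42 to 62 characters in total); it does NOT verify the checksum.
BECH32_SHAPE = re.compile(r"^bc1[02-9ac-hj-np-z]{39,59}$")


def resolve_withdrawal_address(user_input: str) -> dict:
    """Return a 'proceed' decision only for a complete-looking address.

    Anything truncated or malformed yields 'ask_user' so the agent
    requests clarification instead of guessing the missing characters.
    """
    candidate = user_input.strip().lower()
    elided = "..." in candidate or "\u2026" in candidate
    if elided or not BECH32_SHAPE.fullmatch(candidate):
        return {"action": "ask_user",
                "reason": "address is incomplete or malformed"}
    return {"action": "proceed", "address": candidate}


# The partial input from the example is rejected rather than completed.
decision = resolve_withdrawal_address("bc1q...")
print(decision["action"])
```

The design point is that the guard never fills in missing characters: the only two outcomes are an exact, user-supplied address or a request for clarification.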

Key Novelty

Safety-Aware Evaluation Agent (SAEA) Framework

Shifts evaluation from performance-centric metrics (Accuracy/F1) to a risk-centric auditing paradigm rooted in financial risk engineering
Implements a three-level audit taxonomy: Model (intrinsic faults), Workflow (process reliability/error propagation), and System (integration/external events)
Acts as a 'shadow auditor' that wraps existing tasks to probe agents for specific failure modes like temporal staleness or adversarial vulnerability without needing new training datasets
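One way to picture the 'shadow auditor' idea is a thin wrapper that forwards every call to the unmodified agent, records the exchange, and lets probes inspect the record afterwards. The sketch below is a hypothetical illustration: `ShadowAuditor`, `AuditLevel`, and the `probe` signature are invented names, not the paper's API.

```python
from enum import Enum
from typing import Callable, Dict, List


class AuditLevel(Enum):
    """The three audit levels named in the summary."""
    MODEL = "model"        # intrinsic faults
    WORKFLOW = "workflow"  # process reliability / error propagation
    SYSTEM = "system"      # integration / external events


class ShadowAuditor:
    """Wraps an agent callable without changing its behavior."""

    def __init__(self, agent: Callable[[str], str]):
        self.agent = agent
        self.log: List[Dict[str, str]] = []

    def __call__(self, prompt: str) -> str:
        response = self.agent(prompt)  # the agent runs exactly as before
        self.log.append({"prompt": prompt, "response": response})
        return response

    def probe(self, level: AuditLevel, name: str,
              check: Callable[[Dict[str, str]], bool]) -> dict:
        """Count logged exchanges that a risk check flags."""
        flagged = sum(1 for entry in self.log if check(entry))
        return {"level": level.value, "probe": name,
                "flagged": flagged, "total": len(self.log)}


# Toy agent standing in for a real LLM agent (assumption).
auditor = ShadowAuditor(lambda prompt: "sent to bc1q...")
auditor("withdraw 0.1 BTC")
report = auditor.probe(AuditLevel.MODEL, "elided_address",
                       lambda e: "..." in e["response"])
print(report)
```

Because the wrapper only observes input and output, the same probes can be attached to black-box and open-weights agents alike, which matches the claim that no retraining or new dataset is needed.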

Architecture

Overview of the SAEA auditing pipeline.

Evaluation Highlights

DeepSeek-R1 scored 0.0 on Hallucination Severity for safe trajectories in Finance Management, but jumped to 25.0 on unsafe ones, showing risk varies by context
Llama-3.1-8b exhibited high Error Propagation scores (35.0) in Transactional Services for safe trajectories, indicating significant vulnerability to cascading errors even when 'correct'
Claude-3.5-Sonnet's Adversarial Robustness scores (safe/unsafe) differed sharply by domain: 0.0/11.0 in Transactional Services vs 0.0/28.3 in Finance Management, revealing domain-sensitive fragility

Breakthrough Assessment

7/10

Strong conceptual shift toward risk engineering in financial AI. Although it does not introduce a new model architecture, the auditing framework addresses a critical gap in deployment safety that standard leaderboards ignore.

⚙️ Technical Details

Problem Definition

Setting: Auditing black-box and open-weights financial agents under risk-specific constraints

Inputs: Agent trajectory D (sequence of observations, thoughts, actions) and task T

Outputs: Risk profile containing safety scores across varying dimensions (e.g., hallucination, confidence, robustness)
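The inputs and outputs above map naturally onto a few small record types. The following is a sketch under assumptions: the class and field names are hypothetical, and the scores in the usage lines are illustrative values rather than results from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One element of the trajectory D."""
    observation: str
    thought: str
    action: str


@dataclass
class AuditInput:
    task: str               # T: the task given to the agent
    trajectory: List[Step]  # D: ordered observations, thoughts, actions


@dataclass
class RiskProfile:
    """Safety scores keyed by risk dimension (higher = more risk here)."""
    scores: Dict[str, float] = field(default_factory=dict)

    def worst(self) -> str:
        """Name of the dimension with the highest risk score."""
        return max(self.scores, key=self.scores.get)


audit_input = AuditInput(
    task="Withdraw Bitcoin to the user's wallet",
    trajectory=[Step("user gave 'bc1q...'",
                     "address looks partial",
                     "ask user for full address")],
)
# Illustrative numbers only (assumption), not taken from the paper.
profile = RiskProfile({"hallucination_severity": 25.0,
                       "error_propagation": 35.0})
print(profile.worst())
```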

Pipeline Flow

Task & Trajectory Analysis (SAEA observes agent I/O)
Evaluation Agent Selection (Selects specific risk evaluators)
Metric Aggregation (Compiles risk profile)
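The three stages can be wired together roughly as follows. This is a hedged sketch, not the authors' implementation: the source uses an evaluator LLM for analysis and selection, whereas the keyword routing and string heuristics below are stand-ins chosen so the example runs offline, and every function name is hypothetical.

```python
from typing import Callable, Dict, List

# An evaluator takes the task and trajectory and returns a 0-100 risk score.
Evaluator = Callable[[str, List[str]], float]


def hallucination_probe(task: str, trajectory: List[str]) -> float:
    # Toy heuristic (assumption): a step that still contains an elided
    # value such as 'bc1q...' is treated as a possible fabrication point.
    hits = sum("..." in step for step in trajectory)
    return 100.0 * hits / max(len(trajectory), 1)


def temporal_probe(task: str, trajectory: List[str]) -> float:
    # Toy heuristic (assumption): steps that rely on cached data are stale.
    hits = sum("cached" in step.lower() for step in trajectory)
    return 100.0 * hits / max(len(trajectory), 1)


REGISTRY: Dict[str, Evaluator] = {
    "hallucination_severity": hallucination_probe,
    "temporal_accuracy": temporal_probe,
}


def select_evaluators(task: str) -> List[str]:
    """Stages 1-2: inspect the task and pick the relevant risk evaluators."""
    names = ["hallucination_severity"]
    if any(word in task.lower() for word in ("price", "rate", "balance")):
        names.append("temporal_accuracy")
    return names


def audit(task: str, trajectory: List[str]) -> Dict[str, float]:
    """Stage 3: run the selected evaluators and compile the risk profile."""
    return {name: REGISTRY[name](task, trajectory)
            for name in select_evaluators(task)}


print(audit("Withdraw Bitcoin at the current price",
            ["lookup cached price", "send to bc1q..."]))
```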

System Modules

Task & Trajectory Analysis (Auditing Core)

Review task and trajectory to identify potential risk surfaces and select relevant metrics

Model or implementation: Evaluator LLM (e.g., GPT-4o, Claude-3.5-Sonnet)

Evaluation Agent (Auditing Core)

Execute specific risk probes (e.g., check for hallucination, test adversarial prompts)

Model or implementation: Evaluator LLM (e.g., GPT-4o, Claude-3.5-Sonnet)

Metric Aggregator

Synthesize individual probe results into a composite risk profile

Model or implementation: Algorithmic Aggregation
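The summary does not specify the aggregation rule. As one plausible reading, and purely as an assumption, the sketch below averages repeated probe scores per metric; the function name `aggregate` and the input format are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def aggregate(probe_results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Combine (metric, score) probe results into one score per metric.

    Assumption: an unweighted arithmetic mean, rounded to one decimal.
    The paper may use a different rule.
    """
    by_metric: Dict[str, List[float]] = defaultdict(list)
    for metric, score in probe_results:
        by_metric[metric].append(score)
    return {metric: round(mean(scores), 1)
            for metric, scores in by_metric.items()}


# Illustrative probe outputs (assumption), not data from the paper.
results = [
    ("hallucination_severity", 0.0),
    ("hallucination_severity", 50.0),
    ("error_propagation", 35.0),
]
print(aggregate(results))
```

A mean is the simplest choice; a risk-averse deployment might prefer the maximum per metric so that a single severe probe result is not diluted.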

Novel Architectural Elements

Three-level risk auditing taxonomy (Model, Workflow, System), adapted from Basel II/III and NIST guidelines and applied to LLM agents
Shadow auditing protocol that wraps existing agent workflows without requiring retraining

Modeling

Base Models (evaluated agents): GPT-4o, Claude-3.5-Sonnet, Qwen3-235b-a22b, DeepSeek-R1, Llama-3.3-70b, Llama-3.1-8b

Comparison to Prior Work

vs. PIXIU/FinCon: SAEA focuses on safety/risk profiles (hallucination, robustness) rather than task accuracy or financial return metrics
vs. R-Judge: SAEA introduces a multi-level taxonomy (model/workflow/system) and stress-testing, whereas R-Judge focuses on static safety judgment tasks
vs. InvestorBench [not cited in paper]: InvestorBench focuses on decision-making quality, while SAEA focuses specifically on failure modes and operational risk

Limitations

Accessing real-time financial data for stress testing is difficult and expensive
Constructing risk-focused scenarios requires significant domain expertise and human-in-the-loop oversight
Auditing imposes computational overhead and latency, potentially conflicting with high-frequency trading requirements
Evaluator LLMs (used to judge the agents) may themselves have biases or errors

Reproducibility

Code: https://chen-zichen.github.io/SAEA/

SAEA is described as an open, modular protocol. The code link above is a GitHub Pages project site. The evaluation uses widely available models (GPT-4o, Llama-3 variants, etc.) and tasks from the R-Judge benchmark.

📊 Experiments & Results

Evaluation Setup

Auditing 6 LLM agents on 3 high-impact financial tasks using SAEA to measure 9 risk metrics

Benchmarks:

Finance Management (cryptocurrency use cases: Bitcoin, Ethereum, Binance)
Webshop Automation (Online shop and Shopify integrations)
Transactional Services (Bank and PayPal scenarios)

Metrics:

Hallucination severity
Temporal accuracy
Confidence score
Adversarial robustness
Explanation clarity
Error propagation
Prompt sensitivity
Response degradation
Stress testing

Statistical methodology: Not explicitly reported in the paper

Key Results

The following results compare 'Safe' vs 'Unsafe' trajectory scores across different models and tasks. Each cell is written as Safe/Unsafe. Lower scores generally indicate lower risk presence.

Benchmark	Metric	Baseline (Safe/Unsafe)	This Paper (Safe/Unsafe)	Δ (Safe/Unsafe)
Finance Management	Hallucination severity	5.0/28.3	0.0/22.5	-5.0/-5.8
Finance Management	Adversarial robustness	8.3/27.2	0.0/17.2	-8.3/-10.0
Transactional Services	Error propagation	35.0/29.6	25.0/15.0	-10.0/-14.6
Webshop Automation	Stress testing	22.5/31.0	0.0/18.5	-22.5/-12.5
Finance Management	Temporal accuracy	18.3/38.2	3.3/21.7	-15.0/-16.5
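
The Δ column is the 'This Paper' value minus the 'Baseline' value, computed separately for the safe and unsafe scores. The short sketch below reproduces two rows of the table; the helper names `parse_pair` and `delta` are illustrative, not from the paper.

```python
from typing import Tuple


def parse_pair(cell: str) -> Tuple[float, float]:
    """Split a 'safe/unsafe' table cell into two floats."""
    safe, unsafe = cell.split("/")
    return float(safe), float(unsafe)


def delta(baseline: str, this_paper: str) -> Tuple[float, float]:
    """Per-trajectory-type change, rounded to one decimal like the table."""
    base = parse_pair(baseline)
    new = parse_pair(this_paper)
    return (round(new[0] - base[0], 1), round(new[1] - base[1], 1))


# Finance Management, Hallucination severity
print(delta("5.0/28.3", "0.0/22.5"))    # (-5.0, -5.8)
# Transactional Services, Error propagation
print(delta("35.0/29.6", "25.0/15.0"))  # (-10.0, -14.6)
```

Negative values in both positions mean the score dropped for safe and unsafe trajectories alike.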

Main Takeaways

Accuracy does not equate to safety: Models with high performance on standard metrics can still exhibit severe vulnerabilities (e.g., hallucination, prompt injection) when audited for risk.
Risk is domain-sensitive: Failure modes vary significantly by task (e.g., adversarial robustness scores differ between Finance Management and Transactional Services for the same model).
Hidden failures revealed: SAEA uncovered risks like error propagation and temporal staleness that standard benchmarks miss, particularly when multiple perturbations are combined.
Smaller models (e.g., Llama-3.1-8b) tend to have higher risk scores across multiple dimensions compared to larger models like GPT-4o or DeepSeek-R1.

📚 Prerequisite Knowledge

Prerequisites

Financial risk management concepts (Model risk, Operational risk)
LLM agent architectures (Tool use, Chain-of-Thought)
Adversarial attack concepts (Prompt injection)

Key Terms

SAEA: Safety-Aware Evaluation Agent—a framework that shadows an agent's input/output stream to audit risks at model, workflow, and system levels

Model Context Protocol (MCP): A standard for connecting AI assistants to systems where data lives, enabling logging of tool use and queries

CoT: Chain-of-Thought—prompting technique where models generate intermediate reasoning steps before the final answer

Hallucination: When an LLM generates factually incorrect or non-existent information, such as fabricating earnings data or crypto addresses

Adversarial Robustness: The ability of a model to maintain safety and correct behavior when faced with malicious or manipulative inputs (prompt injections)

Zero-shot CoT: Asking a model to reason step-by-step without providing specific examples in the prompt