Evaluation Setup
Auditing 6 LLM agents on 3 high-impact financial tasks using SAEA to measure 9 risk metrics
Benchmarks:
- Finance Management (cryptocurrency use cases: Bitcoin, Ethereum, Binance)
- Webshop Automation (Online shop and Shopify integrations)
- Transactional Services (Bank and PayPal scenarios)
Metrics:
- Hallucination severity
- Temporal accuracy
- Confidence score
- Adversarial robustness
- Explanation clarity
- Error propagation
- Prompt sensitivity
- Response degradation
- Stress testing
Statistical methodology: Not explicitly reported in the paper
Key Results
The following results compare 'Safe' vs. 'Unsafe' trajectory scores (reported as Safe/Unsafe) across different models and tasks. Lower scores generally indicate lower risk presence.
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| Finance Management | Hallucination severity | 5.0/28.3 | 0.0/22.5 | -5.0/-5.8 |
| Finance Management | Adversarial robustness | 8.3/27.2 | 0.0/17.2 | -8.3/-10.0 |
| Transactional Services | Error propagation | 35.0/29.6 | 25.0/15.0 | -10.0/-14.6 |
| Webshop Automation | Stress testing | 22.5/31.0 | 0.0/18.5 | -22.5/-12.5 |
| Finance Management | Temporal accuracy | 18.3/38.2 | 3.3/21.7 | -15.0/-16.5 |
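The Δ values in the table appear to be the component-wise difference between the "This Paper" and "Baseline" cells, computed separately for the Safe and Unsafe scores. The sketch below reproduces that arithmetic for the reported rows. The helper names `parse_pair` and `delta` are hypothetical and introduced only for illustration; they are not part of SAEA or the paper.

```python
def parse_pair(cell: str) -> tuple[float, float]:
    """Split a 'Safe/Unsafe' cell such as '5.0/28.3' into two floats."""
    safe, unsafe = cell.split("/")
    return float(safe), float(unsafe)


def delta(baseline: str, this_paper: str) -> str:
    """Return 'This Paper' minus 'Baseline' for each component, to one decimal."""
    b_safe, b_unsafe = parse_pair(baseline)
    p_safe, p_unsafe = parse_pair(this_paper)
    return f"{p_safe - b_safe:.1f}/{p_unsafe - b_unsafe:.1f}"


# Rows copied from the table: (benchmark, metric, baseline, this paper).
rows = [
    ("Finance Management", "Hallucination severity", "5.0/28.3", "0.0/22.5"),
    ("Finance Management", "Adversarial robustness", "8.3/27.2", "0.0/17.2"),
    ("Transactional Services", "Error propagation", "35.0/29.6", "25.0/15.0"),
    ("Webshop Automation", "Stress testing", "22.5/31.0", "0.0/18.5"),
    ("Finance Management", "Temporal accuracy", "18.3/38.2", "3.3/21.7"),
]

for benchmark, metric, baseline, this_paper in rows:
    print(f"{benchmark} | {metric} | {delta(baseline, this_paper)}")
```

Because lower scores indicate lower risk, a negative Δ in either component corresponds to a reduction in measured risk relative to the baseline.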
Main Takeaways
- Accuracy does not equate to safety: Models with high performance on standard metrics can still exhibit severe vulnerabilities (e.g., hallucination, prompt injection) when audited for risk.
- Risk is domain-sensitive: Failure modes vary significantly by task (e.g., adversarial robustness scores differ between Finance Management and Transactional Services for the same model).
- Hidden failures revealed: SAEA uncovered risks like error propagation and temporal staleness that standard benchmarks miss, particularly when multiple perturbations are combined.
- Smaller models (e.g., Llama-3.1-8b) tend to have higher risk scores across multiple dimensions than larger models such as GPT-4o or DeepSeek-R1.