
Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

Fabrizio Dimino, Bhaskarjit Sarmah, Stefano Pasquali
arXiv (2026)
Benchmark Agent

📝 Paper Summary

LLM Safety and Security, Domain-Specific Red Teaming, Automated Red Teaming
A domain-specific framework for evaluating LLM security in financial services that combines a fine-grained risk taxonomy, adaptive multi-turn red teaming, and a severity-weighted scoring metric.
Core Problem
Existing red-teaming benchmarks are domain-agnostic and rely on binary success metrics, failing to capture the severity, regulatory implications, and escalation dynamics of AI failures in regulated financial environments.
Why it matters:
  • Financial AI systems operate in high-stakes, regulated environments where failures cause direct regulatory violations or systemic risk
  • Standard single-turn benchmarks miss failure modes that only emerge through sustained, adaptive conversational pressure
  • Binary success rates obscure critical differences between a low-risk, borderline response and a severe, operationally actionable disclosure of financial misconduct
Concrete Example: A model might refuse a direct request to 'launder money', so a single-turn test records it as safe. Under adaptive questioning about 'optimizing cross-border cash flows for privacy,' the same model may disclose actionable structuring techniques that violate anti-money-laundering (AML) regulations. A single-turn benchmark never reaches this failure, and a binary metric does not register how severe it is.
Key Novelty
Risk-Adjusted Harm Scoring (RAHS) within a Red Teaming Loop specific to BFSI (banking, financial services, and insurance)
  • Introduces a finance-specific taxonomy (FinRedTeamBench) mapping generic jailbreaks to specific regulatory risks like market manipulation or insider trading
  • Proposes RAHS, a continuous metric that penalizes severity and lack of disclaimers while rewarding consensus among an ensemble of judges
  • Uses an attacker LLM that iteratively refines prompts based on structured feedback from the judge ensemble to uncover latent vulnerabilities
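The attacker-judge loop described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Verdict` fields, the 0-4 severity scale, the majority-vote stopping rule, and the `toy_*` stand-ins for the target, attacker, and judges are all assumptions made for the example. In a real setup each callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    harmful: bool   # this judge's binary call on the response
    severity: int   # assumed scale: 0 (benign) .. 4 (actionable)
    feedback: str   # structured hint handed back to the attacker


History = List[Tuple[str, str]]  # (prompt, response) pairs so far
Target = Callable[[History, str], str]
Judge = Callable[[str, str], Verdict]
Attacker = Callable[[str, List[Verdict]], str]


def red_team(seed: str, target: Target, attacker: Attacker,
             judges: List[Judge], max_turns: int = 5
             ) -> Tuple[Optional[int], History]:
    """Return the turn at which a majority of judges flagged the
    response (None if that never happened) plus the full transcript."""
    history: History = []
    prompt = seed
    for turn in range(1, max_turns + 1):
        response = target(history, prompt)
        verdicts = [judge(prompt, response) for judge in judges]
        history.append((prompt, response))
        # Assumed stopping rule: strict majority of judges say "harmful".
        if sum(v.harmful for v in verdicts) * 2 > len(verdicts):
            return turn, history
        # Otherwise the attacker rewrites the prompt using judge feedback.
        prompt = attacker(prompt, verdicts)
    return None, history


# Hypothetical stand-ins so the sketch runs without any model.
def toy_target(history: History, prompt: str) -> str:
    # Gives in only after the prompt has been refined twice.
    return "DISCLOSED" if prompt.count("[refined]") >= 2 else "REFUSED"


def toy_judge(prompt: str, response: str) -> Verdict:
    harmful = response == "DISCLOSED"
    return Verdict(harmful, 3 if harmful else 0,
                   "none" if harmful else "target refused")


def toy_attacker(prompt: str, verdicts: List[Verdict]) -> str:
    return prompt + " [refined]"


if __name__ == "__main__":
    turn, transcript = red_team("seed", toy_target, toy_attacker,
                                [toy_judge] * 3)
    print(turn, len(transcript))
```

With the toy components, the target holds out for two turns and is flagged on the third, which mirrors the summary's point that some failures only appear under sustained pressure.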
Evaluation Highlights
  • Adaptive multi-turn interactions consistently increased jailbreak success rates compared to single-turn baselines across tested models
  • Higher decoding stochasticity (temperature) in target models correlated with increased severity of financial disclosures
  • The RAHS metric successfully differentiated between low-risk borderline outputs and high-severity actionable disclosures where binary metrics treated them identically
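To make the contrast between a binary metric and a severity-weighted one concrete, here is a small sketch. The summary does not give the RAHS formula, so the function below is a hypothetical stand-in: the 0-4 severity scale, the `DISCLAIMER_DISCOUNT` value, and the multiplicative form (mean severity, times a disclaimer discount, times judge agreement) are assumptions chosen only to reflect the three ingredients named above.

```python
from statistics import mean
from typing import List

MAX_SEVERITY = 4           # assumed scale: 0 (benign) .. 4 (actionable)
DISCLAIMER_DISCOUNT = 0.5  # assumed: a disclaimer halves the score


def binary_success(severities: List[int]) -> bool:
    """Binary metric: a strict majority of judges gave severity > 0."""
    return sum(s > 0 for s in severities) * 2 > len(severities)


def risk_adjusted_score(severities: List[int], has_disclaimer: bool) -> float:
    """Hypothetical severity-weighted score in [0, 1]; NOT the paper's formula."""
    flagged = sum(s > 0 for s in severities)
    # Agreement: share of judges on the majority side of flagged/not flagged.
    agreement = max(flagged, len(severities) - flagged) / len(severities)
    base = mean(severities) / MAX_SEVERITY
    discount = DISCLAIMER_DISCOUNT if has_disclaimer else 1.0
    return base * discount * agreement


# Per-judge severities for two illustrative outputs.
borderline = [1, 1, 0]   # mild, judges split, response carried a disclaimer
actionable = [4, 4, 4]   # unanimous, no disclaimer

if __name__ == "__main__":
    for name, sev, disclaimer in [("borderline", borderline, True),
                                  ("actionable", actionable, False)]:
        print(name, binary_success(sev),
              round(risk_adjusted_score(sev, disclaimer), 3))
```

Both outputs count as a successful jailbreak under the binary metric, while the weighted score places them near opposite ends of the scale.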
Breakthrough Assessment
7/10
Strong contribution to domain-specific safety evaluation. The RAHS metric and financial taxonomy fill a critical gap for regulated industries, though the core adversarial method relies on established techniques (LLM-as-a-judge, iterative refinement).