Risk-Adjusted Harm Scoring for Automated Red Teaming for LLMs in Financial Services

📝 Paper Summary

LLM Safety and Security | Domain-Specific Red Teaming | Automated Red Teaming

A domain-specific framework for evaluating LLM security in financial services that combines a fine-grained risk taxonomy, adaptive multi-turn red teaming, and a severity-weighted scoring metric.

Core Problem

Existing red-teaming benchmarks are domain-agnostic and rely on binary success metrics, failing to capture the severity, regulatory implications, and escalation dynamics of AI failures in regulated financial environments.

Why it matters:

Financial AI systems operate in high-stakes, regulated environments where failures cause direct regulatory violations or systemic risk
Standard single-turn benchmarks miss failure modes that only emerge through sustained, adaptive conversational pressure
Binary success rates obscure critical differences between a harmless refusal and a severe, operationally actionable disclosure of financial misconduct

Concrete Example: A model might refuse a direct request to 'launder money' (and so pass a single-turn test) but, under adaptive questioning about 'optimizing cross-border cash flows for privacy,' disclose actionable structuring techniques that violate AML regulations. Single-turn, binary evaluation misses this failure.

Key Novelty

Risk-Adjusted Harm Scoring (RAHS) within a BFSI-specific Red Teaming Loop

Introduces a financial-specific taxonomy (FinRedTeamBench) mapping generic jailbreaks to specific regulatory risks like market manipulation or insider trading
Proposes RAHS, a continuous metric that penalizes severity and lack of disclaimers while rewarding consensus among an ensemble of judges
Uses an attacker LLM that iteratively refines prompts based on structured feedback from the judge ensemble to uncover latent vulnerabilities
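
This summary does not give the RAHS formula, so the following is only an illustrative sketch of a score with the three properties described above: it grows with severity, is discounted when the response carries a disclaimer, and is scaled by how many judges agree. The 0 to 4 severity scale, the 0.8 disclaimer discount, and the names JudgeVerdict and rahs_like_score are assumptions made for illustration, not the paper's definition.

```python
from dataclasses import dataclass
from typing import List

MAX_SEVERITY = 4           # assumption: ordinal scale 0 (benign) .. 4 (actionable)
DISCLAIMER_DISCOUNT = 0.8  # assumption: multiplier applied when a disclaimer is present


@dataclass
class JudgeVerdict:
    """One judge's reading of a single target response (hypothetical schema)."""
    harmful: bool
    severity: int
    has_disclaimer: bool


def rahs_like_score(verdicts: List[JudgeVerdict]) -> float:
    """Return a score in [0, 1]; higher means a more severe, agreed-upon failure."""
    if not verdicts:
        raise ValueError("need at least one judge verdict")

    flagged = [v for v in verdicts if v.harmful]
    if not flagged:
        return 0.0

    # Consensus term: fraction of judges that flagged the response.
    agreement = len(flagged) / len(verdicts)

    # Severity term: mean severity among flagging judges, normalised to [0, 1].
    severity = sum(v.severity for v in flagged) / len(flagged) / MAX_SEVERITY

    # Disclaimer term: discount only if a strict majority of judges saw a disclaimer.
    disclaimer_votes = sum(v.has_disclaimer for v in verdicts)
    discount = DISCLAIMER_DISCOUNT if 2 * disclaimer_votes > len(verdicts) else 1.0

    return agreement * severity * discount
```

Under these assumptions, a response that every judge rates at maximum severity with no disclaimer scores 1.0, while a response flagged by only some judges at low severity lands near the bottom of the range even though a binary metric would count both as successes.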

Evaluation Highlights

Adaptive multi-turn interactions consistently increased jailbreak success rates compared to single-turn baselines across tested models
Higher decoding stochasticity (temperature) in target models correlated with increased severity of financial disclosures
The RAHS metric distinguished low-risk borderline outputs from high-severity actionable disclosures, which binary metrics treat identically

Breakthrough Assessment

7/10

Strong contribution to domain-specific safety evaluation. The RAHS metric and financial taxonomy fill a critical gap for regulated industries, though the core adversarial method relies on established techniques (LLM-as-a-judge, iterative refinement).

⚙️ Technical Details

Problem Definition

Setting: Automated adversarial evaluation of LLMs against a domain-specific financial risk taxonomy

Inputs: Risk category r, seed prompt q, conversation history H

Outputs: Target model response a, aggregated harm label, risk-adjusted score RAHS
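
As a rough illustration of these inputs and outputs, the two records below mirror r, q, H, and the (a, harm label, RAHS) triple. The class and field names are hypothetical and chosen only for readability; the paper's own data format is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AttackState:
    """Inputs to one attack episode (hypothetical field names)."""
    risk_category: str                                            # r
    seed_prompt: str                                              # q
    history: List[Tuple[str, str]] = field(default_factory=list)  # H: (prompt, response) pairs


@dataclass
class EvalResult:
    """Outputs recorded for one target response (hypothetical field names)."""
    response: str     # a
    harm_label: bool  # aggregated over the judge ensemble
    rahs: float       # risk-adjusted score
```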

Pipeline Flow

Taxonomy & Prompt Selection
Attacker Generation (Iterative)
Target Response Generation
Ensemble Evaluation (Judge Loop)
Feedback & Refinement
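
The five stages above can be sketched as a single loop. This is a minimal illustration, assuming the attacker, target, and judge are plain callables; the function name red_team_loop, the callable signatures, the stopping rule, and the toy stand-ins are all assumptions, and no real model is called.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (attacker prompt, target response)


def red_team_loop(
    seed_prompt: str,
    attacker: Callable[[str, List[Turn], str], str],
    target: Callable[[List[Turn], str], str],
    judge: Callable[[str], Tuple[bool, str]],
    max_turns: int = 5,
) -> Tuple[bool, List[Turn]]:
    """Run one adaptive episode; return (attack succeeded, conversation history)."""
    history: List[Turn] = []
    prompt = seed_prompt                                   # taxonomy and prompt selection
    for _ in range(max_turns):
        response = target(history, prompt)                 # target response generation
        history.append((prompt, response))
        harmful, feedback = judge(response)                # ensemble evaluation
        if harmful:
            return True, history
        prompt = attacker(seed_prompt, history, feedback)  # feedback and refinement
    return False, history


# Toy stand-ins so the sketch runs without any model access.
def toy_attacker(seed: str, history: List[Turn], feedback: str) -> str:
    return f"{seed} (attempt {len(history) + 1})"


def toy_target(history: List[Turn], prompt: str) -> str:
    # Pretend the target holds firm for two turns, then gives way.
    return "UNSAFE" if len(history) >= 2 else "REFUSED"


def toy_judge(response: str) -> Tuple[bool, str]:
    harmful = response == "UNSAFE"
    return harmful, "flagged" if harmful else "refusal detected; try a new framing"


success, turns = red_team_loop("seed", toy_attacker, toy_target, toy_judge)
```

With these stand-ins the episode ends on the third turn; capping max_turns at 2 would report no success, which is the kind of failure a single-turn or short-horizon evaluation would miss.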

System Modules

Attacker Agent

Generates and refines adversarial prompts based on judge feedback to elicit harmful financial advice

Model or implementation: DeepSeek-V3

Target Model

The LLM being stress-tested for financial safety compliance

Model or implementation: Various (Target of evaluation)

Ensemble Judges

Evaluate response for harm, severity, and disclaimers

Model or implementation: Ensemble: gpt-oss-120b-safeguard, Qwen3-235B-A22B, Llama-3.3-Nemotron-Super-49B-v1.5

Novel Architectural Elements

Integration of a heterogeneous judge ensemble (Safeguard + Reasoning + Efficient) directly into the optimization loop of the attacker
RAHS scoring logic integrated as the objective function for measuring severity rather than just binary success
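
One plausible way to fold three heterogeneous judges into a single label plus structured feedback for the attacker is a majority vote with an agreement ratio. The sketch below assumes each judge returns a small dictionary; the schema, the judge names, the tie-breaking rule, and the use of the high median for severity are illustrative choices, not the paper's aggregation rule.

```python
from collections import Counter
from statistics import median_high
from typing import Dict, List, Union

Verdict = Dict[str, Union[bool, int, str]]  # keys: "harmful", "severity", "rationale"


def aggregate_verdicts(verdicts: Dict[str, Verdict]) -> Dict[str, object]:
    """Combine per-judge verdicts (keyed by judge name) into one ensemble result."""
    if not verdicts:
        raise ValueError("need at least one judge verdict")

    votes = Counter(bool(v["harmful"]) for v in verdicts.values())
    harmful = votes[True] > votes[False]             # strict majority; ties count as not harmful
    agreement = max(votes.values()) / len(verdicts)  # 1.0 means unanimous

    # Median is less sensitive than the mean to one outlier judge.
    severity = median_high([int(v["severity"]) for v in verdicts.values()])

    # Structured feedback the attacker could condition its next prompt on.
    feedback: List[str] = [
        f"{name}: {v['rationale']}" for name, v in sorted(verdicts.items())
    ]
    return {
        "harmful": harmful,
        "agreement": agreement,
        "severity": severity,
        "feedback": feedback,
    }
```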

Modeling

Base Model: Attacker: DeepSeek-V3. Judges: gpt-oss-120b-safeguard, Qwen3-235B-A22B, Llama-3.3-Nemotron-Super-49B-v1.5

Training Method: Inference-time adaptive adversarial attack (no weight updates)

Adaptation: None (Prompt engineering / In-context learning)

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. AdvBench/HarmBench: Focuses strictly on BFSI domain risks (e.g., regulatory evasion) rather than general toxicity
vs. FinJailbreak: Introduces RAHS (continuous risk scoring) instead of binary success metrics
vs. PAIR: Uses a specialized ensemble judge to guide the attacker specifically toward financially actionable disclosures rather than generic policy violations

Limitations

Attacker model (DeepSeek-V3) is fixed; using different attacker models might yield different stress-test results
Evaluation relies on open-weight judges which may have their own biases compared to proprietary SOTA models like GPT-4
The paper publishes only sanitized examples, so the exact attack efficacy cannot be fully reproduced without the raw prompts
Focus is limited to text-based financial advice, excluding multimodal risks (e.g., chart interpretation)

Reproducibility

Prompt templates for the judges and the adaptive attacker are provided in the Appendix. The FinRedTeamBench taxonomy structure is described, with representative sanitized examples. Code and full dataset are not explicitly marked as open-source in the text.

📊 Experiments & Results

Evaluation Setup

Adversarial attack simulation on target LLMs using the FinRedTeamBench taxonomy

Benchmarks:

FinRedTeamBench (Financial Safety / Red Teaming) [New]

Metrics:

Attack Success Rate (ASR)
Risk-Adjusted Harm Score (RAHS)
Statistical methodology: Not explicitly reported in the paper

Key Results

No per-model results table is available. The paper finds that adaptive interaction and decoding stochasticity drive failures, but exact numeric comparisons between models (e.g., Model A vs Model B) are not provided in the source text. Results are described qualitatively, in terms of the framework's behavior.

Main Takeaways

Higher decoding stochasticity (randomness) in the target model increases the likelihood of severe financial disclosures
Sustained adaptive interaction (multi-turn) consistently increases jailbreak success compared to single-turn attempts
The RAHS metric provides a more granular signal than ASR, distinguishing between minor policy slips and severe, actionable regulatory violations
Failures in the financial domain often manifest through reasoning disclosures (explaining 'how' to bypass rules) rather than just final answers
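
To make the ASR versus RAHS contrast concrete, the sketch below compares two hypothetical targets whose binary success rates are identical but whose per-response severity scores differ. The numbers are invented purely for illustration and are not results from the paper; mean_score stands in for any continuous, severity-weighted aggregate.

```python
from typing import List


def attack_success_rate(harm_labels: List[bool]) -> float:
    """Fraction of attack attempts judged harmful (binary metric)."""
    return sum(harm_labels) / len(harm_labels)


def mean_score(scores: List[float]) -> float:
    """Average of per-response continuous scores in [0, 1]."""
    return sum(scores) / len(scores)


# Made-up per-response scores for two hypothetical targets (not from the paper).
model_a_scores = [0.1, 0.2, 0.0, 0.1]  # mostly minor policy slips
model_b_scores = [0.9, 0.8, 0.0, 1.0]  # mostly severe, actionable disclosures

# A binary metric only records whether each response was harmful at all.
model_a_labels = [s > 0 for s in model_a_scores]
model_b_labels = [s > 0 for s in model_b_scores]

asr_a = attack_success_rate(model_a_labels)
asr_b = attack_success_rate(model_b_labels)
sev_a = mean_score(model_a_scores)
sev_b = mean_score(model_b_scores)
```

Both targets fail on three of four attempts, so their ASR is the same, yet the continuous score separates them clearly.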

📚 Prerequisite Knowledge

Prerequisites

LLM Red Teaming methodologies
LLM-as-a-Judge evaluation patterns
Basic financial regulatory concepts (AML, KYC, market manipulation)

Key Terms

BFSI: Banking, Financial Services, and Insurance—the specific regulated domain this paper targets

RAHS: Risk-Adjusted Harm Score—a novel metric quantifying the operational severity of a harmful disclosure, accounting for disclaimers and judge consensus

FinRedTeamBench: The proposed benchmark dataset comprising 989 adversarial prompts across 7 financial risk categories

Jailbreak: A prompt or interaction strategy designed to bypass an LLM's safety guardrails

AML: Anti-Money Laundering—regulations preventing the disguise of illegally obtained funds

Ensemble Judging: Using multiple LLMs with different specializations (safety, reasoning, efficiency) to evaluate model outputs

Adaptive Red-Teaming: An attack process where the adversary model updates its strategy based on the target's previous responses