DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

📝 Paper Summary

Security evaluation for agents Adversarial attacks on agents

DoomArena is a modular, plug-in framework that injects configurable adversarial attacks into existing agent benchmarks (like BrowserGym and OSWorld) to evaluate agent security under realistic, evolving threat models.

Core Problem

Existing agent security benchmarks use static, pre-defined malicious prompts that fail to capture the dynamic, stateful risks agents face in interactive environments.

Why it matters:

Agents deployed in the enterprise face evolving threats like data leakage and unauthorized transactions that static benchmarks miss
Current testing lacks flexibility to model specific threat vectors (e.g., malicious user vs. malicious environment) separately
Developers need a way to test agents against new attacks without rebuilding the entire evaluation environment from scratch

Concrete Example: A benign user asks an agent to order a product, but the agent interacts with a malicious retail API (the 'environment'). A static benchmark checking for 'malicious user prompts' would miss this, but DoomArena can inject an attack specifically into the API response to steal PII.

Key Novelty

Plug-in Adversarial Injection Framework

Decouples attack logic from the environment, allowing the same generic attacks (e.g., PII extraction) to be applied across different domains (web, tools, OS)
Introduces 'Attack Configs' that tag specific components of the agent-user-environment loop as malicious, enabling precise threat modeling (e.g., untrusted user vs. compromised website)
Operates as a wrapper around standard Gym-like environments, injecting attacks dynamically during the observation step without altering the underlying agent or task logic

Architecture

Conceptual diagram of the DoomArena framework interacting with an agent loop.

Evaluation Highlights

Combined threat models (malicious user + malicious catalog) significantly amplify risks, raising attack success rates compared to single-vector attacks in τ-Bench
Guardrail defenses (LlamaGuard) failed to detect code interpreter abuse attacks in τ-Bench, showing near-zero effectiveness
Visual pop-up attacks on OSWorld achieved 78.6% success rate against GPT-4o agents, significantly higher than the 22.9% success rate against Claude-3.7-Sonnet

Breakthrough Assessment

9/10

Highly significant contribution. It shifts agent security from static datasets to dynamic, environment-agnostic testing. Its ability to plug into major benchmarks (BrowserGym, OSWorld) ensures immediate practical impact.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of AI agents in interactive environments where specific components (User, Environment, Tools) can be compromised by an adversary.

Inputs: Benign task instructions (from user) + Observation stream (from environment, potentially modified by DoomArena)

Outputs: Agent actions (which may inadvertently fulfill the attacker's goal)

Pipeline Flow

Task Initialization (Environment sets up task)
Attack Injection (DoomArena modifies state/observation based on Config)
Agent Processing (Agent receives modified observation)
Action Execution (Agent acts on environment)
Success Verification (DoomArena checks if attack goal was met)

System Modules

Attack Config

Defines the threat model: selects the target component, the specific attack payload, and success criteria

Model or implementation: Configuration Object

Attack Gateway

Intercepts environment observations and injects malicious content before they reach the agent

Model or implementation: Python Wrapper (Gym Interface)

Agent

Performs the task based on observations

Model or implementation: GPT-4o or Claude-3.5-Sonnet (as tested)

Success Filter

Monitors the execution trace to determine if the attack succeeded

Model or implementation: Rule-based or LLM-based verifier

Novel Architectural Elements

Modular Attack Gateway interface that wraps standard Gym environments (reset/step) to inject adversarial content dynamically at inference time
Decoupled Attack Library where attack logic is separated from environment details, allowing reuse of attacks (e.g., PII theft) across Web, Tool, and OS domains

Modeling

Base Model: Evaluated on GPT-4o, Claude-3.5-Sonnet, and Claude-3.7-Sonnet

Training Method: Not applicable — this is an evaluation framework, not a training method

Compute: Not reported in the paper (Evaluation only)

Comparison to Prior Work

vs. AgentHarmBench: DoomArena evaluates interactive/stateful attacks via environment injection, whereas AgentHarmBench uses static prompt datasets
vs. AgentDojo: DoomArena plugs into external SOTA benchmarks (BrowserGym, τ-Bench) via gateways, whereas AgentDojo is restricted to its own custom tasks
vs. PyRIT: DoomArena is specifically designed for agentic loops (multi-step, tool use), whereas PyRIT focuses on single-turn or chat-based generative AI red teaming
+ 1 more
vs. TensorTrust [not cited in paper]: TensorTrust focuses on prompt injection in static text, whereas DoomArena handles multi-modal injections (visual overlays, HTML attributes) in agent loops

Limitations

Focuses on inference-time attacks; does not natively support training-time poisoning (though triggers can be simulated)
Requires implementation of an 'Attack Gateway' for each new environment (though common ones are provided)
Success filters for attacks often require environment-specific logic (e.g., defining what constitutes a 'refund' in a specific API)

Reproducibility

Code: https://github.com/ServiceNow/DoomArena

📊 Experiments & Results

Evaluation Setup

Security testing of SOTA agents across three domains: Tool-calling (τ-Bench), Web browsing (BrowserGym/WebArena), and Computer use (OSWorld).

Benchmarks:

τ-Bench (Tool-calling (Airline and Retail domains))
BrowserGym (WebArena) (Web browsing (Reddit and Shopping domains))
OSWorld (Desktop computer control (OS manipulation))

Metrics:

Attack Success Rate (ASR)
Task Success Rate (TSR)
Stealth Rate
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
τ-Bench results showing the impact of threat models (Malicious User vs. Catalog) on attack success rates.
τ-Bench (Retail)	Attack Success Rate (ASR)	2.7	39.1	+36.4
WebArena (Reddit) results showing high vulnerability to pop-up attacks.
WebArena-Reddit	Attack Success Rate (ASR)	0	97.4	+97.4
WebArena-Reddit	Attack Success Rate (ASR)	0	88.5	+88.5
OSWorld results demonstrating the impact of visual pop-up attacks on desktop agents.
OSWorld	Attack Success Rate (ASR)	22.9	78.6	+55.7
Defense evaluation showing the failure of standard guardrails.
τ-Bench	Detection Rate	0	0	0

Experiment Figures

Bar chart of reported AI vulnerabilities over time (2021-2025).

Main Takeaways

No single agent is Pareto dominant across all threat models; vulnerability varies significantly by context (e.g., Claude-3.5-Sonnet is robust to shopping pop-ups but vulnerable on Reddit)
Constructive interference: When multiple attack vectors are combined (e.g., Malicious User + Malicious Catalog), they often work together to lower task success and raise attack success more than either alone
Standard guardrails like LlamaGuard are ineffective against complex agentic attacks, while frontier LLM-as-a-judge defenses work better but still allow non-trivial attack rates
Visual attacks (pop-ups) are highly effective against multimodal agents (OSWorld), with GPT-4o showing particular vulnerability compared to Claude-3.7

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (Observation-Thought-Action loops)
Familiarity with OpenAI Gym interface (reset/step)
Basic knowledge of prompt injection and adversarial attacks

Key Terms

ASR: Attack Success Rate—the fraction of tasks where the attacker's goal (e.g., stealing data) is achieved

TSR: Task Success Rate—the fraction of tasks where the agent successfully completes the user's original goal

Stealth Rate: The fraction of tasks where the attack succeeds AND the agent still completes the original user task (making the attack harder to notice)

Attack Gateway: A wrapper module that interfaces DoomArena with a specific environment (e.g., BrowserGym), handling the injection of malicious content into observations

Attack Config: A specification defining which component is malicious (User vs. Environment), what attack to use, and what constitutes success

Threat Model: A definition of which parts of the system are untrusted; for example, a 'Malicious User' model assumes the user input contains attacks

Gymnasium: A standard API for reinforcement learning environments where agents interact via reset() and step() methods

Guardrail: A safety mechanism (often a separate LLM) that monitors agent inputs/outputs and aborts execution if malicious content is detected

PII: Personally Identifiable Information—sensitive user data like names, addresses, or credit card numbers

ARIA labels: Accessibility attributes in HTML (e.g., aria-label) often used to hide prompt injections from humans while remaining visible to web agents