← Back to Paper List

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

L'eo Boisvert, Mihir Bansal, Chandra Kiran Reddy Evuru, Gabriel Huang, Abhay Puri, Avinandan Bose, Maryam Fazel, Quentin Cappart, Jason Stanley, Alexandre Lacoste, Alexandre Drouin, K. Dvijotham
ServiceNow Research, University of Washington, Seattle
arXiv.org (2025)
Agent Benchmark MM

📝 Paper Summary

Security evaluation for agents Adversarial attacks on agents
DoomArena is a modular, plug-in framework that injects configurable adversarial attacks into existing agent benchmarks (like BrowserGym and OSWorld) to evaluate agent security under realistic, evolving threat models.
Core Problem
Existing agent security benchmarks use static, pre-defined malicious prompts that fail to capture the dynamic, stateful risks agents face in interactive environments.
Why it matters:
  • Agents deployed in the enterprise face evolving threats like data leakage and unauthorized transactions that static benchmarks miss
  • Current testing lacks flexibility to model specific threat vectors (e.g., malicious user vs. malicious environment) separately
  • Developers need a way to test agents against new attacks without rebuilding the entire evaluation environment from scratch
Concrete Example: A benign user asks an agent to order a product, but the agent interacts with a malicious retail API (the 'environment'). A static benchmark checking for 'malicious user prompts' would miss this, but DoomArena can inject an attack specifically into the API response to steal PII.
Key Novelty
Plug-in Adversarial Injection Framework
  • Decouples attack logic from the environment, allowing the same generic attacks (e.g., PII extraction) to be applied across different domains (web, tools, OS)
  • Introduces 'Attack Configs' that tag specific components of the agent-user-environment loop as malicious, enabling precise threat modeling (e.g., untrusted user vs. compromised website)
  • Operates as a wrapper around standard Gym-like environments, injecting attacks dynamically during the observation step without altering the underlying agent or task logic
Architecture
Architecture Figure Figure 1
Conceptual diagram of the DoomArena framework interacting with an agent loop.
Evaluation Highlights
  • Combined threat models (malicious user + malicious catalog) significantly amplify risks, raising attack success rates compared to single-vector attacks in τ-Bench
  • Guardrail defenses (LlamaGuard) failed to detect code interpreter abuse attacks in τ-Bench, showing near-zero effectiveness
  • Visual pop-up attacks on OSWorld achieved 78.6% success rate against GPT-4o agents, significantly higher than the 22.9% success rate against Claude-3.7-Sonnet
Breakthrough Assessment
9/10
Highly significant contribution. It shifts agent security from static datasets to dynamic, environment-agnostic testing. Its ability to plug into major benchmarks (BrowserGym, OSWorld) ensures immediate practical impact.
×