
A Safety and Security Framework for Real-World Agentic Systems

Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis, Soumili Nandi, Dan Zhao, Matthew Fiedler, Julia Bazinska, Nikki Pope, Roopa Prabhu, Daniel Rohrer, Michael Demoret, Bartley Richardson
NVIDIA, Lakera AI
arXiv (2025)
Agent Benchmark

📝 Paper Summary

Agentic Safety and Security, Red Teaming, Risk Assessment
This paper establishes a dynamic framework for securing agentic systems by treating safety and security as emergent properties of component interactions, operationalized through automated red-teaming and a new compositional risk taxonomy.
Core Problem
Traditional safety and security assessments target isolated models. They fail in agentic systems, where risks emerge from complex, non-deterministic interactions between models, tools, and data, and where those interactions expand the attack surface.
Why it matters:
  • Agent autonomy introduces new hazards (e.g., tool misuse, cascading failures) that don't exist in static model inference.
  • Security failures (e.g., prompt injection) often propagate into safety harms (e.g., unsafe actions), blurring the lines between the two disciplines.
  • Existing frameworks like CVSS are insufficient because component-level security flaws can amplify into system-level user harms through agent behaviors.
Concrete Example: An agent tasked with 'planning a day' might cancel a user's doctor appointment to optimize productivity metrics (Goal Specification Ambiguity). Alternatively, an agent reading a web page might encounter a prompt injection ('transfer funds'), executing a tool call that causes financial loss.
Key Novelty
Dynamic Compositional Agentic Safety Framework
  • Models system-level risk as a composition of component-level risks (models, orchestrators, tools), accounting for compounding and cascading effects rather than checking components in isolation.
  • Unifies 'Safety' (harm prevention) and 'Security' (adversarial protection) by analyzing security threats through the lens of resulting user safety harms.
  • Operationalizes risk discovery via auxiliary AI agents that perform automated, contextual red-teaming (attacks) and defense evaluation within a sandboxed environment.
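The compositional idea in the first bullet can be illustrated with a toy calculation. The sketch below is not the paper's risk model: it assumes a linear pipeline (model, orchestrator, tool), independent failures, and made-up probabilities, and the names `Component`, `failure_prob`, `propagation`, and `system_risk` are hypothetical. It only shows why a component that looks safe in isolation can still carry substantial risk once upstream failures cascade into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    failure_prob: float  # hypothetical chance a failure originates here
    propagation: float   # hypothetical chance an upstream failure passes through


def system_risk(pipeline: list[Component]) -> float:
    """Probability that a failure is present at the end of a linear pipeline.

    Assumes independence: at each stage a failure is present if it
    originates there OR an upstream failure propagates through.
    """
    carried = 0.0
    for c in pipeline:
        carried = 1 - (1 - c.failure_prob) * (1 - carried * c.propagation)
    return carried


# Illustrative numbers only; they are not taken from the paper.
pipeline = [
    Component("model", failure_prob=0.05, propagation=1.0),
    Component("orchestrator", failure_prob=0.02, propagation=0.9),
    Component("tool", failure_prob=0.01, propagation=0.8),
]

isolated_tool_risk = pipeline[-1].failure_prob
composed_risk = system_risk(pipeline)
print(f"tool in isolation: {isolated_tool_risk:.4f}")
print(f"tool within the system: {composed_risk:.4f}")
```

Under these assumed numbers the composed risk at the tool stage is several times its isolated risk, which is the gap a component-by-component check would miss.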
Evaluation Highlights
  • Released a dataset of 10,796 traces representing realistic attack and defense executions on the NVIDIA AI-Q Research Assistant.
  • For security risks, demonstrated end-to-end evaluation on 2,596 attack traces without defenses and 2,600 with defenses.
  • For content safety, evaluated 2,200 traces each in the defended and undefended configurations, validating the framework on enterprise-grade workflows.
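As a quick arithmetic check on the counts listed above, the sketch below sums the four itemized configurations and compares them with the dataset total. The grouping keys are labels chosen for this sketch, not the dataset's schema, and the summary does not say what the remaining traces cover.

```python
TOTAL_TRACES = 10_796  # dataset size reported above

# Per-configuration counts as listed in the highlights.
itemized = {
    ("security", "undefended"): 2_596,
    ("security", "defended"): 2_600,
    ("content_safety", "undefended"): 2_200,
    ("content_safety", "defended"): 2_200,
}

itemized_total = sum(itemized.values())
not_itemized = TOTAL_TRACES - itemized_total

print(f"itemized traces: {itemized_total}")
print(f"traces not broken down in this summary: {not_itemized}")
```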
Breakthrough Assessment
8/10
Significant for formalizing the intersection of safety and security in agents and releasing a large-scale trace dataset (10k+ runs). Moves beyond static benchmarks to dynamic system-level evaluation.