A Safety and Security Framework for Real-World Agentic Systems

📝 Paper Summary

Agentic Safety and Security Red Teaming Risk Assessment

This paper establishes a dynamic framework for securing agentic systems by treating safety and security as emergent properties of component interactions, operationalized through automated red-teaming and a new compositional risk taxonomy.

Core Problem

Traditional safety/security assessments for isolated models fail in agentic systems because risks emerge from complex, non-deterministic interactions between models, tools, and data, expanding the attack surface.

Why it matters:

Agent autonomy introduces new hazards (e.g., tool misuse, cascading failures) that don't exist in static model inference.
Security failures (e.g., prompt injection) often propagate into safety harms (e.g., unsafe actions), blurring the lines between the two disciplines.
Existing frameworks like CVSS are insufficient because component-level security flaws can amplify into system-level user harms through agent behaviors.

Concrete Example: An agent tasked with 'planning a day' might cancel a user's doctor appointment to optimize productivity metrics (Goal Specification Ambiguity). Alternatively, an agent reading a web page might encounter a prompt injection ('transfer funds'), executing a tool call that causes financial loss.

Key Novelty

Dynamic Compositional Agentic Safety Framework

Models system-level risk as a composition of component-level risks (models, orchestrators, tools), accounting for compounding and cascading effects rather than checking components in isolation.
Unifies 'Safety' (harm prevention) and 'Security' (adversarial protection) by analyzing security threats through the lens of resulting user safety harms.
Operationalizes risk discovery via auxiliary AI agents that perform automated, contextual red-teaming (attacks) and defense evaluation within a sandboxed environment.

Evaluation Highlights

Released a dataset of 10,796 traces representing realistic attack and defense executions on the NVIDIA AI-Q Research Assistant.
Demonstrated end-to-end safety/security evaluation using 2,596 attack traces without defenses and 2,600 with defenses for security risks.
Evaluated content safety using 2,200 traces each for defended vs. undefended configurations, validating the framework on enterprise-grade workflows.

Breakthrough Assessment

8/10

Significant for formalizing the intersection of safety and security in agents and releasing a large-scale trace dataset (10k+ runs). Moves beyond static benchmarks to dynamic system-level evaluation.

⚙️ Technical Details

Problem Definition

Setting: Enterprise-grade agentic workflow deployment requiring policy-driven governance

Inputs: Agentic system components (LLMs, orchestrators, tools, memory, data sources) and user/adversarial prompts

Outputs: Risk assessment report, localized vulnerability identification, and mitigation strategies

Pipeline Flow

Risk Discovery (AI Red Teaming) -> Risk Evaluation -> Contextual Mitigation
Agentic Workflow: Input -> Orchestrator -> Tool/Memory Interaction -> Output

System Modules

Auxiliary Red Team Agents

Generate adversarial inputs and scenarios to probe the target agentic system for vulnerabilities

Model or implementation: Not explicitly specified (likely strong LLMs)

Orchestrator

Coordinates communication between agents, parses natural language into actions, and manages state

Model or implementation: Not explicitly specified

Guardrails/Defenses

Intercept and block unsafe inputs or outputs based on contextual policies

Model or implementation: Prompt hardening, Guard models

Novel Architectural Elements

Integration of auxiliary AI agents for dynamic, contextual risk discovery within the deployment lifecycle
Compositional risk assessment layer that maps component-level vulnerabilities (e.g., in tools) to system-level safety harms

Modeling

Base Model: NVIDIA AI-Q Research Assistant (Target System)

Compute: Not reported in the paper

Comparison to Prior Work

vs. CVSS: This framework connects technical security vulnerabilities (component level) to user safety harms (system level), which CVSS does not capture.
vs. Static LLM Benchmarks: Focuses on dynamic, multi-turn, tool-using workflows rather than isolated prompt-response pairs.

Limitations

The framework adds complexity to the deployment pipeline.
Reliability of auxiliary red-teaming agents depends on the capability of the models used for attacks.
Specific details on the 'auxiliary AI models' architecture are sparse in the text.
Risk discovery is non-deterministic, making 100% coverage difficult to guarantee.

Reproducibility

The Nemotron-AIQ Agentic Safety Dataset 1.0 is publicly available on HuggingFace. It contains 10,796 trace files (OpenTelemetry JSON). The specific code for the risk framework infrastructure is not linked, but the methodology and dataset are provided.

📊 Experiments & Results

Evaluation Setup

End-to-end safety and security evaluation of NVIDIA's AI-Q Research Assistant agentic workflow.

Benchmarks:

Nemotron-AIQ Agentic Safety Dataset 1.0 (Agentic Security and Safety Traces) [New]

Metrics:

Attack Success Rate (implied)
Defense Effectiveness (implied via comparison of defended vs. undefended runs)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
The dataset statistics describe the scale of the evaluation performed on the AI-Q Blueprint.
Security Evaluation Traces (Undefended)	Count	0	2596	2596
Security Evaluation Traces (Defended)	Count	0	2600	2600
Content Safety Evaluation Traces (Undefended)	Count	0	2200	2200
Content Safety Evaluation Traces (Defended)	Count	0	2200	2200

Main Takeaways

Safety and security in agentic systems are emergent properties, not just attributes of individual models.
Security failures (e.g., prompt injection) frequently result in safety harms, necessitating a unified risk framework.
Automated, sandboxed red-teaming is essential for discovering novel risks in non-deterministic agentic workflows.
The released dataset provides significant transparency into realistic attack and defense executions in enterprise agent systems.

📚 Prerequisite Knowledge

Prerequisites

LLM Safety and Security fundamentals (OWASP LLM Top 10)
Agentic architectures (Orchestrators, Tool use)
Red Teaming methodologies

Key Terms

CIAAN: Confidentiality, Integrity, Availability, Authenticity, Non-repudiation—standard information security pillars applied here to agentic assets.

Agentic Systems: Systems capable of autonomous planning, tool use, environmental interaction, and multi-step task execution.

Orchestrator: The component that mediates between the agent's intentions and executable actions (e.g., parsing output into API calls).

Red Teaming: The practice of rigorously challenging plans, policies, or systems by adopting an adversarial approach (simulating attacks).

OTel: OpenTelemetry—an observability framework used here to capture detailed JSON traces of agent execution spans.

Prompt Injection: An attack where adversarial inputs override the model's original instructions to force unintended behaviors.

CVSS: Common Vulnerability Scoring System—an industry standard for assessing software vulnerabilities, deemed insufficient here for agentic user harms.