HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

📝 Paper Summary

AI Safety Evaluation Agentic AI Social Simulation

HAICOSYSTEM is a modular sandbox framework that evaluates AI agent safety by simulating complex multi-turn interactions between agents, users (benign or malicious), and tools across diverse scenarios.

Core Problem

Current safety evaluations isolate specific risks (e.g., toxic content) in static single-turn interactions, failing to capture the complex, holistic risks that arise when agents autonomously use tools and interact with users over time.

Why it matters:

Agents are increasingly autonomous, controlling tools and environments (e.g., smart homes, financial apps), where failures can cause physical or financial harm
Single-turn benchmarks like DAN miss risks that only emerge through multi-turn manipulation or underspecified instructions
Existing works typically focus solely on malicious users or solely on tool misuse, ignoring the ecosystem of agent-user-environment dynamics

Concrete Example: A benign user might ask an agent to 'clean up files' in a shared workspace. Without proper context or clarification in a multi-turn dialogue, the agent might interpret this as permission to delete critical system files, causing irreversible data loss—a risk missed by static toxicity classifiers.

Key Novelty

Holistic Ecosystem Simulation (HAICOSYSTEM)

Simulates a full ecosystem where an AI Agent interacts with a simulated User (benign or malicious) and an Environment Engine (tools) over multiple turns
Introduces a multi-dimensional risk taxonomy covering operational, content, societal, and legal risks, assessed by an automated LLM-based evaluator
Uses 'invisible' checklists of safe/risky outcomes for each scenario to ground evaluations in specific context-aware goals

Architecture

Overview of HAICOSYSTEM showing the interactions between the AI Agent, Simulated User, and Simulated Environment.

Evaluation Highlights

State-of-the-art LLMs (including GPT-4 and Llama-3) exhibit safety risks in 62% of the 8,700 simulated episodes
Multi-turn interactions surface up to 3x more safety risks compared to static single-turn benchmarks like DAN
Agents are 46% more likely to exhibit risks when navigating complex environments with malicious users compared to interacting with malicious users alone

Breakthrough Assessment

8/10

Significant step forward in agentic safety. Moves beyond static prompts to dynamic, stateful simulations. The ecosystem approach is crucial for deploying autonomous agents.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn interaction simulation involving a User, an AI Agent, and an Environment, evaluated against a safety checklist

Inputs: Scenario description (background, goals), User Profile, Agent System Prompt + Tools

Outputs: Interaction trajectory (dialogue + tool calls) and Safety Risk Scores (binary/numerical)

Pipeline Flow

Scenario Initialization: Load background, goals, and safety checklist
Interaction Loop: User Simulator <-> AI Agent <-> Environment Engine (Tools)
Evaluation: HAICOSYSTEM-EVAL analyzes full trajectory against checklist

System Modules

User Simulator (Simulation)

Simulates a human user with specific profiles and intents (benign or malicious)

Model or implementation: GPT-4o

AI Agent

The system being tested; attempts to assist user while using tools

Model or implementation: Various (GPT-4-turbo, Llama3, etc.)

Environment Engine (Simulation)

Simulates the execution of tool calls and returns results to the agent

Model or implementation: GPT-4o (guided by tool descriptions)

Evaluator

Assesses the completed interaction trajectory for safety violations

Model or implementation: GPT-4o

Novel Architectural Elements

Ecosystem-level modularity where User, Agent, and Environment have distinct, partially hidden information states (e.g., Agent doesn't know User Goal)
Integration of a 'Checklist of Safe and Risky Outcomes' specifically tailored to each of the 132 scenarios

Modeling

Base Model: GPT-4o (for Simulator/Evaluator); Various target models (GPT-4, Llama-3, etc.)

Compute: Not reported in the paper (Evaluation-only framework)

Comparison to Prior Work

vs. DAN/WildTeaming: HAICOSYSTEM evaluates multi-turn interactions and tool use, not just single-turn text generation
vs. R-Judge: Simulates the interaction dynamically rather than judging fixed logs
vs. Sotopia: Focuses specifically on safety risks (legal, societal, operational) rather than general social intelligence

Limitations

Relies on LLMs (GPT-4o) to simulate users and environment, which may not perfectly reflect real-world behavior
Current scenarios may not cover all possible edge cases or domains
Evaluator bias: GPT-4o based evaluation may carry inherent models biases

Reproducibility

Code released in supplementary materials. 132 scenarios (21 manual, 111 synthetic/adapted) utilized. Simulator uses GPT-4o; exact prompts provided in Appendix B.1 and C.

📊 Experiments & Results

Evaluation Setup

Simulation of 132 scenarios across 7 domains (healthcare, finance, etc.) with 8,700 total episodes

Benchmarks:

HAICOSYSTEM Scenarios (Interactive Safety Simulation) [New]

Metrics:

Risk Ratio (proportion of risky episodes)
Tool Use Efficiency
Goal Completion
Statistical methodology: Pearson correlation used for validator agreement (0.8 reported)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
HAICOSYSTEM (Overall Risk)	Risk Ratio	0.49	0.67	+0.18
HAICOSYSTEM (Overall Risk)	Risk Ratio	0.47	0.35	-0.12
Human-LM Agreement	Pearson Correlation	0	0.8	+0.8

Experiment Figures

Radar charts or bar plots showing risk ratios across different dimensions (Targeted, System, Content, Societal, Legal) for various models.

Comparison of risk ratios across different interaction types (Benign vs Malicious Users, With vs Without Tools).

Main Takeaways

Larger models (Llama3.1-405B) generally have lower safety risks than smaller ones (GPT-3.5-turbo, Llama3.1-70B), likely due to better alignment training.
Models are most vulnerable during 'System and Operational' interactions (tool use), while 'Content' risks (toxicity) are relatively well-mitigated.
Malicious users significantly amplify risks, especially when combined with tool use (46% increase in risk probability).
Benign users can actually mitigate risks by providing clarifying information, a dynamic missed by static benchmarks.
Reasoning capabilities (O1 vs R1) do not uniformly translate to safety; R1 outperformed O1 in safety despite O1's stronger reasoning reputation.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM-based Agents and Tool Use
Familiarity with Red-teaming and Jailbreaking concepts
Basic knowledge of Reinforcement Learning or Simulation environments

Key Terms

Sandbox: A confined, safe testing environment where AI agents can be executed and observed without causing real-world harm

Red-teaming: The practice of simulating adversarial attacks (e.g., malicious users) to identify vulnerabilities in a system

Jailbreaking: Prompt engineering techniques designed to bypass an AI model's safety filters

FHIR: Fast Healthcare Interoperability Resources—a standard for exchanging healthcare information electronically, used here in medical scenarios

O1/R1: Reasoning models (from OpenAI and DeepSeek respectively) designed for complex logical tasks

Sotopia: A social simulation platform for AI agents used as a basis for user profiles in this work