Evaluation Setup
Simulations cover 132 scenarios across 7 domains (healthcare, finance, etc.), for 8,700 episodes in total
Benchmarks:
- HAICOSYSTEM Scenarios (Interactive Safety Simulation) [New]
Metrics:
- Risk Ratio (proportion of risky episodes)
- Tool Use Efficiency
- Goal Completion
- Statistical methodology: Pearson correlation used for validator agreement (0.8 reported)
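The two quantitative metrics above can be illustrated with a minimal Python sketch. It assumes that each episode carries a single boolean "risky" flag and that validator agreement is computed over paired numeric scores; the function names `risk_ratio` and `pearson` and the sample data are hypothetical and are not taken from the paper's code.

```python
from math import sqrt


def risk_ratio(episode_is_risky):
    """Proportion of episodes flagged as risky (assumed: one boolean per episode)."""
    flags = list(episode_is_risky)
    if not flags:
        raise ValueError("need at least one episode")
    return sum(1 for flag in flags if flag) / len(flags)


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: four episodes, two of them flagged as risky.
example_flags = [True, False, True, False]
example_ratio = risk_ratio(example_flags)

# Hypothetical paired scores from a human annotator and an LM evaluator.
human_scores = [1.0, 2.0, 3.0]
lm_scores = [2.0, 4.0, 6.0]
example_agreement = pearson(human_scores, lm_scores)
```

Under this reading, a lower risk ratio means fewer risky episodes, and a Pearson correlation closer to 1 means the automated evaluator tracks human judgments more closely.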
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| HAICOSYSTEM (Overall Risk) | Risk Ratio | 0.49 | 0.67 | +0.18 |
| HAICOSYSTEM (Overall Risk) | Risk Ratio | 0.47 | 0.35 | -0.12 |
| Human-LM Agreement | Pearson Correlation | 0 | 0.8 | +0.8 |
Main Takeaways
- Larger models (Llama3.1-405B) generally have lower safety risks than smaller ones (GPT-3.5-turbo, Llama3.1-70B), likely due to better alignment training.
- Models are most vulnerable during 'System and Operational' interactions (tool use), while 'Content' risks (toxicity) are relatively well-mitigated.
- Malicious users significantly amplify risks, especially when combined with tool use (46% increase in risk probability).
- Benign users can actually mitigate risks by providing clarifying information, a dynamic missed by static benchmarks.
- Stronger reasoning capability does not uniformly translate to safety: in the O1 vs R1 comparison, R1 outperformed O1 on safety despite O1's reputation for stronger reasoning.