Evaluation Setup
The evaluation covers five healthcare safety benchmarks, measured with accuracy, harmfulness scores, and attack success rates, among other metrics.
Benchmarks:
- SafetyBench: multiple-choice questions on physical and mental health
- MedSafetyBench: medical ethics alignment on unsafe prompts
- LLM Red-teaming: realistic medical red-teaming (safety, hallucination, privacy)
- Medical Triage: ethical decision-making in resource allocation
- MM-SafetyBench: resilience to visual manipulation (Health Consultation)
Metrics:
- Accuracy
- Harmfulness Score (lower is safer)
- Proportion of Appropriate Responses
- Attribute-Dependent Accuracy
- Attack Success Rate (ASR)
- Statistical methodology: Not explicitly reported in the paper
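As a rough illustration of how the rate-style metrics above are typically computed, here is a minimal Python sketch. The `Judgment` record and its fields are hypothetical, and the definitions used (attack success rate as the share of adversarial prompts that yield an unsafe response, proportion of appropriate responses as the share of responses judged appropriate) are common conventions assumed here, not definitions confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Judgment:
    """One evaluated response (hypothetical record, not from the paper)."""
    is_attack: bool       # the prompt was adversarial
    is_unsafe: bool       # the response was judged unsafe
    is_appropriate: bool  # the response was judged appropriate


def attack_success_rate(judgments: Sequence[Judgment]) -> float:
    """Share of adversarial prompts that produced an unsafe response."""
    attacks = [j for j in judgments if j.is_attack]
    if not attacks:
        raise ValueError("no adversarial prompts to score")
    return sum(j.is_unsafe for j in attacks) / len(attacks)


def proportion_appropriate(judgments: Sequence[Judgment]) -> float:
    """Share of all responses judged appropriate."""
    if not judgments:
        raise ValueError("no responses to score")
    return sum(j.is_appropriate for j in judgments) / len(judgments)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Share of multiple-choice predictions that match the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

Harmfulness Score and Attribute-Dependent Accuracy are left out because this summary does not state how they are scored.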
Key Results
TAO consistently outperforms single-agent and multi-agent baselines across various safety benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LLM Red-teaming | Proportion of Appropriate Responses | 0.778 | 0.842 | +0.064 |
| MedSafetyBench | Harmfulness Score | 1.32 | 1.18 | -0.14 |
| Medical Triage | Attribute-Dependent Accuracy | 40 | 60 | +20 |

Ablation studies confirm the necessity of the adaptive, tiered architecture.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MedSafetyBench | Safety Score (normalized) | 0.88 | 0.91 | +0.03 |
| SafetyBench | Error Absorption Rate | 0 | 24.3 | +24.3 |
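The Δ column is the "This Paper" value minus the "Baseline" value. The short Python sketch below recomputes it from the reported numbers and checks the direction of each change. The `higher_is_better` flags are assumptions, except for Harmfulness Score, which the Metrics list marks as lower-is-safer.

```python
# (benchmark, metric, baseline, this_paper, higher_is_better)
# Values are copied from the results table; the last flag is an assumption
# for every metric except Harmfulness Score.
ROWS = [
    ("LLM Red-teaming", "Proportion of Appropriate Responses", 0.778, 0.842, True),
    ("MedSafetyBench", "Harmfulness Score", 1.32, 1.18, False),
    ("Medical Triage", "Attribute-Dependent Accuracy", 40, 60, True),
    ("MedSafetyBench", "Safety Score (normalized)", 0.88, 0.91, True),
    ("SafetyBench", "Error Absorption Rate", 0, 24.3, True),
]


def delta(baseline, ours, ndigits=3):
    """Signed difference, rounded to suppress floating-point noise."""
    return round(ours - baseline, ndigits)


def is_improvement(baseline, ours, higher_is_better):
    """True when the change moves in the safer direction."""
    return ours > baseline if higher_is_better else ours < baseline


for benchmark, metric, base, ours, hib in ROWS:
    sign = "improved" if is_improvement(base, ours, hib) else "regressed"
    print(f"{benchmark} / {metric}: delta = {delta(base, ours):+g} ({sign})")
```

Note that the negative Δ for Harmfulness Score is an improvement, since a lower score is safer.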
Main Takeaways
- TAO's hierarchical structure effectively filters errors, absorbing up to ~24% of individual agent mistakes before they impact the final decision.
- Lower tiers (Tier 1) are critical; removing them causes the most significant safety degradation, suggesting they act as an essential first line of defense.
- Adaptive tier configuration outperforms static assignments, validating the dynamic routing mechanism.
- Descending capability ordering (strongest models first) can be as safe as using strong models everywhere, offering a 'safety-first' efficiency trade-off.
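To make the notion of "absorbing" agent mistakes concrete, here is a minimal Python sketch of one way such a rate could be computed. This summary does not give the paper's exact definition, so the definition here is an assumption: an individual agent mistake counts as absorbed when the final decision for that case is still correct. The function name and input layout are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# One case = (per-agent correctness flags, whether the final decision was correct).
Case = Tuple[Sequence[bool], bool]


def error_absorption_rate(cases: Iterable[Case]) -> float:
    """Fraction of individual agent mistakes that did not reach the final decision.

    Assumed definition (not confirmed by the paper): a mistake is 'absorbed'
    if the final decision for the same case is correct.
    """
    total_mistakes = 0
    absorbed = 0
    for agent_correct, final_correct in cases:
        mistakes = sum(1 for ok in agent_correct if not ok)
        total_mistakes += mistakes
        if final_correct:
            absorbed += mistakes
    if total_mistakes == 0:
        return 0.0  # nothing to absorb
    return absorbed / total_mistakes


example = [
    ([True, False, True], True),    # 1 mistake, absorbed
    ([False, False, True], False),  # 2 mistakes, propagated
    ([True, True, True], True),     # no mistakes
    ([False, True, True], True),    # 1 mistake, absorbed
]
print(error_absorption_rate(example))  # 2 of 4 mistakes absorbed -> 0.5
```

Under this reading, a single-agent baseline has no later stage to catch its mistakes, which would be consistent with the baseline value of 0 in the results table.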