| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Tool extraction performance validates the offline phase's ability to correctly identify tools across different frameworks. | ||||
| ReAct Agents (500 instances) | Precision | Not applicable | 1.0 | Not applicable |
| ReAct Agents (500 instances) | Recall | Not applicable | 0.997 | Not applicable |
| Safety policy enforcement experiments demonstrate AgenTRIM's ability to prevent policy breaches (e.g., executing a function without a required safety check). | ||||
| Custom Safety Policy Suite | Policy Breach Rate (PBR) | Not reported | 0.0 | Not applicable |
| Custom Safety Policy Suite | F1 (Safety Tool Usage) | Low (qualitative) | 0.995 | Not applicable |
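
The metrics in the table can be illustrated with a minimal sketch. It uses the standard set-based definitions of precision, recall, and F1, and it assumes that PBR is the fraction of evaluated runs in which at least one policy breach occurred; the paper's exact definition is not given here. The function names `precision_recall_f1` and `policy_breach_rate` and the sample tool names are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """Standard set-based precision, recall, and F1 for extracted tool names."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def policy_breach_rate(breached: list[bool]) -> float:
    """Assumed definition: share of runs flagged as containing a policy breach."""
    return sum(breached) / len(breached) if breached else 0.0


# Hypothetical example: the extractor finds two of three real tools and no spurious ones.
extracted = {"search", "calculator"}
ground_truth = {"search", "calculator", "send_email"}
p, r, f1 = precision_recall_f1(extracted, ground_truth)

# Hypothetical example: one breach in four runs.
pbr = policy_breach_rate([False, False, True, False])
```

Under these definitions, a precision of 1.0 means every extracted tool was a real tool, and a recall of 0.997 means almost no real tool was missed.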