Comparison of different LLMs acting as the agent in ALRM, evaluated in both Tool-as-Policy (TaP) and Code-as-Policy (CaP) modes.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ALRM Benchmark | Success Rate (TaP) | Not reported in the paper | 93.5 | Not reported in the paper |
| ALRM Benchmark | Success Rate (CaP) | Not reported in the paper | 92.6 | Not reported in the paper |
| ALRM Benchmark | Success Rate (CaP) | 84.3 | 84.3 | 0.0 |