| Benchmark | Metric | Baseline | This Paper | Δ (vs. Baseline) |
|---|---|---|---|---|
| Results on the new When2Call benchmark show that RPO training improves decision-making accuracy over baselines and standard SFT, reaching 91.7 against comparison scores of 83.1 and 57.7. | ||||
| When2Call | Accuracy | 83.1 | 91.7 | +8.6 |
| When2Call | Accuracy | 57.7 | 91.7 | +34.0 |
| Performance on BFCL shows that When2Call training improves irrelevance detection (+40.9) while preserving tool-calling ability (AST accuracy +1.0). | ||||
| BFCL Live (Irrelevance) | Accuracy | 46.2 | 87.1 | +40.9 |
| BFCL Live (AST) | Accuracy | 84.9 | 85.9 | +1.0 |
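
The Δ column is the "This Paper" score minus the "Baseline" score. As a minimal sketch, the following recomputes each difference from the values in the table and compares it with the reported figure. The names `ROWS`, `delta`, and `check_rows` are illustrative and do not come from the paper; rounding to one decimal place is an assumption made to match the table's precision.

```python
# Rows copied from the table above:
# (benchmark, baseline score, this-paper score, reported delta)
ROWS = [
    ("When2Call", 83.1, 91.7, 8.6),
    ("When2Call", 57.7, 91.7, 34.0),
    ("BFCL Live (Irrelevance)", 46.2, 87.1, 40.9),
    ("BFCL Live (AST)", 84.9, 85.9, 1.0),
]


def delta(baseline: float, ours: float) -> float:
    """Difference in score, rounded to the table's one-decimal precision."""
    return round(ours - baseline, 1)


def check_rows(rows):
    """Return (benchmark, recomputed delta, matches reported?) for each row."""
    return [
        (name, delta(base, ours), delta(base, ours) == reported)
        for name, base, ours, reported in rows
    ]


if __name__ == "__main__":
    for name, recomputed, ok in check_rows(ROWS):
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: {recomputed:+.1f} ({status})")
```

Rounding before comparison avoids spurious mismatches from floating-point subtraction.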