| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| F1 score on the DialSim dataset: this paper's method reaches 3.45, against two baselines at 2.55 and 1.18. | | | | |
| DialSim | F1 score | 2.55 | 3.45 | +0.90 |
| DialSim | F1 score | 1.18 | 3.45 | +2.27 |
| Token usage comparison: tokens per operation fall from 16,900 (baseline) to 1,200 (this paper). | | | | |
| Cost Analysis | Tokens per operation | 16,900 | 1,200 | -15,700 |
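
The Δ column gives absolute differences only. As a minimal sketch, the snippet below recomputes those differences and also derives the relative change for each row. The helper name `summarize` and the row labels are illustrative assumptions, and the percentages are derived here from the table values rather than reported by the source.

```python
def summarize(baseline, ours):
    """Return (absolute delta, relative change in percent) for one table row."""
    delta = ours - baseline
    return delta, 100.0 * delta / baseline


# Values copied from the table; the labels are only for display.
rows = [
    ("DialSim F1 vs. baseline 2.55", 2.55, 3.45),
    ("DialSim F1 vs. baseline 1.18", 1.18, 3.45),
    ("Tokens per operation", 16900, 1200),
]

for label, baseline, ours in rows:
    delta, pct = summarize(baseline, ours)
    print(f"{label}: delta={delta:+.2f}, relative={pct:+.1f}%")
```

Under these numbers, the F1 gains correspond to roughly +35.3% and +192.4%, and the token count drops by about 92.9%, or roughly 14 times fewer tokens per operation.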