| Benchmark | Metric | Baseline | This Paper | Δ (This Paper − Baseline) |
|---|---|---|---|---|
| Comparison of overall simulator faithfulness (USI) between real humans (inter-annotator agreement) and the best-performing LLM simulator. | | | | |
| τ-bench | User-Sim Index (USI) | 92.9 | 76.0 | -16.9 |
| Evaluative gap analysis showing how LLM-based judges overestimate agent quality compared to human judges. | | | | |
| τ-bench | Human-likeness Rating Overestimation | 0 | 55 | +55 |
| τ-bench | Overall Score Overestimation | 0 | 18 | +18 |
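
The Δ column appears to be the "This Paper" value minus the "Baseline" value. The sketch below re-derives it from the table rows as a sanity check; the names `rows` and `delta` are illustrative assumptions and do not come from the paper.

```python
# Rows copied from the table: (metric, baseline, this_paper, reported_delta).
rows = [
    ("User-Sim Index (USI)", 92.9, 76.0, -16.9),
    ("Human-likeness Rating Overestimation", 0, 55, 55),
    ("Overall Score Overestimation", 0, 18, 18),
]


def delta(baseline, this_paper):
    """Signed difference, rounded to one decimal place to match the table."""
    return round(this_paper - baseline, 1)


for metric, baseline, this_paper, reported in rows:
    computed = delta(baseline, this_paper)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{metric}: {computed:+g} ({status})")
```

Rounding to one decimal place avoids spurious mismatches from floating-point subtraction on the USI row.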