Evaluation Setup
Task: Customer service interactions (Airline and Retail domains) on τ-bench
Benchmarks:
- τ-bench (Tau-bench) (Multi-turn tool-use agent evaluation)
Metrics:
- User-Sim Index (USI)
- Sørensen–Dice coefficient (Behavioral alignment)
- Expected Calibration Error (ECE)
- Mean Absolute Error (Evaluative alignment)
- Statistical methodology: Three independent batches of human annotations were collected to measure inter-annotator agreement and the stability of results
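The three standard metrics listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the Dice coefficient is computed over sets of behavior labels, ECE uses equal-width confidence bins, and MAE compares two rating lists of equal length. How the paper combines these into the User-Sim Index (USI) is not specified here, so that step is left out.

```python
def dice_coefficient(a, b):
    """Sørensen–Dice coefficient: 2 * |A ∩ B| / (|A| + |B|), with inputs treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired ratings."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, outcome in zip(confidences, outcomes):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, outcome))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the paper.
    human_behaviors = {"refund", "cancel", "rebook"}
    simulator_behaviors = {"refund", "rebook", "complain"}
    print(dice_coefficient(human_behaviors, simulator_behaviors))
    print(mean_absolute_error([4, 3, 5], [3, 3, 3]))
    print(expected_calibration_error([0.9, 0.9, 0.1, 0.1], [1, 0, 0, 0]))
```

A Dice value near 1 means the simulator exhibits nearly the same set of behaviors as humans, while lower ECE and MAE indicate ratings that are better calibrated and closer to the human reference.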
Key Results
Comparison of overall simulator faithfulness (USI) between real humans (inter-annotator agreement) and the best performing LLM simulator.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| τ-bench | User-Sim Index (USI) | 92.9 | 76.0 | -16.9 |

Evaluative gap analysis showing how LLM-based judges overestimate agent quality compared to human judges.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| τ-bench | Human-likeness Rating Overestimation | 0 | 55 | +55 |
| τ-bench | Overall Score Overestimation | 0 | 18 | +18 |
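The Δ column above reads as "This Paper" minus "Baseline". For the USI row the baseline is the human inter-annotator agreement, and for the overestimation rows it is a zero reference point. A small sketch that recomputes the deltas from the reported values (the `rows` structure and `delta` helper are illustrative names, not from the paper):

```python
# (metric, baseline, reported) triples copied from the Key Results table.
rows = [
    ("User-Sim Index (USI)", 92.9, 76.0),
    ("Human-likeness Rating Overestimation", 0, 55),
    ("Overall Score Overestimation", 0, 18),
]


def delta(baseline, reported):
    """Difference reported - baseline, rounded to one decimal to avoid float noise."""
    return round(reported - baseline, 1)


deltas = {name: delta(baseline, reported) for name, baseline, reported in rows}

for name, value in deltas.items():
    print(f"{name}: {value:+}")
```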
Main Takeaways
- Simulators create an 'easy mode' for agents: they are overly cooperative, stylistically uniform, and lack realistic frustration, so agents succeed more often with simulated users than with real humans
- Higher general model capability (e.g., GPT-5 family) does not necessarily yield more faithful user simulation or better evaluative alignment
- Rule-based rewards (binary success) are largely orthogonal to human perception of quality; humans value efficiency and interaction flow, which binary checks miss
- LLM simulators front-load information and 'quietly pivot' on errors, whereas humans reveal information gradually and push back when agents fail