Evaluation Setup
Task-oriented dialogues in Retail and Airline domains, involving database reading/writing and policy enforcement
Benchmarks:
- IntellAgent Benchmark (Synthetic) (Conversational Agent Evaluation) [New]
- tau-bench (Conversational Agent Evaluation)
Metrics:
- Success Rate (Pass/Fail based on goal completion and policy adherence)
- Pearson Correlation (between IntellAgent and tau-bench scores)
- Statistical methodology: Pearson correlation coefficient reported to validate alignment with tau-bench
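
As a rough illustration of the two metrics above, the sketch below computes a success rate from pass/fail outcomes and the Pearson correlation between two lists of per-model scores. The function names and the score values are hypothetical placeholders chosen for the example; they are not results from the paper, and the authors' actual evaluation code may differ.

```python
import math


def success_rate(outcomes):
    """Fraction of dialogues marked as passed (True) in a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, each without the common 1/n factor (it cancels).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder per-model success rates, invented for illustration only.
    synthetic_scores = [0.30, 0.45, 0.60, 0.70]  # e.g. synthetic benchmark
    reference_scores = [0.25, 0.50, 0.55, 0.75]  # e.g. curated benchmark
    print(pearson(synthetic_scores, reference_scores))
```

Each list holds one score per model, in the same model order, so a coefficient near 1 means the two benchmarks rank and space the models similarly.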
Key Results
Validation results showing that IntellAgent's synthetic evaluation strongly correlates with the manually curated tau-bench.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| tau-bench (Airline) | Pearson Correlation | 1.0 | 0.98 | 0.02 |
| tau-bench (Retail) | Pearson Correlation | 1.0 | 0.92 | 0.08 |

Main Takeaways
- Model performance declines consistently as scenario complexity (sum of policy weights) increases, but the rate of decline varies by model (e.g., Gemini-1.5-pro maintains performance longer than GPT-4o-mini)
- High correlation with the manually curated tau-bench (Pearson 0.92 on Retail, 0.98 on Airline) suggests that purely synthetic, graph-driven evaluation can serve as a reliable proxy for human-curated tests
- Policy-specific analysis reveals hidden gaps: nearly all models fail on 'User Consent' policies, a blind spot in previous benchmarks like tau-bench
- Weighted probability sampling in the policy graph balances diversity and realism better than uniform or max-weight sampling
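
The sampling comparison and the complexity measure in the takeaways can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the policy names, the per-policy complexity weights, the edge weights, the walk-until-target stopping rule, and the function names are all hypothetical. The only points taken from the summary are that scenario complexity is the sum of policy weights and that weighted probability sampling is contrasted with uniform and max-weight sampling.

```python
import random

# Hypothetical per-policy complexity weights (not from the paper).
POLICY_WEIGHT = {
    "verify_identity": 1,
    "refund_limit": 3,
    "user_consent": 2,
    "exchange_window": 4,
}

# Hypothetical edge weights: how plausibly two policies co-occur in a scenario.
EDGE_WEIGHT = {
    "verify_identity": {"refund_limit": 5.0, "user_consent": 3.0, "exchange_window": 1.0},
    "refund_limit": {"verify_identity": 5.0, "user_consent": 4.0, "exchange_window": 0.5},
    "user_consent": {"verify_identity": 3.0, "refund_limit": 4.0, "exchange_window": 2.0},
    "exchange_window": {"verify_identity": 1.0, "refund_limit": 0.5, "user_consent": 2.0},
}


def complexity(policies):
    """Scenario complexity as the sum of the selected policies' weights."""
    return sum(POLICY_WEIGHT[p] for p in policies)


def next_policy(current, visited, strategy, rng):
    """Pick the next unvisited neighbor of `current` under a sampling strategy."""
    candidates = {p: w for p, w in EDGE_WEIGHT[current].items() if p not in visited}
    if not candidates:
        return None
    names = list(candidates)
    if strategy == "uniform":
        # Ignores edge weights: diverse, but may pair implausible policies.
        return rng.choice(names)
    if strategy == "max_weight":
        # Always takes the strongest edge: plausible, but deterministic.
        return max(names, key=candidates.get)
    if strategy == "weighted":
        # Samples in proportion to edge weight: a middle ground between the two.
        return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
    raise ValueError(f"unknown strategy: {strategy}")


def sample_scenario(start, target_complexity, strategy="weighted", seed=None):
    """Walk the graph from `start` until the target complexity is reached."""
    rng = random.Random(seed)
    chosen = [start]
    while complexity(chosen) < target_complexity:
        nxt = next_policy(chosen[-1], set(chosen), strategy, rng)
        if nxt is None:  # no unvisited neighbors left
            break
        chosen.append(nxt)
    return chosen


if __name__ == "__main__":
    for strategy in ("uniform", "max_weight", "weighted"):
        scenario = sample_scenario("verify_identity", 6, strategy, seed=0)
        print(strategy, scenario, complexity(scenario))
```

In this toy setup, max-weight selection returns the same policy set on every run, uniform selection ignores how plausible a combination is, and weighted selection favors plausible combinations while still varying across seeds.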