Evaluation Setup
Offline evaluation on synthetic dialogues (the ATOD dataset) and a simulated online evaluation
Benchmarks:
- ATOD Dataset (Agentic Task-Oriented Dialogue) [New]
Metrics:
- Goal Detection Accuracy
- Status Tracking Accuracy
- Dependency-Aware Goal Completion Rate (dGCR)
- Memory Recall Accuracy
- Number of Turns to Completion (NTC)
- Latency (seconds/turn)
- Token Usage
- Statistical methodology: Reported Pearson's r and Spearman's rho for metric validity correlations
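The two correlation statistics named above can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function names and the sample scores are hypothetical and do not come from the paper. Pearson's r measures linear association between a candidate metric and dGCR, while Spearman's rho applies the same formula to ranks, so it captures any monotonic relationship.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance normalised by both standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical per-dialogue scores, invented for illustration only.
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
dgcr_scores = [0.1, 0.3, 0.6, 0.6, 1.0]
print(pearson_r(metric_scores, dgcr_scores))
print(spearman_rho(metric_scores, dgcr_scores))
```

Reporting both statistics is a common robustness check: a large gap between r and rho usually points to outliers or a non-linear relationship.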
Key Results
Performance of the proposed memory-based evaluator against baselines on Goal Detection and Status Tracking accuracy.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ATOD (Complex) | Goal Detection Accuracy | 0.62 | 0.91 | +0.29 |
| ATOD (Complex) | Status Tracking Accuracy | 0.55 | 0.85 | +0.30 |
| ATOD | Average Latency (s/turn) | 180 | 25 | -155 |
| ATOD | Correlation with dGCR (Pearson r) | 0.05 | 0.88 | +0.83 |
Main Takeaways
- Memory-augmented evaluation is essential for advanced TOD: Zero-shot LLM judges degrade rapidly as dialogue complexity and length increase.
- Dependency-aware metrics are required: Traditional success rates fail to account for blocked goals in interleaved workflows.
- The proposed dual-store memory system offers a superior trade-off between accuracy and computational cost compared to full-context summarization methods like LLM-Rsum.
- Memory Recall Accuracy correlates most strongly with task success (dGCR), suggesting that 'remembering' is the bottleneck for current agents.
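To make the point about blocked goals concrete, the sketch below contrasts a naive completion rate with a dependency-aware one. This summary does not give the paper's exact dGCR formula, so the definition used here is an assumption: a goal counts as completed only if it is marked completed and every prerequisite goal is also completed, directly or transitively. The function names and the example goals are hypothetical.

```python
def naive_completion_rate(status):
    """Fraction of goals marked completed, ignoring dependencies."""
    if not status:
        return 0.0
    return sum(1 for done in status.values() if done) / len(status)


def dependency_aware_gcr(status, depends_on):
    """ASSUMED definition, not taken from the paper: a goal is credited only
    if it is marked completed and all of its prerequisites are credited too.

    status:     dict mapping goal name -> bool (marked completed)
    depends_on: dict mapping goal name -> list of prerequisite goal names
    """
    memo = {}

    def credited(goal, visiting):
        if goal in memo:
            return memo[goal]
        if goal in visiting:
            # Cyclic dependency: treat as blocked rather than recursing forever.
            return False
        chain = visiting | {goal}
        ok = status.get(goal, False) and all(
            credited(dep, chain) for dep in depends_on.get(goal, [])
        )
        memo[goal] = ok
        return ok

    if not status:
        return 0.0
    return sum(1 for goal in status if credited(goal, frozenset())) / len(status)


# Hypothetical interleaved workflow: the car rental is marked done even though
# its prerequisite licence check never completed.
status = {
    "book_flight": True,
    "book_hotel": True,
    "check_license": False,
    "rent_car": True,
}
depends_on = {
    "book_hotel": ["book_flight"],
    "rent_car": ["check_license"],
}

print(naive_completion_rate(status))             # 0.75
print(dependency_aware_gcr(status, depends_on))  # 0.5
```

Under this assumed definition the naive rate credits `rent_car` while the dependency-aware rate does not, which is the kind of gap the second takeaway describes.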