Evaluation Setup
Conversational recommendation on real-world dialogue datasets
Benchmarks:
- DuRecDial (Conversational Recommendation)
- DuRecDial 2.0 (Bilingual Conversational Recommendation)
- MultiWOZ (Task-oriented dialogue, adapted for recommendation)
Metrics:
- Conversation Success Rate
- NDCG@10 (Recommendation Accuracy)
- Conversation Efficiency (turns to success)
- Statistical methodology: Not explicitly reported in the paper
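
The three metrics listed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes the standard log2-discounted definition of NDCG, assumes the ideal ranking is obtained by sorting the same relevance list, and assumes each conversation is logged as a success flag plus a turn count. All function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k ranked items (log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    Assumption: `relevances` holds the relevance of every candidate item,
    in the order the system ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of conversations that ended in an accepted recommendation."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def mean_turns_to_success(turns: Sequence[int], outcomes: Sequence[bool]) -> float:
    """Average number of turns, counted over successful conversations only."""
    successful = [t for t, ok in zip(turns, outcomes) if ok]
    return sum(successful) / len(successful) if successful else float("nan")
```

For example, `ndcg_at_k([0, 1])` penalizes a ranking that places the only relevant item second, while `ndcg_at_k([1, 0])` scores a perfect 1.0. Lower values of `mean_turns_to_success` indicate higher conversation efficiency.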
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| DuRecDial | Conversation Success Rate improvement | Not reported in the paper | Not reported in the paper | +2.8% |
| DuRecDial | NDCG@10 improvement | Not reported in the paper | Not reported in the paper | +1.9% |
| DuRecDial | Conversation Efficiency improvement | Not reported in the paper | Not reported in the paper | +3.2% |
Main Takeaways
- Consistent improvements across three diverse datasets (DuRecDial, DuRecDial 2.0, MultiWOZ).
- The hierarchical strategy effectively handles varying query complexities: 70% handled by rapid response, 25% by intelligent reasoning, 5% by deep collaboration.
- Adaptive coordination allows the system to balance conflicting objectives (accuracy vs efficiency) better than single-agent baselines.
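
One way to picture the three-tier hierarchical strategy is a router that sends each query to a tier based on an estimated complexity score. The sketch below is illustrative only: the summary does not describe how queries are assigned to tiers, so the complexity score in [0, 1], the two thresholds, and all names are assumptions chosen so that uniformly distributed scores reproduce the reported 70% / 25% / 5% split.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Sequence


class Tier(Enum):
    RAPID_RESPONSE = "rapid response"
    INTELLIGENT_REASONING = "intelligent reasoning"
    DEEP_COLLABORATION = "deep collaboration"


# Hypothetical cut-offs, not taken from the paper.
REASONING_THRESHOLD = 0.70
COLLABORATION_THRESHOLD = 0.95


def route(complexity: float) -> Tier:
    """Map an assumed complexity score in [0, 1] to a handling tier."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    if complexity < REASONING_THRESHOLD:
        return Tier.RAPID_RESPONSE
    if complexity < COLLABORATION_THRESHOLD:
        return Tier.INTELLIGENT_REASONING
    return Tier.DEEP_COLLABORATION


def tier_shares(scores: Sequence[float]) -> Dict[Tier, float]:
    """Fraction of queries routed to each tier."""
    if not scores:
        return {tier: 0.0 for tier in Tier}
    counts = Counter(route(s) for s in scores)
    return {tier: counts[tier] / len(scores) for tier in Tier}
```

Escalating only the hardest queries keeps the expensive tiers off the common path, which is one plausible reading of how such a design trades accuracy against efficiency.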