Evaluation Setup
The paper evaluates multi-turn dialogue generation and question answering on three standard datasets.
Benchmarks:
- MultiDoc2Dial (Goal-oriented dialogue with document grounding)
- QReCC (Open-domain question answering in conversation)
- TopiocChat (Knowledge-grounded conversation)
Metrics:
- BLEU (1, 2, 3, 4)
- ROUGE-L
- Statistical methodology: Not explicitly reported in the paper
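For readers unfamiliar with the two metric families, the sketch below shows how sentence-level BLEU-n and ROUGE-L are commonly computed. It is illustrative only: the summary does not say which implementation, tokenization, or smoothing the paper used, so whitespace tokenization, a single reference, no smoothing, and the function names `bleu` and `rouge_l` are all assumptions made here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-max_n: uniform weights, one reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # unsmoothed BLEU is zero when any n-gram order has no match
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates that are not longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Both functions return values in [0, 1]. The scores in Key Results appear to be on a 0 to 100 scale, so they would correspond to these values multiplied by 100 and averaged over a test set.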
Key Results
DH-RAG consistently outperforms baselines across three different dialogue datasets on BLEU and ROUGE metrics.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| MultiDoc2Dial | BLEU-4 | 12.8 | 17.4 | +4.6 |
| QReCC | ROUGE-L | 31.2 | 34.1 | +2.9 |
| TopiocChat | BLEU-2 | 6.4 | 12.3 | +5.9 |
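The Δ column is the absolute difference between the two scores. Because the three rows use different metrics with very different baselines, it can help to look at the relative gain as well. The snippet below recomputes Δ from the table and derives a relative gain; the relative figures are not reported in the summary and the helper name `improvements` is hypothetical.

```python
# Scores copied from the Key Results table: (benchmark, metric, baseline, this paper).
RESULTS = [
    ("MultiDoc2Dial", "BLEU-4", 12.8, 17.4),
    ("QReCC", "ROUGE-L", 31.2, 34.1),
    ("TopiocChat", "BLEU-2", 6.4, 12.3),
]


def improvements(results):
    """Return (benchmark, metric, absolute delta, relative gain in percent) per row."""
    rows = []
    for benchmark, metric, baseline, ours in results:
        delta = ours - baseline
        relative = 100 * delta / baseline  # gain as a percentage of the baseline score
        rows.append((benchmark, metric, delta, relative))
    return rows


for benchmark, metric, delta, relative in improvements(RESULTS):
    print(f"{benchmark:<14} {metric:<8} {delta:+.1f} ({relative:+.1f}% relative)")
```

Absolute deltas on different metrics are not directly comparable, so the relative gain is only a rough way to put the three rows side by side.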
Main Takeaways
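- DH-RAG improves over static RAG baselines on all three benchmarks, which span goal-oriented dialogue, open-domain conversational QA, and knowledge-grounded conversation: +4.6 BLEU-4 on MultiDoc2Dial, +2.9 ROUGE-L on QReCC, and +5.9 BLEU-2 on TopiocChat.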
- Integrating dynamic history allows the model to stay coherent across longer conversations than the baselines.
- The hierarchical and clustering strategies select the relevant historical context and filter out the rest, which helps prevent the 'lost in the middle' phenomenon often seen with long context windows.
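To make the last takeaway more concrete, here is a toy sketch of a two-level history filter: past turns are grouped into clusters, the cluster closest to the current query is chosen, and only the best-matching turns inside it are kept. This is not the paper's algorithm, which the summary does not describe in detail. The bag-of-words `embed` function, the greedy clustering, the similarity `threshold`, and every function name are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_history(turns, threshold=0.3):
    """Greedy single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster is {"centroid": Counter, "turns": [str, ...]}
    for turn in turns:
        vec = embed(turn)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["turns"].append(turn)
            best["centroid"] += vec  # centroid kept as the sum of member vectors
        else:
            clusters.append({"centroid": vec, "turns": [turn]})
    return clusters


def select_history(query, turns, top_k=2, threshold=0.3):
    """Two-level filter: pick the closest cluster, then the top_k closest turns in it."""
    clusters = cluster_history(turns, threshold)
    if not clusters:
        return []
    q = embed(query)
    best = max(clusters, key=lambda c: cosine(q, c["centroid"]))
    ranked = sorted(best["turns"], key=lambda t: cosine(q, embed(t)), reverse=True)
    return ranked[:top_k]


history = [
    "my flight to paris is delayed",
    "can i rebook the paris flight",
    "what pizza toppings do you like",
    "i like mushroom pizza",
]
selected = select_history("when does the paris flight leave", history)
```

With this input the two flight-related turns form one cluster and the two pizza turns another, so only the flight turns are passed on as context. The point is that the prompt grows with the size of the relevant cluster, not with the length of the whole conversation.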