Evaluation Setup
Dialogue benchmarks are generated from four real-world knowledge graphs (KGs): DBpedia, YAGO-3, YAGO-4, and DBLP.
Benchmarks:
- DBpedia (General Domain KG)
- YAGO (3 & 4) (General Domain KG)
- DBLP (Academic/Scientific KG)
Metrics:
- Success Rate (valid dialogues generated)
- Relevance (human eval)
- Correctness (human eval)
- Coherence (human eval)
- Processing Time
- Statistical methodology: Not explicitly reported in the paper
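The two quantitative metrics above can be sketched in a few lines. This is a minimal illustration, assuming that success rate is the percentage of generation attempts that yield a valid dialogue and that the human-evaluation score is the mean of the per-criterion means. The function names and the sample ratings are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Mapping, Sequence


def success_rate(valid: int, attempted: int) -> float:
    """Percentage of generation attempts that produced a valid dialogue."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return 100.0 * valid / attempted


def average_human_score(ratings: Mapping[str, Sequence[float]]) -> float:
    """Mean of the per-criterion means (assumed aggregation, not from the paper)."""
    return mean(mean(scores) for scores in ratings.values())


# Made-up ratings on a 1-5 scale, for illustration only.
example_ratings = {
    "relevance": [5, 4, 4],
    "correctness": [4, 4, 5],
    "coherence": [5, 5, 4],
}

rate = success_rate(valid=18, attempted=20)
avg = average_human_score(example_ratings)
print(f"success rate: {rate:.1f}%, average score: {avg:.2f}")
```

Averaging per criterion first gives each criterion equal weight even when the number of ratings per criterion differs. A flat mean over all ratings would be an equally plausible choice.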
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Time efficiency results demonstrate massive improvements over the rule-based baseline. | | | | |
| DBpedia (End-to-End Generation) | Processing Time | 30 hours | 10 minutes | -29 hours 50 minutes |
| Quality evaluation showing high performance across different LLMs. | | | | |
| DBpedia/YAGO/DBLP average | Success Rate | 90 | 100 | +10 |
| Human evaluation of dialogue quality. | | | | |
| Generated Dialogues | Average Score (Relevance, Correctness, Coherence) | Not reported in the paper | 4.67 (out of 5) | Not reported in the paper |
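The DBpedia processing-time figures can be checked with a short calculation. The sketch below uses only the two durations reported in the table (30 hours and 10 minutes). The variable names are illustrative.

```python
from datetime import timedelta

baseline = timedelta(hours=30)    # baseline processing time from the table
proposed = timedelta(minutes=10)  # processing time reported for this paper

saved = baseline - proposed                      # absolute time saved
reduction_pct = 100 * (1 - proposed / baseline)  # relative reduction in percent
speedup = baseline / proposed                    # how many times faster

print(saved)  # 1 day, 5:50:00, i.e. 29 hours 50 minutes
print(f"{reduction_pct:.2f}% reduction, {speedup:.0f}x faster")
# 99.44% reduction, 180x faster
```

The result agrees with the Δ of 29 hours 50 minutes and with a reduction of about 99%.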
Main Takeaways
- Chatty-Gen significantly outperforms the state-of-the-art system Maestro in time efficiency: on DBpedia, end-to-end generation drops from 30 hours to 10 minutes, a reduction of over 99%.
- The multi-stage pipeline allows open-source models (Llama-3, CodeLlama) to achieve success rates and quality scores comparable to commercial SOTA models (GPT-4o).
- The assertion-based validation successfully mitigates hallucinations, ensuring high correctness in generated SPARQL queries and dialogue content.
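
The idea of assertion-based validation can be illustrated with a small sketch. This summary does not describe the actual assertions used by Chatty-Gen, so every check below (non-empty dialogue, one query per question, questions ending in a question mark, queries containing a SELECT or ASK keyword) is an assumption chosen for illustration. The function name `failed_assertions` and the sample inputs are hypothetical as well.

```python
import re
from typing import List

# Assumed check: a usable SPARQL query contains a SELECT or ASK keyword.
QUERY_FORM = re.compile(r"\b(SELECT|ASK)\b", re.IGNORECASE)


def failed_assertions(questions: List[str], queries: List[str]) -> List[str]:
    """Return descriptions of every failed check; an empty list means accept."""
    failures = []
    if not questions:
        failures.append("dialogue has no questions")
    if len(questions) != len(queries):
        failures.append("number of questions and queries differ")
    for i, question in enumerate(questions):
        if not question.strip().endswith("?"):
            failures.append(f"question {i} does not end with '?'")
    for i, query in enumerate(queries):
        if not QUERY_FORM.search(query):
            failures.append(f"query {i} has no SELECT or ASK keyword")
    return failures


# Illustrative inputs only; not taken from the paper's benchmarks.
good = failed_assertions(
    ["Who wrote this book?", "When was the author born?"],
    ["SELECT ?a WHERE { ?book ?p ?a }", "SELECT ?d WHERE { ?a ?p ?d }"],
)
bad = failed_assertions(["Tell me about this book"], [])
print(good)  # []
print(bad)
```

In a pipeline of this kind, a non-empty failure list would typically trigger a retry or discard the candidate dialogue, so that malformed model output never reaches the benchmark.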