Evaluation Setup
Multi-turn health dialogue generation on held-out test sets from existing benchmarks
Benchmarks:
- MedDialog-CN (Multi-turn dialogue)
- IMCS-V2 (Multi-turn dialogue)
- CHIP-MDCFNPC (Multi-turn dialogue)
- MedDG (Multi-turn dialogue)
Metrics:
- BLEU-1/2/3/4
- ROUGE-1/2/L
- PQA (Proactive Questioning Ability)
Statistical methodology: Not explicitly reported in the paper
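For readers unfamiliar with the overlap metrics listed above, the following is a minimal sketch of sentence-level BLEU-n (clipped n-gram precision with a brevity penalty, uniform weights, no smoothing) and ROUGE-L (F1 over the longest common subsequence). It is not the paper's evaluation code: the function names, the single-reference setup, the unsmoothed scoring, and the character-level tokenization (a common choice for Chinese text) are all assumptions made for illustration.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


def bleu(candidate, reference, max_n):
    """Sentence-level BLEU-max_n, uniform weights, single reference, no smoothing."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c > r else math.exp(1 - r / c)
    return brevity * math.exp(log_mean)


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Assumption: character-level tokenization; the example strings are made up.
reference = list("你发烧多久了")
candidate = list("你发烧几天了")
scores = {f"BLEU-{n}": bleu(candidate, reference, n) for n in range(1, 5)}
scores["ROUGE-L"] = rouge_l_f1(candidate, reference)
```

Published numbers are usually corpus-level and may use smoothing or a different tokenizer, so scores from this sketch are not directly comparable to reported results.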
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| MedDialog-CN | BLEU/ROUGE | Not reported in the paper | Not reported in the paper | Not reported in the paper |
Main Takeaways
- BianQue successfully balances questioning and suggestion generation, whereas baselines like ChatGLM and ChatGPT tend to give suggestions immediately.
- Polishing doctor responses with ChatGPT yields a high-quality dataset in which doctor suggestions are detailed enough for LLM training while the original questioning behavior is preserved.
- BianQue is reported to outperform general-purpose and other medical-specific LLMs on multiple Chinese health dialogue benchmarks.
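The contrast between asking questions and giving suggestions immediately can be illustrated with a rough proxy: the share of model responses that contain a question mark. This is a hypothetical heuristic for illustration only. It is not the paper's PQA definition, and the names `asks_question` and `question_rate` and the example responses are assumptions.

```python
# Assumption: a response "asks a question" if it contains an ASCII or
# full-width question mark. This is a crude proxy, not the paper's PQA metric.
QUESTION_MARKS = ("?", "？")


def asks_question(response):
    """Return True if the response contains at least one question mark."""
    return any(mark in response for mark in QUESTION_MARKS)


def question_rate(responses):
    """Fraction of responses that contain a question; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(asks_question(r) for r in responses) / len(responses)


# Made-up responses: one follow-up question, one immediate suggestion.
example_responses = ["你发烧多久了？", "建议多喝水，注意休息。"]
rate = question_rate(example_responses)
```

A punctuation check misses questions phrased without a question mark and counts rhetorical ones, so it is only a first approximation of questioning behavior.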