BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT

📝 Paper Summary

Medical Dialogue
Conversational Personalization

BianQue enhances health LLMs' ability to ask proactive clarifying questions by fine-tuning on a large-scale dataset of multi-turn consultations in which ChatGPT polishes the doctors' suggestions, yielding a balance between questions and suggestions.

Core Problem

Current health LLMs provide generic suggestions based on single-turn user inputs and lack the ability to conduct 'Chain of Questioning' (CoQ) to fully understand a patient's condition before advising.

Why it matters:

Real-world doctors rely on iterative inquiries to provide personalized and effective advice, a capability missing in standard LLMs
Existing datasets and models assume users clearly describe problems in one turn, ignoring the diagnostic process
Lack of questioning leads to inadequate personalization, requiring users to self-filter generic advice

Concrete Example: In a pediatric consultation, a doctor might ask 9 rounds of questions (e.g., about a baby's symptoms) before diagnosing. Current LLMs usually skip this inquiry phase and immediately provide broad, non-targeted advice.

Key Novelty

BianQueCorpus and BianQue Model

Constructs a massive dataset (BianQueCorpus) by using ChatGPT to 'polish' (rewrite/expand) doctor suggestions from raw web data while retaining the original proactive questions
Balances the training data distribution to contain roughly equal parts questioning (46.2%) and suggestions (53.8%) to prevent the model from only learning to answer
Fine-tunes ChatGLM-6B specifically to learn the 'Chain of Questioning' capability alongside providing medical advice

Architecture

The construction process of the BianQueCorpus dataset.

Evaluation Highlights

BianQueCorpus contains 2,437,190 samples with a balanced distribution of questions (46.2%) and suggestions (53.8%)
Qualitative results indicate BianQue outperforms ChatGLM-6B, ChatGPT, and DoctorGLM on four multi-turn conversation datasets (MedDialog-CN, IMCS-V2, CHIP-MDCFNPC, and MedDG)
Data cleaning improved the 'excellent rate' of the corpus from 82% to 93% before training

Breakthrough Assessment

7/10

Addresses a critical gap in medical LLMs (lack of inquiry). The methodology of using ChatGPT to polish dataset suggestions to balance the distribution is practical and effective.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn health conversation generation

Inputs: Dialogue history consisting of patient utterances and doctor utterances (up to turn N-1) + current patient utterance

Outputs: Doctor's response (either a clarifying question or a health suggestion)
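A minimal sketch of how the inputs above could be flattened into a single prompt for the model. The role labels ("Patient:", "Doctor:") and the function name are assumptions made for illustration; the exact template used by the authors is not given in this summary.

```python
from typing import List, Tuple


def build_input(history: List[Tuple[str, str]], current_patient: str) -> str:
    """Flatten prior (patient, doctor) turns plus the current patient
    utterance into one text prompt that ends where the doctor should speak."""
    lines = []
    for patient, doctor in history:
        lines.append(f"Patient: {patient}")  # hypothetical role label
        lines.append(f"Doctor: {doctor}")    # hypothetical role label
    lines.append(f"Patient: {current_patient}")
    lines.append("Doctor:")  # the model continues from here
    return "\n".join(lines)


prompt = build_input(
    [("My baby has a fever.", "How long has the fever lasted?")],
    "Since last night.",
)
print(prompt)
```

The target for this prompt would be the doctor's next utterance, which may be either a further question or a suggestion.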

Pipeline Flow

Data Collection (Outsourcing)
Data Cleaning (Regex)
Data Polishing (ChatGPT)
Instruction Fine-tuning (ChatGLM-6B)

System Modules

Data Cleaner (Data Processing)

Remove noise from raw web-crawled conversations

Model or implementation: Regular Expressions (50 patterns)
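The cleaning step can be pictured as a list of compiled patterns applied in sequence. The two patterns below are hypothetical stand-ins; the 50 regular expressions used by the authors are not listed in this summary.

```python
import re

# Hypothetical noise patterns (assumptions, not the paper's actual rules).
NOISE_PATTERNS = [
    re.compile(r"https?://\S+"),  # URLs left over from web pages
    re.compile(r"<[^>]+>"),       # leftover HTML tags
]


def clean_utterance(text: str) -> str:
    """Replace each noise match with a space, then collapse whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_utterance("See <b>this</b>  http://example.com now"))
```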

Data Polisher (Data Processing)

Expand brief doctor suggestions into detailed responses to match high-quality LLM standards

Model or implementation: ChatGPT (gpt-3.5-turbo)
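A rough sketch of the polishing logic: question turns pass through unchanged, while suggestion turns are sent to a rewriting function. Everything here is an assumption for illustration: the question heuristic, the prompt wording, and the `rewrite` callable, which stands in for a call to gpt-3.5-turbo and is replaced by a stub so the example runs offline.

```python
from typing import Callable, List


def is_question(utterance: str) -> bool:
    """Heuristic (assumption): a turn ending in a question mark is a question."""
    return utterance.rstrip().endswith(("?", "？"))


def build_polish_prompt(suggestion: str) -> str:
    """Hypothetical prompt wording; the authors' actual prompt is not given here."""
    return (
        "Rewrite the following doctor's advice so that it is more detailed "
        "and complete, without changing its medical meaning:\n" + suggestion
    )


def polish_doctor_turns(
    doctor_turns: List[str], rewrite: Callable[[str], str]
) -> List[str]:
    """Keep question turns as-is; pass suggestion turns through `rewrite`."""
    return [
        turn if is_question(turn) else rewrite(build_polish_prompt(turn))
        for turn in doctor_turns
    ]


def stub_rewrite(prompt: str) -> str:
    """Offline stand-in for the LLM call: tags the original suggestion."""
    return "[expanded] " + prompt.splitlines()[-1]


print(polish_doctor_turns(
    ["How long has the fever lasted?", "Give fluids and rest."],
    stub_rewrite,
))
```

Leaving the questions untouched is what preserves the doctors' original inquiry behavior in the polished corpus.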

BianQue Model

Generate next turn response (Question or Suggestion)

Model or implementation: ChatGLM-6B (Fine-tuned)

Modeling

Base Model: ChatGLM-6B

Training Method: Supervised Fine-Tuning (SFT)

Trainable Parameters: Not stated explicitly (LoRA vs. full fine-tuning); the reported hardware and memory usage suggest full fine-tuning or a similarly heavy adaptation

Training Data:

BianQueCorpus: 2,437,190 samples
Questions: 46.2%
Suggestions: 53.8%
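For a sense of scale, the percentages above can be converted into approximate sample counts. Because the percentages are rounded to one decimal place, the counts below are estimates rather than figures taken from the paper.

```python
TOTAL_SAMPLES = 2_437_190
QUESTION_SHARE = 0.462  # reported share of question targets

# Approximate counts implied by the rounded percentages.
question_count = round(TOTAL_SAMPLES * QUESTION_SHARE)
suggestion_count = TOTAL_SAMPLES - question_count

print(f"questions   ~ {question_count:,}")
print(f"suggestions ~ {suggestion_count:,}")
```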

Key Hyperparameters:

warmup_steps: 1000
warmup_max_lr: 5e-5
max_input_length: 1536
max_target_length: 512
batch_size: 80
global_training_steps: 25000
inference_top_p: 0.75
inference_temperature: 0.95
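A small sketch of what these numbers imply. It assumes a linear warmup to the peak learning rate (the schedule shape after warmup is not stated here), and it assumes that batch_size is the global batch size with no gradient accumulation; both are assumptions, not details confirmed by the paper.

```python
WARMUP_STEPS = 1000
WARMUP_MAX_LR = 5e-5
MAX_INPUT_LENGTH = 1536
MAX_TARGET_LENGTH = 512
BATCH_SIZE = 80            # assumed to be the global batch size
GLOBAL_TRAINING_STEPS = 25_000
CORPUS_SIZE = 2_437_190


def warmup_lr(step: int) -> float:
    """Linear warmup to WARMUP_MAX_LR, then held constant (assumed shape)."""
    return WARMUP_MAX_LR * min(1.0, step / WARMUP_STEPS)


max_sequence_tokens = MAX_INPUT_LENGTH + MAX_TARGET_LENGTH
samples_seen = BATCH_SIZE * GLOBAL_TRAINING_STEPS
corpus_fraction = samples_seen / CORPUS_SIZE

print(f"lr at step 500: {warmup_lr(500):.2e}")
print(f"max tokens per example: {max_sequence_tokens}")
print(f"samples seen: {samples_seen:,} ({corpus_fraction:.0%} of the corpus)")
```

Under these assumptions, training covers about two million samples, which is somewhat less than one full pass over BianQueCorpus.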

Compute: 8 NVIDIA A800-SXM4-80GB GPUs; Training time ~66 hours; Inference requires >14GB GPU memory

Comparison to Prior Work

vs. ChatDoctor/DoctorGLM: These models focus on single-turn QA or suggestion giving; BianQue focuses on multi-turn proactive questioning (CoQ).
vs. ChatGPT: BianQue is specifically fine-tuned for the Chinese medical domain to balance inquiry and advice, whereas ChatGPT tends to give advice immediately.

Limitations

Potential privacy risks: The model might ask for sensitive user information (age, gender) during proactive questioning.
Lack of RLHF: The current version uses only SFT; generated suggestions are not rigorously safety-checked by humans.
Accuracy guarantees: As a generative model, it cannot guarantee medical accuracy and is for academic research only, not real-world diagnosis.

Reproducibility

Code: https://github.com/scutcyr/BianQue

Code and data (BianQueCorpus) to be released. Hardware details provided (8x A800 GPUs). Training hyperparameters explicitly listed.

📊 Experiments & Results

Evaluation Setup

Multi-turn health dialogue generation on held-out test sets from existing benchmarks

Benchmarks:

MedDialog-CN (Multi-turn dialogue)
IMCS-V2 (Multi-turn dialogue)
CHIP-MDCFNPC (Multi-turn dialogue)
MedDG (Multi-turn dialogue)

Metrics:

BLEU-1/2/3/4
ROUGE-1/2/L
PQA (Proactive Questioning Ability)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MedDialog-CN	BLEU/ROUGE	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

A comparison between a real-world doctor's Chain of Questioning (CoQ) and a standard LLM's response.

Sample conversation of BianQue.

Main Takeaways

BianQue successfully balances questioning and suggestion generation, whereas baselines like ChatGLM and ChatGPT tend to give suggestions immediately.
The polishing strategy using ChatGPT allows the construction of a high-quality dataset where doctor suggestions are detailed enough for LLM training, while original questioning behavior is preserved.
The authors report that the model outperforms general-purpose and other medical-specific LLMs on multiple Chinese health dialogue benchmarks, although no per-benchmark scores are given.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs)
Instruction Fine-tuning
Supervised Learning

Key Terms

CoQ: Chain of Questioning—the process where a model/doctor asks a series of iterative questions to thoroughly understand a patient's condition

PQA: Proactive Questioning Ability—a metric defined by the authors to measure how often the model asks questions when the ground truth target is a question
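One way to read the PQA definition above is as the fraction of question targets for which the model also produced a question. The sketch below follows that reading; the question detector (presence of a question mark) and the function names are assumptions, and the authors' exact formula may differ.

```python
from typing import List


def is_question(utterance: str) -> bool:
    """Heuristic (assumption): any question mark marks a question."""
    return "?" in utterance or "？" in utterance


def pqa(predictions: List[str], references: List[str]) -> float:
    """Share of question references whose prediction is also a question."""
    pairs = [(p, r) for p, r in zip(predictions, references) if is_question(r)]
    if not pairs:
        return 0.0  # no question targets to score against
    hits = sum(1 for p, _ in pairs if is_question(p))
    return hits / len(pairs)


predictions = ["How long has it lasted?", "Drink more water.", "Get some rest."]
references = ["When did it start?", "Any cough?", "Rest well."]
print(pqa(predictions, references))
```

Here two of the three references are questions and the model asks a question for one of them, so the score is 0.5.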

BianQueCorpus: The self-constructed multi-turn health conversation dataset used to train the BianQue model

ChatGLM-6B: An open-source bilingual (Chinese-English) language model used as the backbone for BianQue

RLHF: Reinforcement Learning from Human Feedback—a training method used to align LLMs with human intent

SFT: Supervised Fine-Tuning—training a pre-trained model on specific labeled datasets to adapt it to a downstream task