SFT: Supervised Fine-Tuning—training a model on input-output pairs to teach it how to follow instructions or format responses
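The SFT objective above is, in essence, next-token cross-entropy computed only over the response tokens. A minimal sketch, with toy logits and a hypothetical prompt mask (none of these values come from the paper):

```python
import numpy as np

# Toy SFT loss: cross-entropy over target tokens, with prompt positions
# masked out so only the response contributes to the loss.
logits = np.array([            # model outputs, one row per token position
    [2.0, 0.1, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 2.0, 0.1, 0.1],
])
targets = np.array([0, 1, 2])       # gold next-token ids
mask = np.array([0.0, 1.0, 1.0])    # 0 = prompt token (ignored), 1 = response token

# log-softmax, then pick the log-probability of each gold token
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
token_nll = -log_probs[np.arange(len(targets)), targets]

# average negative log-likelihood over response tokens only
loss = (token_nll * mask).sum() / mask.sum()
print(float(loss))
```

Masking the prompt is a common convention; some SFT setups instead train on all tokens of the concatenated pair.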
RLHF: Reinforcement Learning from Human Feedback—a method to align models with human values by training a reward model on human preferences and optimizing the policy using PPO
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to update the language model's policy to maximize the reward score while constraining each update so the new policy does not deviate too far from the previous one
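The "without deviating too far" constraint in PPO is implemented by clipping the probability ratio between the new and old policies. The standard clipped surrogate objective (general PPO formulation, not specific to this paper) is:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\big( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big)
  \right]
```

Here \(\hat{A}_t\) is the advantage estimate and \(\epsilon\) the clip range. In RLHF pipelines the reward being maximized is typically the reward-model score minus a KL penalty against the frozen reference (SFT) policy, which further discourages drift.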
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices
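The rank-decomposition idea behind LoRA can be shown in a few lines: the frozen weight W is left untouched, and the forward pass adds a trainable low-rank correction BA. A minimal NumPy sketch with illustrative shapes (dimensions and init scales are assumptions, not values from the paper):

```python
import numpy as np

# LoRA forward pass sketch: h = W x + B (A x), where only A and B train.
d, k, r = 16, 16, 4                       # layer dims; rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))           # pre-trained weight, frozen
A = rng.standard_normal((r, k)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-init

x = rng.standard_normal(k)
h = W @ x + B @ (A @ x)                   # adapted output

# Because B starts at zero, the adapted model initially matches the
# frozen model exactly; training then moves only A and B.
assert np.allclose(h, W @ x)
print(h.shape)  # → (16,)
```

Only d*r + r*k parameters per adapted matrix are trained instead of d*k, which is what makes the method parameter-efficient.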
CMeKG: Chinese Medical Knowledge Graph—a structured knowledge base used here to validate entities and filter low-quality data
Ziya-LLaMA: The foundational Chinese LLM (based on LLaMA) used as the starting point for this paper's medical adaptation
Proactive Inquiry: The ability of the model to ask follow-up questions to clarify a user's condition before giving a diagnosis, mimicking a real doctor
CMtMedQA: The custom dataset created in this paper containing 70,000 real-world multi-turn medical dialogues