SFT: Supervised Fine-Tuning—training a model on input-output pairs to teach it how to follow instructions or format responses
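The SFT objective above is, in essence, next-token cross-entropy computed only over the response tokens. A minimal sketch, with toy logits and a hypothetical prompt mask (none of these values come from the paper):

```python
import numpy as np

# Toy SFT loss: cross-entropy over target tokens, with prompt positions
# masked out so only the response contributes to the loss.
logits = np.array([            # model outputs, one row per token position
    [2.0, 0.1, 0.1, 0.1, 0.1],
    [0.1, 2.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 2.0, 0.1, 0.1],
])
targets = np.array([0, 1, 2])       # gold next-token ids
mask = np.array([0.0, 1.0, 1.0])    # 0 = prompt token (ignored), 1 = response token

# log-softmax, then pick the log-probability of each gold token
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
token_nll = -log_probs[np.arange(len(targets)), targets]

# average negative log-likelihood over response tokens only
loss = (token_nll * mask).sum() / mask.sum()
print(float(loss))
```

Masking the prompt is a common convention; some SFT setups instead train on all tokens of the concatenated pair.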
RLHF: Reinforcement Learning from Human Feedback—a method to align models with human values by training a reward model on human preferences and optimizing the policy using PPO
PPO: Proximal Policy Optimization—a reinforcement learning algorithm used to update the language model's policy to maximize the reward score while constraining each update so the new policy does not deviate too far from the previous one
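The "without deviating too far" constraint in PPO is implemented by clipping the probability ratio between the new and old policies. The standard clipped surrogate objective (general PPO formulation, not specific to this paper) is:

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

L^{\mathrm{CLIP}}(\theta) =
  \mathbb{E}_t\!\left[
    \min\!\big( r_t(\theta)\,\hat{A}_t,\;
    \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big)
  \right]
```

Here \(\hat{A}_t\) is the advantage estimate and \(\epsilon\) the clip range. In RLHF pipelines the reward being maximized is typically the reward-model score minus a KL penalty against the frozen reference (SFT) policy, which further discourages drift.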
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable rank decomposition matrices
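The rank-decomposition idea behind LoRA can be shown in a few lines: the frozen weight W is left untouched, and the forward pass adds a trainable low-rank correction BA. A minimal NumPy sketch with illustrative shapes (dimensions and init scales are assumptions, not values from the paper):

```python
import numpy as np

# LoRA forward pass sketch: h = W x + B (A x), where only A and B train.
d, k, r = 16, 16, 4                       # layer dims; rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))           # pre-trained weight, frozen
A = rng.standard_normal((r, k)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                      # trainable up-projection, zero-init

x = rng.standard_normal(k)
h = W @ x + B @ (A @ x)                   # adapted output

# Because B starts at zero, the adapted model initially matches the
# frozen model exactly; training then moves only A and B.
assert np.allclose(h, W @ x)
print(h.shape)  # → (16,)
```

Only d*r + r*k parameters per adapted matrix are trained instead of d*k, which is what makes the method parameter-efficient.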
CMeKG: Chinese Medical Knowledge Graph—a structured knowledge base used here to validate entities and filter low-quality data
Ziya-LLaMA: The foundational Chinese LLM (based on LLaMA) used as the starting point for this paper's medical adaptation
Proactive Inquiry: The ability of the model to ask follow-up questions to clarify a user's condition before giving a diagnosis, mimicking a real doctor
CMtMedQA: The custom dataset created in this paper containing 70,000 real-world multi-turn medical dialogues