Fine-grained Alignment of Large Language Models for General Medication Recommendation without Overprescription

📝 Paper Summary

Clinical Decision Support, LLM Alignment in Healthcare

LAMO adapts LLMs for medication recommendation by processing unstructured clinical notes into structured inputs and using group-wise LoRA adapters to improve accuracy and prevent the severe overprescription seen in general LLMs.

Core Problem

General and medical LLMs exhibit severe overprescription (recommending far more drugs than necessary) and poor precision when recommending medications, while traditional systems fail to utilize rich unstructured clinical notes.

Why it matters:

Overprescription increases healthcare costs, elevates the risk of adverse drug events (approx. 3.5% of hospital admissions), and accelerates antimicrobial resistance
Existing systems rely on ID-based encoding and structured EHR data, missing 80% of clinical information residing in unstructured notes
Current LLMs lack nuanced understanding of drug-disease constraints, leading to 'shotgun' prescribing behavior

Concrete Example: When tasked with recommending medications for a patient, GPT-4 suggests over 80 medications on average, whereas practicing physicians prescribe approximately 23 (Ground Truth). ChatGLM3 frequently fails to generate any recommendations.

Key Novelty

Language-Assisted Medication recOmmendation (LAMO)

Mixture-of-Experts style adaptation where medications are clustered into groups, and a dedicated LoRA (Low-Rank Adaptation) adapter is trained for each group to capture specific pharmacological patterns
Pre-processing pipeline using GPT-3.5 to extract and summarize unstructured clinical notes (History, Allergies) into concise, structured text inputs for the recommendation model
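The grouping idea can be pictured as a lookup from candidate medication to group to adapter. The sketch below is only illustrative: this summary does not state how medications are clustered or how many groups are used, so the mapping `MED_TO_GROUP`, the group names, and the `lora_` naming scheme are all hypothetical.

```python
from collections import defaultdict

# Hypothetical medication-to-group assignment. The actual clustering
# criterion and number of groups are not given in this summary.
MED_TO_GROUP = {
    "metoprolol": "cardiovascular",
    "furosemide": "cardiovascular",
    "vancomycin": "anti_infective",
    "cefazolin": "anti_infective",
    "insulin": "endocrine",
}


def build_groups(med_to_group):
    """Invert the mapping so each group lists the medications it covers."""
    groups = defaultdict(list)
    for med, group in med_to_group.items():
        groups[group].append(med)
    return dict(groups)


def adapter_for(medication, med_to_group):
    """Return the (hypothetical) name of the adapter handling this candidate."""
    return f"lora_{med_to_group[medication]}"


groups = build_groups(MED_TO_GROUP)
```

Under this assumption, every candidate is routed to exactly one adapter, so each adapter only has to learn the prescribing patterns of its own group.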

Architecture

The data processing pipeline and fine-tuning architecture of LAMO

Evaluation Highlights

Outperforms existing methods by more than 10% in internal validation on the MIMIC-III dataset
Drastically reduces overprescription compared to GPT-4 (which suggests roughly 3.5x the volume of actual physicians, over 80 versus about 23 medications per patient)
Demonstrates strong generalization across temporal shifts (MIMIC-IV) and external multi-center data (eICU)

Breakthrough Assessment

7/10

Addresses a critical safety failure (overprescription) in applying LLMs to healthcare. The group-wise LoRA approach is a practical architectural innovation for handling large output spaces (medications) efficiently.

⚙️ Technical Details

Problem Definition

Setting: Binary classification for medication prescription based on clinical text context

Inputs: Structured clinical context (History of Present Illness, Past Medical History, Allergies, Medications on Admission) and a candidate medication

Outputs: Binary label indicating whether the candidate medication should be prescribed
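To make the problem framing concrete, the sketch below shows how per-candidate binary decisions could be collected into a recommendation list. `PatientContext`, `recommend`, and `toy_classify` are hypothetical names; the toy rule merely stands in for the fine-tuned model and is not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PatientContext:
    history_of_present_illness: str
    past_medical_history: str
    allergies: str
    medications_on_admission: str


def recommend(
    context: PatientContext,
    candidates: List[str],
    classify: Callable[[PatientContext, str], bool],
) -> List[str]:
    """Keep every candidate for which the binary classifier answers True."""
    return [med for med in candidates if classify(context, med)]


def toy_classify(context: PatientContext, medication: str) -> bool:
    """Toy stand-in rule (an assumption): continue admission medications
    unless the drug name appears in the allergy field."""
    name = medication.lower()
    allergic = name in context.allergies.lower()
    on_admission = name in context.medications_on_admission.lower()
    return on_admission and not allergic


context = PatientContext(
    history_of_present_illness="chest pain",
    past_medical_history="hypertension",
    allergies="penicillin",
    medications_on_admission="aspirin, metoprolol",
)
result = recommend(
    context, ["aspirin", "penicillin", "metoprolol", "warfarin"], toy_classify
)
```

Because each candidate is scored on its own, the length of the final list (#Med) is simply the number of positive decisions, which is what makes overprescription directly measurable in this setting.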

Pipeline Flow

Data Extraction: GPT-3.5 extracts spans from raw notes → Summarization
Instruction Formatting: Concatenate extracted fields + Candidate Medication
Inference: LLaMA-2 + Group-Specific LoRA Adapter → Binary Decision
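The instruction-formatting step can be sketched as plain string assembly. The prompt wording, the `FIELD_ORDER` constant, the "None recorded" placeholder, and the yes/no parsing rule are assumptions made for illustration; the authors' actual prompts are in their supplementary material and repository.

```python
# Field names follow the inputs listed in this summary; their order and the
# question wording below are assumptions.
FIELD_ORDER = [
    "History of Present Illness",
    "Past Medical History",
    "Allergies",
    "Medications on Admission",
]


def build_instruction(fields: dict, candidate: str) -> str:
    """Concatenate the extracted fields and the candidate into one prompt."""
    lines = [f"{name}: {fields.get(name, 'None recorded')}" for name in FIELD_ORDER]
    lines.append(f"Candidate medication: {candidate}")
    lines.append("Should this medication be prescribed? Answer yes or no.")
    return "\n".join(lines)


def parse_decision(model_output: str) -> bool:
    """Map the model's text answer to a binary decision."""
    return model_output.strip().lower().startswith("yes")


prompt = build_instruction({"Allergies": "penicillin"}, "metoprolol")
```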

System Modules

Data Extractor

Parse raw discharge summaries into structured components (HPI, PMH, Allergies, MoA)

Model or implementation: GPT-3.5 (standardized prompts)

Medication Recommender

Decide if a specific medication fits the patient profile

Model or implementation: LLaMA-2-7B with Group-wise LoRA

Novel Architectural Elements

Group-wise LoRA architecture: distinct Low-Rank Adapters are assigned to different clusters of medications to allow specialized decision-making without full model fine-tuning
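A minimal numerical sketch of the idea, assuming the standard LoRA formulation (frozen weight W plus a scaled low-rank update B·A, with B initialized to zero) and one (A, B) pair per medication group. The class name `GroupwiseLoRALinear`, the group names, and the small dimensions are hypothetical; this is not the authors' implementation.

```python
import numpy as np


class GroupwiseLoRALinear:
    """One frozen linear layer shared by all groups, plus one LoRA pair per group."""

    def __init__(self, weight, group_names, rank=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight  # frozen base weight, shape (d_out, d_in)
        self.scale = alpha / rank  # standard LoRA scaling factor
        d_out, d_in = weight.shape
        self.adapters = {
            name: {
                "A": rng.normal(0.0, 0.01, size=(rank, d_in)),  # trainable
                "B": np.zeros((d_out, rank)),  # trainable, zero at init
            }
            for name in group_names
        }

    def forward(self, x, group):
        """Compute W x + (alpha / rank) * B_g (A_g x) for the selected group."""
        adapter = self.adapters[group]
        update = adapter["B"] @ (adapter["A"] @ x)
        return self.weight @ x + self.scale * update


layer = GroupwiseLoRALinear(
    np.eye(4), ["cardiovascular", "anti_infective"], rank=2, alpha=4
)
x = np.ones(4)
out_at_init = layer.forward(x, "cardiovascular")
```

Since B starts at zero, every group initially reproduces the base model; training one group's adapter changes outputs only for candidates routed to that group.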

Modeling

Base Model: LLaMA-2-7B

Training Method: Supervised Fine-Tuning (Instruction Tuning) with Group-wise LoRA

Adaptation: LoRA (rank=8, alpha=16, dropout=0.05)

Trainable Parameters: LoRA weights on the target modules q_proj and v_proj only
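As a rough size check, the number of trainable weights per adapter follows from the LoRA rank and the shapes of the targeted projections. The calculation below assumes the publicly documented LLaMA-2-7B dimensions (hidden size 4096, 32 decoder layers, square q_proj and v_proj), which are not stated in this summary; the number of groups is also not stated, so the total across all adapters is left out.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * (d_in + d_out)


# Assumed LLaMA-2-7B dimensions (not given in this summary).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

RANK = 8  # from the adaptation settings above
TARGET_MODULES = ["q_proj", "v_proj"]

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
per_adapter = per_module * len(TARGET_MODULES) * NUM_LAYERS
print(f"{per_adapter:,} trainable parameters per adapter")
```

Under these assumptions each adapter holds about 4.2 million parameters, well under 0.1% of the 7B base model.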

Training Data:

MIMIC-III split: 4:1:1 (train/val/test)
Input includes History of Present Illness, Past Medical History, Allergies, Medications on Admission
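A 4:1:1 split can be sketched as a seeded shuffle followed by slicing. Whether the authors split by patient, visit, or note is not stated here, so the unit of splitting, the `split_4_1_1` helper, and the seed are assumptions.

```python
import random


def split_4_1_1(ids, seed=0):
    """Shuffle identifiers and slice them into train/val/test at a 4:1:1 ratio."""
    ids = sorted(ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 4 // 6
    n_val = len(ids) // 6
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_4_1_1(range(600))
```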

Key Hyperparameters:

learning_rate: 5e-4
batch_size: 64
scheduler: inverse square root
inference_temperature: 0
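For reference, an inverse square root schedule is commonly implemented with a linear warmup followed by decay proportional to 1/sqrt(step). The sketch below uses the learning rate listed above; the warmup length of 100 steps and the function name `inverse_sqrt_lr` are assumptions, since the summary does not report them.

```python
import math


def inverse_sqrt_lr(step: int, base_lr: float = 5e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to base_lr, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * math.sqrt(warmup_steps / step)


schedule = [inverse_sqrt_lr(s) for s in (50, 100, 400)]
```

With these settings the rate peaks at 5e-4 when warmup ends and has halved by step 400. An inference temperature of 0 corresponds to greedy decoding, so the yes/no decision for a given prompt is deterministic.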

Compute: Supported by Supercomputing Center of USTC (specific GPU hours not reported)

Comparison to Prior Work

vs. SafeDrug/MoleRec: LAMO utilizes unstructured clinical notes via LLMs rather than just structured ID codes
vs. GPT-4/MedAlpaca: LAMO employs fine-grained alignment (LoRA) to curb the massive overprescription tendency of general/medical LLMs
vs. RAREMed: LAMO generalizes across broader medication categories using group-wise adapters rather than focusing on rare disease encoding

Limitations

Relies on Supervised Fine-Tuning, which may encourage rote memorization of drug-knowledge pairs
Does not explicitly model adverse drug-drug interactions (DDIs) or synergistic effects beyond what is in the training data
Scalability issues for very large drug vocabularies due to the need for supervised data coverage
Currently treats each drug decision largely independently (binary classification) rather than optimizing the list as a coherent set

Reproducibility

Code: https://github.com/zzhUSTC2016/LAMO

Prompts for data extraction (Supplementary Table S2) and instruction tuning are provided. The MIMIC-III/IV and eICU datasets are restricted-access (PhysioNet credentialing required).

📊 Experiments & Results

Evaluation Setup

Medication recommendation based on patient clinical history

Benchmarks:

MIMIC-III (Internal Validation, ICD-9 coding)
MIMIC-IV (Temporal Validation, ICD-10 coding)
eICU (External Validation, Multi-center)
PrimeKG (Knowledge Graph relation testing)

Metrics:

F1 score
#Med (Average number of recommended medications per patient)
Jaccard Index
Precision/Recall
Statistical methodology: Not explicitly reported in the paper
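The set-based metrics listed above can be computed per patient as follows. This is a generic sketch using the standard definitions; how the paper averages across patients is not stated here, and the function name `set_metrics` and the example drug lists are hypothetical.

```python
def set_metrics(predicted, actual):
    """Precision, recall, F1, Jaccard, and #Med for one patient's medication set."""
    predicted, actual = set(predicted), set(actual)
    overlap = len(predicted & actual)
    union = len(predicted | actual)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    jaccard = overlap / union if union else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "jaccard": jaccard,
        "num_med": len(predicted),
    }


scores = set_metrics(
    predicted=["aspirin", "metoprolol", "heparin", "insulin"],
    actual=["aspirin", "metoprolol", "furosemide"],
)
```

These definitions show why overprescription hurts: adding more candidates can only raise recall, while every incorrect addition lowers precision and Jaccard and inflates #Med.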

Key Results

Baseline analysis reveals that general purpose and medical LLMs struggle significantly with precision and overprescription compared to ground truth.
Benchmark	Metric	Baseline	This Paper	Δ
MIMIC-III	F1 score	0.3542	Not reported in the paper	Not reported in the paper
MIMIC-III	#Med (Avg Recommendations)	22.93	Not reported in the paper	Not reported in the paper
Ablation studies confirm the necessity of processing clinical notes and using concise titles.
Benchmark	Metric	Baseline	This Paper	Δ
MIMIC-III	Precision	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Comparison of F1 scores and average number of recommended medications (#Med) across LLMs and LAMO

Ablation studies on input factors and LoRA group counts

Main Takeaways

General LLMs (GPT-4, LLaMA) tend to 'shotgun' prescribe; GPT-4 recommends over 80 drugs on average where physicians prescribe about 23, creating a safety hazard.
LAMO outperforms traditional ID-based methods (SafeDrug, MoleRec) by leveraging unstructured text, which supports the value of clinical notes for this task.
The model generalizes well to new coding standards (ICD-10 in MIMIC-IV) and new hospitals (eICU), unlike ID-based models which are often brittle to coding changes.
Group-wise LoRA allows the model to handle diverse medication categories efficiently without the computational cost of full fine-tuning.

📚 Prerequisite Knowledge

Prerequisites

Basics of Electronic Health Records (EHR)
Low-Rank Adaptation (LoRA) for LLMs
Instruction Tuning

Key Terms

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that injects small trainable matrices into a frozen model

EHR: Electronic Health Records—digital versions of patients' paper charts

MIMIC-III: Medical Information Mart for Intensive Care III—a widely used public critical care dataset

Overprescription: The practice of prescribing more medication than is clinically necessary

Instruction Tuning: Fine-tuning a language model on datasets formatted as instructions (input) and desired responses (output)

ICD-9/10: International Classification of Diseases—standard diagnostic tools for epidemiology, health management, and clinical purposes

PrimeKG: A comprehensive biomedical knowledge graph used to test the model's understanding of disease-drug relations