LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination

📝 Paper Summary

Memory organization, Conversational personalization

MaLP enhances personalized medical assistants by combining a neuroscience-inspired dual-process memory system (Working, Short-Term, Long-Term) with parameter-efficient fine-tuning to adapt to patient preferences.

Core Problem

Existing medical assistants lack personalization. Standard dictionary-based memory modules are inflexible, and fully fine-tuning an LLM for every patient is prohibitively expensive.

Why it matters:

Patients have diverse communication preferences (e.g., concise vs. detailed explanations) that generic models ignore
Dictionary-based memories (key-value pairs of mistakes/feedback) are rigid and rely heavily on retrieval accuracy
Catastrophic forgetting occurs when adapting models to the medical domain without careful regularization

Concrete Example: A diabetes patient preferring concise advice might receive a lengthy technical explanation about glucose tests from a generic model because the model cannot effectively recall and apply the patient's preference for brevity.

Key Novelty

Dual-Process enhanced Memory (DPeM)

Mimics human memory using three tiers: Working Memory (buffer), Short-Term Memory (STM), and Long-Term Memory (LTM), managed by 'Rehearsal' and 'Executive' processes
Uses a frequency-based promotion mechanism where information frequently accessed in STM is automatically transferred to LTM
Combines this memory structure with Low-Rank Adaptation (LoRA) to fine-tune the generator for user-specific nuances without retraining the full model

Architecture

The Dual-Process enhanced Memory (DPeM) mechanism and MaLP framework.

Evaluation Highlights

Achieves a 7% relative improvement over existing memory structures (claimed in the abstract)

Breakthrough Assessment

6/10

Proposes a biologically plausible memory architecture (DPeM) that moves beyond simple vector stores and integrates it with PEFT. However, this summary relies on the claims made in the abstract, because the results section is truncated in the source text.

⚙️ Technical Details

Problem Definition

Setting: Multi-round dialogue generation where an LLM produces personalized response y given query x, history D, and memory M

Inputs: New query x and historical dialogues D

Outputs: Personalized response y

Pipeline Flow

Rehearsal Process: Coordinator extracts notes from dialogue -> Working Memory
Executive Process: Frequency check promotes items from STM -> LTM
Retrieval: Hybrid retrieval (Distance-based for STM, Semantic for LTM)
Generation: LoRA-tuned LLM generates response using retrieved context
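
The memory side of the flow above can be sketched as a toy data structure. This is a minimal illustration, not the authors' implementation: the class name DPeMSketch, the method names, the list-based storage, the step that moves Working Memory notes into STM, and the "at least theta accesses" promotion rule are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DPeMSketch:
    """Toy three-tier memory; names and data layout are illustrative assumptions."""
    theta: int = 2  # hypothetical promotion threshold (access count)
    working: list = field(default_factory=list)
    stm: list = field(default_factory=list)
    ltm: list = field(default_factory=list)
    hits: Counter = field(default_factory=Counter)

    def rehearse(self, notes):
        """Rehearsal: Coordinator notes enter Working Memory, then move to STM (assumed)."""
        self.working.extend(notes)
        for note in self.working:
            if note not in self.stm and note not in self.ltm:
                self.stm.append(note)
        self.working.clear()

    def access(self, note):
        """Record one retrieval of an STM note."""
        if note in self.stm:
            self.hits[note] += 1

    def executive(self):
        """Executive: promote STM notes accessed at least theta times to LTM (assumed rule)."""
        promoted = [n for n in self.stm if self.hits[n] >= self.theta]
        for note in promoted:
            self.stm.remove(note)
            self.ltm.append(note)
        return promoted


memory = DPeMSketch(theta=2)
memory.rehearse(["prefers concise answers", "asked about glucose tests"])
memory.access("prefers concise answers")
memory.access("prefers concise answers")
promoted = memory.executive()
print(promoted)    # notes moved to LTM
print(memory.stm)  # notes still in STM
```

In this sketch the preference note is accessed twice and is promoted, while the glucose-test note stays in STM. Whether the paper counts accesses this way, or resets counts after promotion, is not stated in the available text.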

System Modules

Coordinator

Learns from dialogue content and summarizes notes for the working memory

Model or implementation: Powerful LLM (e.g., ChatGPT)

Short-Term Memory (STM) Retriever (Retrieval)

Retrieves recent relevant knowledge using string matching

Model or implementation: Levenshtein distance calculator
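
A minimal sketch of distance-based STM retrieval, assuming notes are plain strings and the closest note by edit distance is returned. The function names retrieve_stm and levenshtein, the lowercasing step, and the top-k cutoff are hypothetical choices for illustration; the dynamic program itself is the standard Levenshtein algorithm.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def retrieve_stm(query: str, notes: list, k: int = 1) -> list:
    """Return the k notes with the smallest edit distance to the query."""
    return sorted(notes, key=lambda n: levenshtein(query.lower(), n.lower()))[:k]


stm_notes = ["glucose tests", "blood pressure"]
best = retrieve_stm("glucose test", stm_notes)
print(best)
```

Edit distance rewards near-verbatim overlap, which suits recently stored notes whose wording is still close to the current query.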

Long-Term Memory (LTM) Retriever (Retrieval)

Retrieves persistent knowledge using semantic association

Model or implementation: Encoder for semantic embeddings
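
A minimal sketch of semantic LTM retrieval by cosine similarity. The available text does not name the encoder, so embed below is a stand-in bag-of-words function, not the paper's model; retrieve_ltm and the top-k cutoff are likewise hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in encoder: bag-of-words counts (a real system would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve_ltm(query: str, notes: list, k: int = 1) -> list:
    """Return the k notes whose embeddings are most similar to the query embedding."""
    q = embed(query)
    return sorted(notes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]


ltm_notes = ["patient prefers concise advice", "patient has type 2 diabetes"]
top = retrieve_ltm("keep advice concise", ltm_notes)
print(top)
```

Swapping embed for a sentence encoder would leave the ranking logic unchanged, so only the similarity step is illustrated here.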

Generator

Generates the final personalized medical response

Model or implementation: Base LLM (e.g., LLaMA) with LoRA adapters

Novel Architectural Elements

Three-tier memory hierarchy (Working, STM, LTM) with automatic promotion based on access frequency (threshold theta)
Hybrid retrieval mechanism using Levenshtein distance for STM and Semantic embedding for LTM within the same pipeline

Modeling

Base Model: LLaMA (named in the text only as an example, 'e.g., LLaMA')

Training Method: Domain Adaptation followed by Low-Rank Adaptation (LoRA)

Objective Functions:

Purpose: Prevent catastrophic forgetting during domain adaptation.

Formally: L_S = ||V_o - V_k||_2^2 (Sample Loss: squared L2 distance between original and adapter-modified vector representations)
Purpose: Learn medical knowledge via masked token prediction.

Formally: L_K = -(1/K) * sum_{i=1..K} log p(m_i) (Knowledge Loss averaged over the K masked tokens m_i)
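
The two objectives can be made concrete with a small numeric sketch. It reads the Sample Loss as the squared L2 distance between the original and adapter-modified vectors, and the Knowledge Loss as the mean negative log-probability of the masked tokens. The function names and input values are made up for illustration, and how the paper weights or combines the two losses is not given in the available text, so they are computed separately.

```python
import math


def sample_loss(v_o, v_k):
    """L_S: squared L2 distance between original and adapted representations."""
    return sum((o - k) ** 2 for o, k in zip(v_o, v_k))


def knowledge_loss(masked_token_probs):
    """L_K: mean negative log-probability of the K masked tokens."""
    k = len(masked_token_probs)
    return -sum(math.log(p) for p in masked_token_probs) / k


l_s = sample_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])  # 0 + 0.25 + 1.0
l_k = knowledge_loss([0.5, 0.25])                     # -(ln 0.5 + ln 0.25) / 2
print(l_s, l_k)
```

A small L_S means the adapted model's representations stay close to the original ones, which is the regularizing effect the summary associates with preventing catastrophic forgetting.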

Adaptation: LoRA (Low-Rank Adaptation)
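
For readers unfamiliar with LoRA, the following sketch shows the standard low-rank update h = W x + (alpha / r) * B A x on a tiny example in pure Python. The matrix values, the rank of 1, and the helper names are illustrative assumptions; the rank and scaling used in the paper are not reported in the available text.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def lora_forward(w, a, b, x, alpha=1.0):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    r = len(a)  # rank = number of rows in the down-projection A
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [h + (alpha / r) * d for h, d in zip(base, delta)]


w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen 2x3 weight
a = [[0.5, 0.5, 0.5]]                    # rank-1 down-projection (1x3)
b_zero = [[0.0], [0.0]]                  # B starts at zero, so the adapter is a no-op
b_tuned = [[1.0], [-1.0]]                # hypothetical values after tuning
x = [1.0, 2.0, 3.0]

h_init = lora_forward(w, a, b_zero, x)
h_tuned = lora_forward(w, a, b_tuned, x)
print(h_init)   # equals W x
print(h_tuned)
```

Because only A and B change, a per-patient adapter can in principle be stored and swapped without touching the base weights, which is the resource argument the summary makes for PEFT.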

Key Hyperparameters:

rank: Not reported in the paper (text truncated)
learning_rate: Not reported in the paper (text truncated)

Comparison to Prior Work

vs. Dictionary-based: MaLP uses a dual-process dynamic memory (Working->STM->LTM) rather than a static mistake-correction dictionary
vs. Prompt-based: MaLP utilizes PEFT (LoRA) for parameter updates, claiming better performance than prompt engineering alone
vs. Full Fine-tuning: MaLP uses PEFT to reduce resource consumption while maintaining personalization capabilities

Limitations

The approach relies on an external Coordinator (e.g., ChatGPT) for summarization, introducing a dependency on powerful external models
Exact computational savings compared to full fine-tuning are not quantified in the provided text
The memory promotion threshold (theta) is a hyperparameter that may need tuning for different contexts

Reproducibility

Code: https://github.com/MatthewKKai/MaLP

The paper mentions releasing a new conversation dataset generated from an open-source medical corpus.

📊 Experiments & Results

Evaluation Setup

Not fully reported in the provided text (text truncated before experiments section).

Benchmarks:

New medical conversation dataset (Medical dialogue generation) [New]

Metrics:

Not reported in the paper (text truncated)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The abstract claims a 7% relative improvement over existing memory structures, though a detailed breakdown is unavailable because the source text is truncated.
The proposed DPeM mechanism is designed to balance user-specific preferences (via LoRA and memory) with general medical knowledge.
A new dataset was created to support research into personalized medical assistants, incorporating user preferences and history.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Parameter-Efficient Fine-Tuning (PEFT/LoRA)
Basic knowledge of Dual-Process Theory in neuroscience

Key Terms

DPeM: Dual-Process enhanced Memory—the proposed mechanism dividing memory into Working, Short-Term, and Long-Term components

MaLP: Medical Assistant with Long- and short-term memory and PEFT—the overall framework proposed in the paper

PEFT: Parameter-Efficient Fine-Tuning—methods to adapt LLMs by updating only a small subset of parameters

LoRA: Low-Rank Adaptation—a specific PEFT technique that injects trainable low-rank matrices into frozen model layers

STM: Short-Term Memory—stores relevant recent knowledge in the DPeM framework

LTM: Long-Term Memory—stores frequently accessed knowledge for a longer duration in the DPeM framework

Coordinator: A module (likely an LLM) responsible for taking notes and summarizing dialogue content into Working Memory

Levenshtein distance: A string metric for measuring the difference between two sequences, used here for STM retrieval

Catastrophic forgetting: The tendency of an artificial neural network to completely and abruptly forget previously learned information upon learning new information