MiLP: Personalized LLM response generation with parameterized user memory injection

📝 Paper Summary

Memory internalization Personalization (P13N)

MiLP injects user history into Large Language Models as parameterized memory using multiple LoRA adapters, optimized via Bayesian Optimization to balance memory capacity and generation quality.

Core Problem

Existing personalization methods either suffer from limited context windows (prompt-based) or loss of fine-grained detail during retrieval (memory-based), struggling to effectively incorporate complex user histories.

Why it matters:

Context window limits prevent full utilization of long user histories in prompt-based approaches
Retrieval-based methods often miss fine-grained details due to the nature of similarity search
Generic responses in sensitive domains like healthcare can be inappropriate if patient history is ignored or fragmented

Concrete Example: In healthcare, a patient's long-term medical trajectory contains complex interactions. A standard retriever might fetch fragmented records that provide an incorrect snapshot of disease progression, leading the LLM to give generic or unsafe advice.

Key Novelty

Parameterized Memory-injected LLM Personalization (MiLP)

Mimics bionic memory by storing user history directly in the LLM's Feed Forward Layers using multiple LoRA adapters, rather than as external text
Treats the configuration of these adapters (which layers to inject, rank size, number of adapters) as a high-dimensional search problem solved by Bayesian Optimization
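
To make the search space concrete, here is a minimal sketch of how one adapter configuration might be encoded and how its trainable-parameter budget follows from the LoRA shapes. The class name `MemoryConfig`, its fields, the choice of a single shared rank, and the example dimensions are illustrative assumptions, not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical encoding of one point in the adapter search space."""
    layers: tuple            # indices of Transformer blocks that receive adapters
    rank: int                # LoRA rank shared by all adapters (assumption)
    adapters_per_layer: int  # number of LoRA modules inserted per chosen block


def lora_param_count(cfg: MemoryConfig, d_in: int, d_out: int) -> int:
    """Trainable parameters when each adapter wraps one d_in x d_out projection.

    A single LoRA adapter holds A (rank x d_in) and B (d_out x rank),
    so it contributes rank * (d_in + d_out) parameters.
    """
    per_adapter = cfg.rank * (d_in + d_out)
    return len(cfg.layers) * cfg.adapters_per_layer * per_adapter


cfg = MemoryConfig(layers=(10, 20, 30), rank=8, adapters_per_layer=2)
# 4096 -> 11008 is the feed-forward projection shape of LLaMA2-7B,
# used here only as an illustrative size.
print(lora_param_count(cfg, d_in=4096, d_out=11008))
```

Because the budget grows linearly in rank, layer count, and adapter count, a search over these three quantities trades memory capacity against training cost.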

Architecture

The MiLP framework showing the Bayesian Optimization loop interacting with the LLM. It illustrates how the search space (layers, rank, number of LoRAs) is explored to minimize loss and maximize ROUGE-L.

Evaluation Highlights

Outperforms Text-prompt, Memory-augmented, and User-embedding baselines across AmazonQA, Reddit, and MedicalDialogue datasets
Achieves higher ROUGE-L and Persona-F1 scores on LLaMA2-13B compared to prompt-based personalization
Demonstrates superior Win Rate in human evaluation against standard generation methods

Breakthrough Assessment

7/10

Novel application of Bayesian Optimization to search the architecture of adapter-based memory injection. Addresses the 'where to store memory' problem in LLMs effectively.

⚙️ Technical Details

Problem Definition

Setting: Personalized response generation given user historical content

Inputs: User content U = {c_0, ..., c_n} (profile, history) and a current query x

Outputs: Personalized response y

Pipeline Flow

Configuration Search (Bayesian Optimization determines adapter structure)
Memory Injection (Insert LoRA adapters into FFL based on config)
Instruction Tuning (Fine-tune adapters on user data)
Inference (Generate response using injected memory)

System Modules

Bayesian Optimizer

Search for optimal memory configuration (layers, rank, count)

Model or implementation: SAAS-GP surrogate with NEHVI acquisition

Parameterized Memory

Store user history in learnable parameters

Model or implementation: Multiple LoRA adapters
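
As a rough illustration of what several LoRA adapters on one frozen projection compute, the sketch below adds each adapter's low-rank update to the frozen output. Summing the adapter outputs and the `alpha / rank` scaling follow common LoRA practice and are assumptions here; the summary does not state how MiLP combines its adapters. The function names are hypothetical.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def multi_lora_forward(W, adapters, x, alpha=1.0):
    """Compute W x + sum_k (alpha / r_k) * B_k (A_k x).

    W stays frozen; each adapter is a pair (A, B) with A of shape r x d_in
    and B of shape d_out x r. Summing adapter outputs is an assumption.
    """
    out = matvec(W, x)
    for A, B in adapters:
        rank = len(A)
        delta = matvec(B, matvec(A, x))
        out = [o + (alpha / rank) * d for o, d in zip(out, delta)]
    return out


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection (1 x 2)
B_zero = [[0.0], [0.0]]        # usual LoRA init: B starts at zero
B_trained = [[0.5], [-0.5]]    # made-up values standing in for a tuned adapter
x = [2.0, 3.0]

untouched = multi_lora_forward(W, [(A, B_zero)], x)
personalized = multi_lora_forward(W, [(A, B_trained), (A, B_trained)], x)
print(untouched, personalized)
```

With `B` at zero the adapted layer reproduces the frozen model exactly, which is why only the `A` and `B` matrices need to be trained to inject user-specific behavior.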

Instruction Tuner

Align memory-injected model with user intent

Model or implementation: Memory-injected LLM

Novel Architectural Elements

Variable-structure memory injection: The architecture of the inserted memory (which layers, how many adapters, what rank) is dynamic and determined per-task/user via optimization
Multi-LoRA integration: Explicitly supports and optimizes for multiple LoRA modules rather than a single fixed adapter

Modeling

Base Model: DialoGPT, RoBERTa, LLaMA2-7B, LLaMA2-13B

Training Method: Supervised Fine-Tuning (SFT) of LoRA adapters

Objective Functions:

Purpose: Optimize the generation probability of the target response.

Formally: CrossEntropyLoss l = -(1/N) * Σ_{i=1}^{N} log P(y_i | y_<i, U, x)

Purpose: Maximize text overlap with ground truth during search.

Formally: ROUGE-L score
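
The cross-entropy objective can be checked by hand on a tiny example. The sketch below assumes the per-token probabilities P(y_i | y_<i, U, x) of the reference tokens are already available; the function name is hypothetical and the probabilities are made up.

```python
import math


def sequence_cross_entropy(token_probs):
    """Average negative log-likelihood over the N target tokens.

    token_probs[i] is the model's probability for the reference token
    at position i, conditioned on the earlier tokens, U, and x.
    """
    n = len(token_probs)
    return -sum(math.log(p) for p in token_probs) / n


probs = [0.5, 0.25]  # made-up probabilities for a two-token response
loss = sequence_cross_entropy(probs)
print(round(loss, 4))
```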

Adaptation: LoRA (rank and active layers determined by BO)

Trainable Parameters: LoRA adapter weights only (A and B matrices)

Training Data:

AmazonQA, Reddit, MedicalDialogue datasets
Split in user-oriented manner, formatted as next user content prediction
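
One plausible reading of this data preparation is sketched below: whole users are held out so that no user appears in both splits, and each user's ordered contents are turned into (history, next content) pairs. Both the interpretation of "user-oriented" and the helper names are assumptions; the summary does not give the exact procedure.

```python
import random


def split_users(user_ids, test_fraction=0.2, seed=0):
    """Hold out whole users so that no user appears in both splits."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def next_content_examples(contents):
    """Turn one user's ordered contents into (history, target) pairs."""
    return [(contents[:i], contents[i]) for i in range(1, len(contents))]


# Toy user histories; the identifiers are placeholders.
users = {
    "u1": ["c0", "c1", "c2"],
    "u2": ["d0", "d1"],
    "u3": ["e0", "e1", "e2", "e3"],
    "u4": ["f0", "f1"],
    "u5": ["g0", "g1"],
}
train_ids, test_ids = split_users(users)
examples = next_content_examples(users["u1"])
print(examples)
```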

Key Hyperparameters:

learning_rate: 5e-4
weight_decay: 1e-4
batch_size: 8
warmup_steps: 10% of total steps
optimizer: AdamW
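
For orientation, the warmup setting can be sketched as a simple schedule. Linear warmup and a constant rate afterwards are assumptions; the summary only states the base learning rate and that warmup covers 10% of the total steps. The function name is hypothetical.

```python
def warmup_lr(step, total_steps, base_lr=5e-4):
    """Linear warmup over the first 10% of steps, then constant (assumed)."""
    warmup_steps = max(1, total_steps // 10)
    return base_lr * min(1.0, step / warmup_steps)


schedule = [warmup_lr(s, total_steps=1000) for s in (0, 50, 100, 500)]
print(schedule)
```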

Compute: Not reported in the paper

Comparison to Prior Work

vs. TpLP: Injects history into parameters instead of context window, avoiding length limits
vs. MaLP: Uses learnable parameters instead of similarity retrieval, capturing finer-grained details
vs. UeLP: Optimizes the injection architecture (layers, rank) via Bayesian Optimization rather than fixed projection [not cited in paper]

Limitations

Resource limitations prevented testing on LLMs larger than 13B
Optimization process adds computational overhead compared to static PEFT
Requires fine-tuning per user/task which may be costly at scale

Reproducibility

Code: https://github.com/MatthewKKai/MiLP

Datasets are public (AmazonQA, Reddit, MedicalDialogue). Base models are standard Hugging Face Transformers models.

📊 Experiments & Results

Evaluation Setup

Personalized response generation using user history

Benchmarks:

AmazonQA (E-commerce QA and review generation)
Reddit (Social media dialogue generation)
MedicalDialogue (Medical consultation generation)

Metrics:

ROUGE-1
ROUGE-L
Persona F1 (P-F1)
Win Rate (Human Eval)

Statistical methodology: Not explicitly reported in the paper

Key Results

No numeric results could be extracted from the available text. The abstract mentions 'significant improvements' and the introduction claims superiority over the baselines, but specific scores are absent. The text references Table 1 for dataset comparisons but not for results, so only qualitative summaries are given here.

Main Takeaways

MiLP demonstrates superiority over baselines (TpLP, MaLP, UeLP) across three datasets (AmazonQA, Reddit, MedicalDialogue).
The Bayesian Optimization strategy effectively identifies optimal memory configurations (layers, rank) that fixed PEFT approaches miss.
Human evaluation confirms that MiLP generates more personalized and appropriate responses compared to text-prompting methods.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer Feed-Forward Layers (FFL)
Parameter-Efficient Fine-Tuning (PEFT), specifically LoRA
Bayesian Optimization (BO) and Gaussian Processes

Key Terms

LoRA: Low-Rank Adaptation—a PEFT technique that freezes pre-trained weights and injects trainable rank decomposition matrices

PEFT: Parameter-Efficient Fine-Tuning—methods to adapt LLMs without retraining all parameters

SAAS-GP: Sparse Axis-Aligned Subspace Gaussian Process—a surrogate model for Bayesian Optimization suitable for high-dimensional spaces

NEHVI: Noisy Expected Hypervolume Improvement, an acquisition function for multi-objective Bayesian Optimization under noisy observations
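
The quantity behind this acquisition function is the hypervolume of the Pareto front. The sketch below computes it for two maximized objectives and measures the exact improvement from one new point. It is a deterministic simplification: NEHVI itself takes an expectation over a noisy surrogate model, which is not reproduced here. The objective values are toy numbers and the function names are hypothetical.

```python
def dominates(q, p):
    """True if q is at least as good as p in both objectives and better in one."""
    return q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    area, prev_f2 = 0.0, ref[1]
    # Sorted by the first objective descending, the second objective ascends,
    # so each point adds one horizontal strip of new area.
    for f1, f2 in sorted(front, reverse=True):
        area += (f1 - ref[0]) * (f2 - prev_f2)
        prev_f2 = f2
    return area


# Toy (objective_1, objective_2) pairs, e.g. negated loss and ROUGE-L rescaled.
observed = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
ref = (0.0, 0.0)
base = hypervolume_2d(observed, ref)
gain = hypervolume_2d(observed + [(2.5, 2.5)], ref) - base
print(base, gain)
```

A candidate configuration is attractive when it enlarges this area, that is, when it improves at least one objective without being dominated by configurations already evaluated.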

FFL: Feed Forward Layers—components of the Transformer block where memory/knowledge is hypothesized to be stored

ROUGE-L: A metric based on Longest Common Subsequence to measure text overlap
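
A minimal sketch of the metric, assuming whitespace tokenization, lowercasing, and the balanced F-measure of LCS precision and recall. Published ROUGE implementations differ in tokenization, stemming, and the weighting between precision and recall, so scores from this sketch are not comparable to the paper's.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced F-measure of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat is on the mat", "the cat sat on the mat")
print(round(score, 3))
```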

Persona F1: A metric measuring the unigram overlap between the generated response and the user's historical profile/content
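
A simplified sketch of this overlap measure, assuming whitespace tokenization and sets of unique lowercase unigrams. Stop-word filtering and other preprocessing used in published variants are omitted, and the function name is hypothetical.

```python
def persona_f1(response, persona_texts):
    """F1 between unique unigrams of the response and of the user's content."""
    resp = set(response.lower().split())
    persona = set(" ".join(persona_texts).lower().split())
    overlap = len(resp & persona)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(resp), overlap / len(persona)
    return 2 * precision * recall / (precision + recall)


p_f1 = persona_f1("I love hiking and coffee", ["I love hiking in the mountains"])
print(round(p_f1, 3))
```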