On the Way to LLM Personalization: Learning to Remember User Conversations

📝 Paper Summary

Memory internalization, Conversational personalization

PLUM replaces external retrieval systems by injecting conversation history directly into LLM parameters via Low-Rank Adaptation (LoRA) finetuning on synthetic question-answer pairs.

Core Problem

Retrieval Augmented Generation (RAG) for personalization requires managing external storage and context window limits, while existing finetuning methods struggle to update models with sequential conversation history efficiently.

Why it matters:

RAG-based methods deteriorate in performance as context windows grow and require maintaining complex external databases
Personalization requires remembering holistic conversation history, not just static user facts or style preferences
Prior work lacks a streamlined approach for parametric knowledge injection that respects the sequential nature of user interactions

Concrete Example: If a user previously discussed 'travel plans to Japan', a standard model might forget this in a new session. RAG would retrieve the raw logs. PLUM instead generates questions like 'Did we discuss Japan?' (Yes) and finetunes the model to answer correctly without accessing the old logs.

Key Novelty

Pipeline for Learning User Conversations (PLUM)

Augments conversation data by generating synthetic positive (factual) and negative (out-of-scope) question-answer pairs using a teacher LLM
Injects these memories into the model using LoRA adapters trained with a custom weighted cross-entropy loss that emphasizes question and answer tokens over instructions

Architecture

The PLUM pipeline: extracting Q/A pairs from a conversation and using them to finetune a LoRA adapter

Evaluation Highlights

Achieves 81.5% accuracy on memorizing 100 conversations, comparable to the 83.5% accuracy of a standard BM25 RAG baseline
Maintains general capabilities with negligible degradation on 5-shot MMLU (65.65% for the base model vs. 64.93% for PLUM)
Demonstrates that parametric memory can perform competitively with non-parametric retrieval methods in controlled settings

Breakthrough Assessment

6/10

A strong proof-of-concept for parametric memory as an alternative to RAG. Accuracy is 2 points below the BM25 RAG baseline (81.5% vs. 83.5%), but the approach eliminates external storage requirements, marking a meaningful step toward internalized LLM memory.

⚙️ Technical Details

Problem Definition

Setting: Continual learning of user conversation history via parameter updates

Inputs: A sequence of user conversations, each written c=(p, r) (presumably a prompt p and a response r)

Outputs: A finetuned model capable of answering questions about past conversations

Pipeline Flow

Data Augmentation: Conversation → Synthetic Q/A Pairs
Filtering: Validate Q/A Pairs
Sequential Finetuning: Base Model + LoRA → Updated Memory
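
The three stages above can be sketched as a simple loop over conversations. This is a minimal illustration, not the authors' code: `QAPair`, `run_pipeline`, and the three `toy_*` functions are hypothetical names, and the toy functions only stand in for the teacher LLM, the validation prompt, and the LoRA update.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str  # expected to be "yes" or "no"


def run_pipeline(conversations: List[str],
                 augment: Callable[[str], List[QAPair]],
                 keep: Callable[[str, QAPair], bool],
                 finetune: Callable[[list, List[QAPair]], list],
                 adapter_state: list) -> list:
    """Process conversations one at a time, in arrival order."""
    for conv in conversations:
        candidates = augment(conv)                           # 1. data augmentation
        pairs = [qa for qa in candidates if keep(conv, qa)]  # 2. filtering
        adapter_state = finetune(adapter_state, pairs)       # 3. sequential update
    return adapter_state


# Toy stand-ins (assumptions for illustration only).
def toy_augment(conv: str) -> List[QAPair]:
    return [
        QAPair(f"Did we discuss {conv}?", "yes"),       # positive pair
        QAPair("Did we discuss quantum chess?", "no"),  # negative pair
        QAPair("", "maybe"),                            # malformed, should be dropped
    ]


def toy_keep(conv: str, qa: QAPair) -> bool:
    return bool(qa.question) and qa.answer in {"yes", "no"}


def toy_finetune(state: list, pairs: List[QAPair]) -> list:
    # Placeholder for a LoRA training step: just accumulate what was "learned".
    return state + pairs


memory = run_pipeline(["Japan travel", "sourdough"],
                      toy_augment, toy_keep, toy_finetune, [])
print(len(memory))
```

Passing the stages in as callables keeps the loop independent of any particular teacher model or training library.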

System Modules

Data Augmentor

Generates positive and negative questions about the conversation history

Model or implementation: Llama 3 8B Instruct (or Llama 3 70B)

Memory Adapter

Stores and recalls conversation history within model parameters

Model or implementation: Llama 3 8B Instruct with LoRA adapter

Novel Architectural Elements

Use of a custom weighted cross-entropy loss that scales the loss on question and answer tokens by lambda=10 to focus learning on content rather than prompt structure
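
A minimal sketch of how such token weighting changes the loss, assuming the loss is a weighted sum of per-token negative log-likelihoods. The function name `weighted_nll`, the boolean mask, and the toy probabilities are illustrative assumptions; the summary does not say whether the paper normalizes the sum.

```python
import math
from typing import Sequence


def weighted_nll(token_log_probs: Sequence[float],
                 is_memory_token: Sequence[bool],
                 lam: float = 10.0) -> float:
    """Sum of per-token negative log-likelihoods.

    token_log_probs[t] is the log-probability assigned to the correct token t.
    Question/answer tokens (mask True) are scaled by lam; all others use weight 1.
    """
    if len(token_log_probs) != len(is_memory_token):
        raise ValueError("one mask entry is required per token")
    return -sum((lam if m else 1.0) * lp
                for lp, m in zip(token_log_probs, is_memory_token))


# Toy sequence: two template tokens predicted with p=0.9,
# two question/answer tokens predicted with p=0.5.
log_probs = [math.log(0.9), math.log(0.9), math.log(0.5), math.log(0.5)]
mask = [False, False, True, True]

plain = weighted_nll(log_probs, mask, lam=1.0)      # ordinary cross-entropy sum
weighted = weighted_nll(log_probs, mask, lam=10.0)  # lambda = 10 as in the summary
memory_share = (10.0 * -2 * math.log(0.5)) / weighted
print(f"plain={plain:.3f} weighted={weighted:.3f} memory share={memory_share:.2%}")
```

With lambda=10, almost all of the loss (and hence the gradient) in this toy case comes from the question and answer tokens, which is the stated intent of the weighting.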

Modeling

Base Model: Llama 3 8B Instruct

Training Method: Sequential LoRA Finetuning with Teacher Forcing

Objective Functions:

Purpose: Focus the model on learning the specific memory content rather than the prompt template.

Formally: L = H(P, Q) where tokens for q_i and a_i are scaled by lambda=10, while x_sys and x_ins use standard weighting.
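
One way to spell this out, assuming teacher forcing with one-hot targets so that the cross-entropy H(P, Q) reduces to a negative log-likelihood over the T tokens of a training sequence. The per-token weight w_t and the absence of a normalizing constant are assumptions made for illustration; the summary gives only lambda=10 and the token groups.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} w_t \,\log Q_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
w_t =
\begin{cases}
  \lambda = 10, & x_t \text{ belongs to } q_i \text{ or } a_i,\\[2pt]
  1,            & x_t \text{ belongs to } x_{\mathrm{sys}} \text{ or } x_{\mathrm{ins}}.
\end{cases}
```

Setting lambda=1 recovers the standard unweighted cross-entropy.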

Adaptation: LoRA (rank=16, alpha=64) attached to all linear layers
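
For intuition about what rank=16 and alpha=64 mean, here is a small pure-Python sketch of the standard LoRA forward pass, h = Wx + (alpha/rank)·B(Ax), plus the trainable-parameter count for one layer. The tiny matrices and the 4096×4096 layer size are illustrative assumptions, not values reported in the summary.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 rank: int, alpha: float) -> Vector:
    """Frozen weight W plus the scaled low-rank update B @ A."""
    scale = alpha / rank
    base = matvec(W, x)              # frozen pre-trained path
    delta = matvec(B, matvec(A, x))  # trainable low-rank path
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A is rank x d_in and B is d_out x rank."""
    return rank * (d_in + d_out)


# Tiny rank-1 example so the arithmetic is easy to follow.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # 1 x 2
B_zero = [[0.0], [0.0]]   # 2 x 1, zero-initialized: adapter starts as a no-op
B_trained = [[1.0], [0.0]]
x = [1.0, 2.0]

print(lora_forward(W, A, B_zero, x, rank=1, alpha=4.0))     # same as W @ x
print(lora_forward(W, A, B_trained, x, rank=1, alpha=4.0))  # first entry shifted

# Settings from the summary: rank=16, alpha=64, so the update is scaled by 4.
print(64 / 16)
# Illustrative 4096 x 4096 linear layer: adapter parameters vs. full weight matrix.
print(lora_param_count(4096, 4096, 16), 4096 * 4096)
```

Because B is usually initialized to zero, the adapted model starts out identical to the base model and only drifts as the memory pairs are learned.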

Training Data:

100 conversations from OpenAssistant dataset
Augmented into 3726 positive and negative Q/A pairs
Negative samples (questions about topics NOT discussed) generated to prevent hallucination

Key Hyperparameters:

epochs: 10
batch_size: 8
lora_rank: 16
lora_alpha: 64
loss_weight_lambda: 10

Compute: Not reported in the paper

Comparison to Prior Work

vs. RAG: PLUM stores memory in parameters (LoRA) instead of external documents, avoiding context window expansion
vs. Mecklenburg et al.: PLUM applies parametric knowledge injection to sequential user conversation history rather than to general facts

Limitations

Evaluation limited to 100 conversations, which may not reflect long-term usage
Requires finetuning per conversation, which introduces latency compared to instant RAG insertion
Catastrophic forgetting remains a challenge in sequential settings
Performance (81.5%) is still slightly below strong RAG baselines (83.5%)

Reproducibility

Prompts for data generation and filtering are provided in Appendices A, B, and C. The base model (Llama 3 8B) is public. No official code repository is provided. The dataset is a subset of OpenAssistant.

📊 Experiments & Results

Evaluation Setup

Train on 100 sequential conversations, then test on held-out yes/no questions about the conversation content.

Benchmarks:

Memory Accuracy: binary (yes/no) classification on conversation history [New]
MMLU (General Knowledge)
HellaSwag (Commonsense Reasoning)

Metrics:

Accuracy (%)
Statistical methodology: Not explicitly reported in the paper
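
A minimal sketch of how accuracy on yes/no memory questions could be scored. The answer normalization (take the first word, strip punctuation, count anything unparseable as wrong) is an assumption for illustration; the summary does not describe how model outputs are mapped to labels.

```python
from typing import List


def normalize_answer(text: str) -> str:
    """Map free-form model output to 'yes', 'no', or '' if unparseable."""
    tokens = text.strip().lower().split()
    if not tokens:
        return ""
    word = tokens[0].strip(".,!?:;\"'")
    return word if word in {"yes", "no"} else ""


def accuracy_percent(predictions: List[str], gold: List[str]) -> float:
    if len(predictions) != len(gold) or not gold:
        raise ValueError("need equally sized, non-empty lists")
    correct = sum(normalize_answer(p) == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


preds = ["Yes.", "no", "Yes, we did", "I am not sure"]
gold = ["yes", "no", "no", "no"]
print(accuracy_percent(preds, gold))  # 2 of 4 correct -> 50.0
```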

Key Results

Comparison of PLUM against RAG baselines on memory retention accuracy over 100 conversations.

Benchmark	Metric	Baseline	This Paper	Δ
Memory Accuracy (100 conversations)	Accuracy (%)	83.5	81.5	-2.0
Memory Accuracy (100 conversations)	Accuracy (%)	81.5	81.5	0.0

Evaluation on general benchmarks to check for catastrophic forgetting of general capabilities.

Benchmark	Metric	Baseline	This Paper	Δ
MMLU (5-shot)	Accuracy (%)	65.65	64.93	-0.72
ARC (Challenge, 25-shot)	Accuracy (%)	59.39	58.45	-0.94

Main Takeaways

PLUM offers a viable parametric alternative to RAG for conversation history, achieving competitive accuracy (81.5% vs. 83.5%, within 2 percentage points) without external storage
Negative samples (questions about what was *not* discussed) are critical; without them, the model defaults to answering 'yes' to everything
Weighted cross-entropy loss is essential to force the model to learn the specific memory content rather than just the instruction format
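
The point about negative samples can be made concrete with a toy check: a degenerate model that always answers "yes" looks perfect on positive-only data and is only exposed once negative questions are included. The questions and the `always_yes` function below are invented for illustration and are not taken from the paper's data.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def always_yes(question: str) -> str:
    """Degenerate 'model' that ignores the question."""
    return "yes"


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    return sum(predict(q) == a for q, a in examples) / len(examples)


positives = [
    ("Did we discuss travel plans to Japan?", "yes"),
    ("Did we talk about booking flights?", "yes"),
]
negatives = [
    ("Did we discuss Mars colonization?", "no"),
    ("Did we talk about tax law?", "no"),
]

print(accuracy(always_yes, positives))              # positive-only data cannot penalize it
print(accuracy(always_yes, positives + negatives))  # negatives expose the shortcut
```

On positive-only data the shortcut scores 1.0, so training has no pressure to learn the content; on the mixed set it drops to the fraction of positive questions.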

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval Augmented Generation (RAG)
Familiarity with Parameter-Efficient Finetuning (PEFT) and LoRA
Knowledge of catastrophic forgetting in sequential learning

Key Terms

PLUM: Pipeline for Learning User Conversations in Large Language Models—the proposed method for injecting memory via synthetic data and finetuning

LoRA: Low-Rank Adaptation—a PEFT technique that freezes pre-trained weights and injects trainable rank decomposition matrices to reduce trainable parameters

RAG: Retrieval Augmented Generation—systems that retrieve documents to augment the context window of an LLM

PEFT: Parameter-Efficient Finetuning—methods to adapt LLMs by updating only a small subset of parameters

Teacher Forcing: A training method where the model is fed the actual previous ground-truth tokens rather than its own generated predictions

Catastrophic Forgetting: The tendency of a neural network to abruptly lose previously learned information when trained on new information

Weighted Cross Entropy: A loss function modification where specific tokens (like the answer) contribute more to the gradient than others (like the system prompt)