MemoryBank: Enhancing Large Language Models with Long-Term Memory

📝 Paper Summary

Memory recall, Memory organization, User modeling

MemoryBank equips Large Language Models with a long-term memory system that stores interaction history, summarizes events, and updates memory strength over time using the Ebbinghaus Forgetting Curve.

Core Problem

LLMs lack a built-in long-term memory mechanism, making them unable to recall historical interactions, maintain long-term context, or adapt to user personalities over time.

Why it matters:

Essential for sustained interaction scenarios like personal companions, psychological counseling, and secretarial assistants
Current LLMs cannot build rapport by referencing past shared experiences or evolving user understanding
Absence of memory leads to repetitive or contextually unaware responses in long-term dialogues

Concrete Example: A user tells an AI they broke up with their girlfriend on Monday. On Friday, when the user mentions feeling sad, a standard LLM asks 'Why?' or gives generic advice, failing to recall the breakup event. SiliconFriend recalls the breakup and the user's personality to offer specific emotional support.

Key Novelty

MemoryBank Mechanism with Ebbinghaus Forgetting

Introduces a dual-level memory storage: raw conversation logs and high-level summaries (events and user portraits)
Implements a memory updating mechanism inspired by the Ebbinghaus Forgetting Curve, where memory strength decays over time unless reinforced by recall, mimicking human forgetting
Uses a 'SiliconFriend' framework that tunes models on psychological data and retrieves relevant memories to generate empathetic, personalized responses

Architecture

The overall architecture of MemoryBank and its integration into SiliconFriend

Evaluation Highlights

ChatGLM reaches 0.84 Retrieval Accuracy on Chinese long-term memory tasks, roughly +0.10 over a baseline of 0.74 (both the baseline value and the gain are inferred from comparison context, not reported directly)
SiliconFriend-ChatGPT achieves a 0.912 Contextual Coherence score on English memory probing tasks, well above the open-source backbones
Demonstrates strong generalization across languages (English/Chinese) and model types (open-source ChatGLM/BELLE vs. closed-source ChatGPT)

Breakthrough Assessment

7/10

Novel integration of psychological forgetting curves into LLM memory management. While the architecture is a straightforward RAG variant, the forgetting mechanism and specific application to psychological companionship are well-executed contributions.

⚙️ Technical Details

Problem Definition

Setting: Long-term open-domain conversation with memory retrieval and personality adaptation

Inputs: Current user query q and historical interaction logs

Outputs: Response r that incorporates relevant past memories and adapts to user personality

Pipeline Flow

Memory Storage (Stores logs, summaries, portraits)
Memory Retrieval (Finds relevant past info)
Memory Updating (Forgetting curve application)
Response Generation (LLM produces answer)

System Modules

Memory Storage

Stores raw dialogs, hierarchical event summaries, and dynamic user portraits

Model or implementation: Database / Vector Index
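
The three storage layers described above could be laid out roughly as follows. This is a minimal sketch, not the released code: the class and field names (`MemoryPiece`, `MemoryStore`, `strength`, `last_recalled`) are hypothetical, and modeling the "hierarchical" summaries as per-day summaries plus one overall summary is an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class MemoryPiece:
    """One stored dialog turn (hypothetical schema)."""
    day: date
    speaker: str
    text: str
    strength: int = 1                      # S in the forgetting curve (assumed start value)
    last_recalled: Optional[date] = None   # None until the piece is first retrieved


@dataclass
class MemoryStore:
    """Raw logs plus higher-level summaries and a user portrait."""
    dialogs: List[MemoryPiece] = field(default_factory=list)
    daily_summaries: Dict[date, str] = field(default_factory=dict)
    global_summary: str = ""
    user_portrait: str = ""

    def add_turn(self, day: date, speaker: str, text: str) -> MemoryPiece:
        # Append a raw dialog turn to the log layer.
        piece = MemoryPiece(day, speaker, text)
        self.dialogs.append(piece)
        return piece

    def turns_on(self, day: date) -> List[MemoryPiece]:
        # All raw turns recorded on a given day, e.g. as input to a daily summary.
        return [p for p in self.dialogs if p.day == day]
```

In a setup like this, an LLM call would fill `daily_summaries` and `user_portrait` from the raw turns; that step is omitted here.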

Memory Retriever

Retrieves relevant memory pieces based on current context

Model or implementation: Dual-tower dense retrieval (MiniLM for English, Text2vec for Chinese) with FAISS
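
The retrieval step can be illustrated with a toy example. The sketch below swaps in a bag-of-words counter for the MiniLM/Text2vec encoder and a brute-force scan for the FAISS index, so it only shows the shape of the technique (encode query and memories separately, rank by cosine similarity). The function names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'encoder'; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memories, k=2):
    """Return the k memories most similar to the query (brute force)."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


memories = [
    "user broke up with girlfriend on monday",
    "user asked for a python sorting algorithm",
    "user likes hiking on weekends",
]
top = retrieve("feeling sad about the girlfriend", memories, k=1)
```

Here `top` contains only the breakup memory, because it is the only entry sharing a token with the query. A dense encoder would also match paraphrases with no word overlap, which this toy version cannot do.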

Memory Updater

Updates memory strength and forgets items based on time elapsed and recall frequency

Model or implementation: Exponential decay (R = e^(-t/S), where R is memory retention, t is the time elapsed, and S is memory strength)
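
A small sketch of how the decay formula R = e^(-t/S) might drive forgetting. Several details are assumptions made for illustration: S starts at 1, a recall adds 1 to S and resets t to 0, and an item counts as forgotten once retention drops below a hypothetical threshold. The class name `MemoryItem` and its methods are likewise hypothetical.

```python
import math


def retention(t_days, strength):
    """Ebbinghaus-style retention R = exp(-t / S)."""
    return math.exp(-t_days / strength)


class MemoryItem:
    def __init__(self, text):
        self.text = text
        self.strength = 1     # S: assumed initial value
        self.elapsed = 0.0    # t: days since the item was stored or last recalled

    def tick(self, days):
        # Advance time without any recall.
        self.elapsed += days

    def recall(self):
        # Assumed reinforcement rule: strengthen the memory and restart the clock.
        self.strength += 1
        self.elapsed = 0.0

    def is_forgotten(self, threshold=0.1):
        # threshold is a hypothetical cutoff, not a value from the paper.
        return retention(self.elapsed, self.strength) < threshold


never_recalled = MemoryItem("asked for a book recommendation")
never_recalled.tick(3)        # R = exp(-3/1), about 0.05

reinforced = MemoryItem("broke up on Monday")
reinforced.tick(1)
reinforced.recall()           # S becomes 2, t resets to 0
reinforced.tick(3)            # R = exp(-3/2), about 0.22
```

With the same three days of silence, the item that was recalled once stays above the cutoff while the other falls below it, which is the behavior the updater is meant to capture.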

Response Generator

Generates final response using retrieved memory and user profile

Model or implementation: LLM (ChatGPT, ChatGLM, or BELLE)

Novel Architectural Elements

Ebbinghaus-inspired memory updater: A dedicated module that mathematically decays memory availability based on time (t) and strength (S)
Hierarchical memory injection: Prompt includes specific slots for 'Relevant Memory', 'User Portrait', and 'Event Summary' synthesized from raw logs
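
The slot-based prompt could be assembled along these lines. The template wording and the function name `build_prompt` are hypothetical; only the three slot names come from the description above.

```python
def build_prompt(query, relevant_memory, user_portrait, event_summary):
    """Fill the three memory slots and append the current user turn."""
    # One bullet per retrieved memory; fall back to a marker if nothing was retrieved.
    memory_block = "\n".join(f"- {m}" for m in relevant_memory) or "- (none)"
    return (
        "Relevant Memory:\n" + memory_block + "\n\n"
        "User Portrait:\n" + user_portrait + "\n\n"
        "Event Summary:\n" + event_summary + "\n\n"
        "User: " + query + "\nAssistant:"
    )


prompt = build_prompt(
    query="I feel sad today.",
    relevant_memory=["Monday: user said they broke up with their girlfriend"],
    user_portrait="Sensitive, values close relationships.",
    event_summary="The user went through a breakup earlier this week.",
)
```

The resulting string would be sent to whichever backbone is in use (ChatGPT, ChatGLM, or BELLE); the model call itself is left out of the sketch.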

Modeling

Base Model: Evaluated with ChatGPT (closed), ChatGLM-6B, and BELLE-7B (based on LLaMA)

Training Method: Supervised Fine-Tuning (SFT) via LoRA

Adaptation: LoRA (rank=16)

Training Data:

38k psychological dialogues parsed from online sources

Key Hyperparameters:

lora_rank: 16
epochs: 3
gpu: A100

Compute: Training performed on A100 GPU
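
To give a sense of what rank 16 means in practice, the arithmetic below compares LoRA's trainable parameters with full fine-tuning for a single weight matrix. The 4096 x 4096 layer size is an assumed, illustrative dimension and is not taken from the paper; the function names are hypothetical.

```python
def lora_trainable_params(d_out, d_in, rank):
    # LoRA trains two small matrices, B (d_out x rank) and A (rank x d_in),
    # while the original d_out x d_in weight stays frozen.
    return rank * (d_out + d_in)


def full_params(d_out, d_in):
    # Parameters updated if the whole weight matrix were fine-tuned.
    return d_out * d_in


d = 4096                                   # assumed hidden size, for illustration only
lora = lora_trainable_params(d, d, 16)     # 131072
full = full_params(d, d)                   # 16777216
ratio = lora / full                        # 0.0078125, i.e. under 1% of the layer
```

Under these assumptions each adapted layer trains fewer than 1% of the weights it would under full fine-tuning, which is why a single GPU can be enough for this kind of tuning.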

Comparison to Prior Work

vs. Standard LLMs: MemoryBank adds persistent storage and retrieval across days/sessions
vs. NTMs: MemoryBank uses semantic/textual memory compatible with LLM prompting rather than differentiable memory matrices
vs. Long-term Persona Memory: MemoryBank introduces Ebbinghaus forgetting curve to mimic natural memory decay vs. static storage
vs. MemGPT [not cited in paper]: MemoryBank focuses on psychological forgetting curves, whereas MemGPT focuses on OS-like memory hierarchy management

Limitations

Memory updating model is exploratory and highly simplified compared to real human cognitive processes
Forgetting curve implementation assumes discrete updates to strength S, which may not capture nuance
Performance of open-source backbones (ChatGLM/BELLE) lags behind ChatGPT in coherence even with MemoryBank

Reproducibility

Code: https://github.com/zhongwanjun/MemoryBank-SiliconFriend

Code and materials are released at the repository above. The system uses public models (ChatGLM, BELLE) and APIs (ChatGPT). The psychological dataset (38k dialogs) is mentioned, but exact source URLs or raw data files are not explicitly linked in the text.

📊 Experiments & Results

Evaluation Setup

Simulated long-term conversation (10 days) with 15 virtual users interacting with the system

Benchmarks:

Simulated Long-Term Dialogs (Memory Probing Question Answering) [New]

Metrics:

Memory Retrieval Accuracy (0/1)
Response Correctness (0/0.5/1)
Contextual Coherence (0/0.5/1)
Model Ranking Score
Statistical methodology: Not explicitly reported in the paper
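
The per-question label scales above can be turned into dataset-level scores as sketched here. Averaging over probing questions is an assumption about how the reported numbers are formed, and the dictionary keys and the function name `aggregate` are hypothetical.

```python
from statistics import mean

# Allowed label values per metric, following the scales listed above.
ALLOWED = {
    "retrieval_accuracy": {0, 1},
    "response_correctness": {0, 0.5, 1},
    "contextual_coherence": {0, 0.5, 1},
}


def aggregate(annotations):
    """Average per-question labels into one score per metric."""
    scores = {}
    for metric, allowed in ALLOWED.items():
        values = [a[metric] for a in annotations]
        invalid = [v for v in values if v not in allowed]
        if invalid:
            raise ValueError(f"{metric}: labels outside scale: {invalid}")
        scores[metric] = mean(values)
    return scores


annotations = [
    {"retrieval_accuracy": 1, "response_correctness": 1, "contextual_coherence": 0.5},
    {"retrieval_accuracy": 0, "response_correctness": 0.5, "contextual_coherence": 0.5},
]
scores = aggregate(annotations)
```

For these two toy annotations the scores are 0.5 for retrieval accuracy, 0.75 for correctness, and 0.5 for coherence.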

Key Results

Comparative performance of SiliconFriend variants (using different LLM backbones) on the English memory probing dataset.

Benchmark	Metric	Baseline	This Paper	Δ
Simulated Dialogs (English)	Retrieval Accuracy	0.809	0.763	-0.046
Simulated Dialogs (English)	Coherence	0.68	0.912	+0.232

Comparative performance of SiliconFriend variants on the Chinese memory probing dataset.

Benchmark	Metric	Baseline	This Paper	Δ
Simulated Dialogs (Chinese)	Retrieval Accuracy	0.84	0.856	+0.016
Simulated Dialogs (Chinese)	Correctness	0.418	0.655	+0.237

Experiment Figures

Qualitative comparison of SiliconFriend ChatGLM vs. standard ChatGLM in a psychological counseling scenario

Example of SiliconFriend BELLE successfully recalling specific past details (book recommendations, code requests) and identifying false memories

Main Takeaways

SiliconFriend-ChatGPT consistently achieves the highest correctness and coherence scores across both languages, validating the framework's effectiveness when paired with a strong base model
Open-source models (BELLE, ChatGLM) achieve competitive or superior Retrieval Accuracy compared to ChatGPT, but lag significantly in Response Correctness and Coherence
Qualitative analysis confirms the system can recall specific details (book names, algorithms) from days prior and correctly identify events that did not happen
The system successfully adapts recommendations based on synthesized user portraits (e.g., suggesting hiking for an 'outdoor' personality vs. museums for a 'curious' personality)

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Basic understanding of Transformer-based LLMs
Dense vector retrieval (embeddings)

Key Terms

Ebbinghaus Forgetting Curve: A psychological theory describing how memory retention declines exponentially over time unless information is reviewed or recalled

MemoryBank: The proposed module containing memory storage, retrieval, and updating mechanisms

SiliconFriend: The specific chatbot application built using MemoryBank and tuned on psychological dialogs

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices

FAISS: Facebook AI Similarity Search—a library for efficient similarity search and clustering of dense vectors

Dual-tower dense retrieval: A retrieval architecture where queries and documents are encoded separately into vectors, and relevance is calculated via dot product or cosine similarity

GLM: General Language Model—the architecture underlying the ChatGLM model

BELLE: An open-source bilingual language model optimized for Chinese conversation

RAG: Retrieval-Augmented Generation—enhancing model responses by fetching relevant external data