PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering

📝 Paper Summary

Memory organization, Memory recall

PerLTQA is a comprehensive dataset and evaluation framework that integrates both semantic (profiles, relationships) and episodic (events, dialogues) memories to benchmark LLM capabilities in personalized long-term memory question answering.

Core Problem

Existing QA and dialogue datasets typically focus on either world knowledge (semantic) or session history (episodic) in isolation, lacking a unified resource that combines personal profiles, social relationships, events, and dialogues for long-term memory evaluation.

Why it matters:

Personalized assistants require integrated access to both static facts (semantic) and dynamic history (episodic) to generate human-like responses
Current benchmarks do not adequately test an LLM's ability to distinguish between and synthesize different memory types (e.g., retrieving a friend's name vs. recalling a specific shared event)
Research gaps exist in explicit annotations for social relationships and event-based episodic memory within a single QA framework

Concrete Example: A user asks, 'Who did I go hiking with last year?' To answer, the system must retrieve the specific event (episodic) and link it to the person involved (semantic relationship), whereas current systems might only look at recent dialogue history or generic facts.

Key Novelty

Unified Semantic-Episodic Memory Benchmark & Pipeline

Constructs a semi-synthetic dataset (PerLTQA) comprising 141 characters with detailed profiles, social webs, life events, and historical dialogues using an in-context generation approach
Proposes a three-stage evaluation framework: Memory Classification (identify memory type needed), Memory Retrieval (re-rank based on type), and Memory Synthesis (generate answer)
Introduces 'memory anchors'—annotated key text segments in answers—to precisely evaluate whether the model used the correct retrieved memory during synthesis

Architecture

The proposed framework for memory integration in QA, showing the flow from question to answer via classification, retrieval, and synthesis.

Evaluation Highlights

Fine-tuned BERT classifiers are substantially more accurate at memory type classification than the prompted LLMs ChatGLM3 and ChatGPT
Memory classification assists retrieval: using classification probabilities to re-rank memories improves retrieval performance
Retrieval accuracy is critical but not sufficient: LLMs vary in synthesis quality even when given gold (oracle) memories, which points to a need for better memory integration

Breakthrough Assessment

7/10

Provides a much-needed comprehensive dataset bridging semantic and episodic memory for personalization. While the method (pipeline) is standard, the data resource and granular annotation (memory anchors) are significant contributions.

⚙️ Technical Details

Problem Definition

Setting: Personalized Question Answering with Long-Term Memory Integration

Inputs: User question q and a personal memory database M containing semantic (profiles, relationships) and episodic (events, dialogues) entries

Outputs: Natural language answer a that incorporates relevant information from M
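
The inputs described above can be pictured as a small typed store. The following is a minimal sketch under stated assumptions, not the dataset's actual schema: the class names MemoryEntry and MemoryDatabase, the field names, and the sample entries are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# The two coarse memory types and their sub-kinds, as listed in the summary.
SEMANTIC_KINDS = {"profile", "relationship"}
EPISODIC_KINDS = {"event", "dialogue"}


@dataclass
class MemoryEntry:
    """One entry of the personal memory database M (hypothetical schema)."""
    kind: str  # "profile", "relationship", "event", or "dialogue"
    text: str  # natural-language content of the memory

    @property
    def memory_type(self) -> str:
        """Map the fine-grained kind to 'semantic' or 'episodic'."""
        if self.kind in SEMANTIC_KINDS:
            return "semantic"
        if self.kind in EPISODIC_KINDS:
            return "episodic"
        raise ValueError(f"unknown memory kind: {self.kind!r}")


@dataclass
class MemoryDatabase:
    """All memories belonging to a single character."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def by_type(self, memory_type: str) -> List[MemoryEntry]:
        """Return the entries whose coarse type matches memory_type."""
        return [e for e in self.entries if e.memory_type == memory_type]


# Invented sample data, for illustration only.
M = MemoryDatabase([
    MemoryEntry("relationship", "Alex is the user's college friend."),
    MemoryEntry("event", "Last year the user went hiking with Alex."),
])
```

Keeping the fine-grained kind on each entry and deriving the coarse type from it lets one store serve both the classifier (semantic vs. episodic) and any later analysis by sub-kind.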

Pipeline Flow

Memory Classification (Determine if question needs semantic or episodic memory)
Memory Retrieval (Retrieve top-k candidates from database)
Memory Synthesis (Generate answer using retrieved context)
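
The three stages above can be sketched as one function that wires pluggable components together. This is an illustrative outline only: answer_question and the toy_* helpers are hypothetical names, and the toy components (cue words, word overlap, a string template) merely stand in for the BERT classifier, the base retriever, and the LLM synthesizer.

```python
from typing import Callable, Dict, List, Tuple

Memory = Tuple[str, str]  # (memory_type, text); memory_type is "semantic" or "episodic"


def answer_question(
    question: str,
    memories: List[Memory],
    classify: Callable[[str], Dict[str, float]],
    retrieve: Callable[[str, List[Memory], Dict[str, float], int], List[Memory]],
    synthesize: Callable[[str, List[Memory]], str],
    k: int = 3,
) -> str:
    type_probs = classify(question)                      # 1. memory classification
    top_k = retrieve(question, memories, type_probs, k)  # 2. memory retrieval
    return synthesize(question, top_k)                   # 3. memory synthesis


# --- Toy stand-ins (assumptions, not the paper's models) ---
def toy_classify(question: str) -> Dict[str, float]:
    # Crude substring cue check; a real system would use a trained classifier.
    cues = ("when", "last", "did")
    episodic = 0.8 if any(c in question.lower() for c in cues) else 0.2
    return {"episodic": episodic, "semantic": 1.0 - episodic}


def toy_retrieve(question, memories, type_probs, k):
    q_words = set(question.lower().split())

    def score(memory: Memory) -> float:
        memory_type, text = memory
        overlap = len(q_words & set(text.lower().split()))
        return overlap * type_probs[memory_type]  # weight overlap by type probability

    return sorted(memories, key=score, reverse=True)[:k]


def toy_synthesize(question, top_k):
    return "Based on memory: " + " ".join(text for _, text in top_k)


memories = [
    ("semantic", "Alex is a college friend"),
    ("episodic", "went hiking with Alex last year"),
]
result = answer_question(
    "Who did I go hiking with last year?",
    memories, toy_classify, toy_retrieve, toy_synthesize, k=1,
)
print(result)  # Based on memory: went hiking with Alex last year
```

Passing the stages in as callables keeps the control flow fixed while allowing any classifier, retriever, or generator to be swapped in.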

System Modules

Memory Classifier

Predict the probability that a question relates to a specific memory type (semantic vs. episodic)

Model or implementation: BERT (fine-tuned) or LLM (ChatGLM3/ChatGPT via prompting)

Memory Retriever

Retrieve and re-rank memory entries based on similarity and classification score

Model or implementation: Contriever / DPR / BM25 (as base retrievers)

Memory Synthesizer

Generate the final natural language answer using top-k retrieved memories

Model or implementation: LLMs (ChatGLM3, ChatGPT, Llama-2-chat, etc.)

Novel Architectural Elements

Integration of an explicit memory classification step that re-weights retrieval scores based on the predicted need for semantic vs. episodic memory
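
One way such re-weighting could work is sketched below. The summary does not give the exact scoring formula, so the linear blend, the alpha parameter, the rerank function name, and the assumption that similarities lie in [0, 1] are all illustrative choices rather than the paper's method.

```python
from typing import Dict, List, Tuple

# (memory_type, text, similarity); similarity comes from a base retriever
# and is assumed here to be scaled to [0, 1].
Candidate = Tuple[str, str, float]


def rerank(
    candidates: List[Candidate],
    type_probs: Dict[str, float],
    alpha: float = 0.5,
) -> List[Candidate]:
    """Re-rank candidates by blending similarity with the classifier's type probability.

    final = (1 - alpha) * similarity + alpha * P(memory_type | question)
    (hypothetical blend; alpha = 0 reduces to the base retriever's order)
    """
    def final_score(candidate: Candidate) -> float:
        memory_type, _, similarity = candidate
        return (1.0 - alpha) * similarity + alpha * type_probs[memory_type]

    return sorted(candidates, key=final_score, reverse=True)


# Invented scores for the hiking example.
candidates = [
    ("semantic", "Alex is the user's college friend.", 0.62),
    ("episodic", "Last year the user went hiking with Alex.", 0.58),
]
type_probs = {"semantic": 0.1, "episodic": 0.9}

reranked = rerank(candidates, type_probs)
```

In this toy case the base retriever slightly prefers the semantic entry (0.62 vs. 0.58), but the classifier's strong episodic prediction moves the hiking event to the top.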

Modeling

Base Model: Various (BERT for classification; ChatGLM3, ChatGPT, Llama-2-chat, Baichuan2, Qwen-chat for synthesis)

Training Method: Supervised Fine-Tuning (for the BERT classifier only)

Adaptation: Full fine-tuning of BERT

Training Data:

141 generated characters
1,339 social relationships
4,501 events
3,409 dialogues
8,593 QA pairs

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemoryBank: PerLTQA includes explicit structured semantic memory (profiles, relationships) alongside episodic events, whereas MemoryBank focuses on summarizing dialogue history.
vs. MSC: PerLTQA incorporates a wider range of memory types including social relationships and distinct life events, rather than just multi-session conversation logs.
vs. Standard RAG: Introduces a classification-aware retrieval step to distinguish between memory types before generation.

Limitations

Dataset is semi-synthetic (generated by ChatGPT) rather than collected from real human users, which may limit how natural the data is.
Memory anchor annotation was manually verified for only 30 characters due to labor intensity.
Evaluation relies heavily on LLM-based metrics (GPT-4) and exact match of anchors, both of which may introduce bias.

Reproducibility

Code: https://github.com/Elvin-Yiming-Du/PerLTQA

Dataset and code are publicly available at https://github.com/Elvin-Yiming-Du/PerLTQA. The paper details the prompt templates used for data generation in the Appendix. The specific trained weights for the BERT classifier are not explicitly linked but code is provided.

📊 Experiments & Results

Evaluation Setup

Retrieval-Augmented Question Answering over a personalized memory database

Benchmarks:

PerLTQA (Personalized QA with Long-Term Memory) [New]

Metrics:

Precision, Recall, F1, Accuracy (for Classification)
Recall@K (for Retrieval)
Mean Average Precision (MAP) of Memory Anchors (for Synthesis)
Correctness (GPT-4 eval)
Coherence (GPT-4 eval)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Memory Classification results demonstrate that specialized smaller models (BERT) outperform general-purpose LLMs in categorizing memory types.
PerLTQA	Accuracy	0.4025	0.9634	+0.5609
PerLTQA	Accuracy	0.3340	0.9634	+0.6294
Memory Retrieval experiments show the effectiveness of different retrievers on the PerLTQA dataset.
PerLTQA	Recall@10	0.584	0.767	+0.183
PerLTQA	Recall@10	0.627	0.767	+0.140
Memory Synthesis results comparing LLM performance when provided with gold-standard (Oracle) memories.
PerLTQA	Correctness (GPT-4 score 1-5)	4.17	4.82	+0.65

Main Takeaways

Specialized discriminative models (BERT) vastly outperform generative LLMs (ChatGPT, ChatGLM3) in the specific task of classifying memory types (semantic vs. episodic).
Contriever outperforms BM25 and DPR in retrieving personalized memories. Because DPR is itself a dense retriever, this result favors Contriever specifically rather than dense retrieval in general.
Even with perfect memory retrieval (Oracle), there is a performance gap between models, with ChatGPT showing superior synthesis capabilities compared to Llama-2 and others.

📚 Prerequisite Knowledge

Prerequisites

Distinction between Semantic and Episodic memory in cognitive science
Retrieval-Augmented Generation (RAG) pipelines
Basic understanding of LLM prompting and in-context learning

Key Terms

Semantic Memory: Long-term memory involving facts and world knowledge, specifically profiles and social relationships in this paper

Episodic Memory: Long-term memory involving personal experiences, specifically events and historical dialogues in this paper

Memory Anchor: A key text segment within a reference answer that directly aligns with the specific memory required to answer the question, used for precise evaluation
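
A simple exact-match check for anchors might look like the sketch below. The function names anchor_hits and anchor_coverage, the case-insensitive and whitespace-normalized matching, and the sample strings are assumptions for illustration; the coverage ratio is not the paper's reported metric.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def anchor_hits(answer: str, anchors: List[str]) -> List[bool]:
    """For each anchor, report whether it appears verbatim in the answer."""
    normalized_answer = _normalize(answer)
    return [_normalize(anchor) in normalized_answer for anchor in anchors]


def anchor_coverage(answer: str, anchors: List[str]) -> float:
    """Fraction of anchors found in the answer (0.0 when there are no anchors)."""
    hits = anchor_hits(answer, anchors)
    return sum(hits) / len(hits) if hits else 0.0


# Invented example: one anchor is present, one is missing.
answer = "You went hiking with Alex last year."
anchors = ["Alex", "Mount Tai"]
print(anchor_hits(answer, anchors))      # [True, False]
print(anchor_coverage(answer, anchors))  # 0.5
```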

In-context Learning: A prompting technique where the model is given examples or context within the input to guide its generation without updating weights

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the generated answer on them

MAP: Mean Average Precision—a metric used to evaluate the correctness of the generated memory anchors
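
For reference, the standard ranked-list definition of MAP is sketched below. How the paper adapts it to memory anchors is not specified in this summary, so this shows only the generic information-retrieval form; the function names are hypothetical, and average precision is normalized by the number of relevant items present in the list.

```python
from typing import List


def average_precision(relevance: List[bool]) -> float:
    """AP of one ranked list; relevance[i] is True if the item at rank i + 1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: List[List[bool]]) -> float:
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


# Invented relevance judgments for two queries.
runs = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```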

Recall@K: The proportion of all relevant items that appear among the top-K retrieved results
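
A minimal sketch of the Recall@K computation follows. The function name recall_at_k and the memory identifiers are hypothetical; returning 0.0 when a query has no relevant items is a convention chosen here, not something the summary specifies.

```python
from typing import List, Set


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the relevant items that appear in the first k retrieved results."""
    if not relevant_ids:
        return 0.0  # convention for queries without relevant items (assumption)
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant_ids) / len(relevant_ids)


# Invented ranking: m1 is retrieved at rank 2, m2 only at rank 4.
ranked = ["m3", "m1", "m7", "m2"]
relevant = {"m1", "m2"}
print(recall_at_k(ranked, relevant, k=2))  # 0.5
print(recall_at_k(ranked, relevant, k=4))  # 1.0
```

The denominator is the number of relevant items, not K, which is what distinguishes recall from precision at the same cutoff.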