MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants

📝 Paper Summary

Memory recall, Evaluation methodology

MemSim creates reliable, diverse, and scalable memory evaluation datasets for LLM agents by simulating user profiles via a Bayesian network and generating causally consistent messages and questions.

Core Problem

Evaluating the long-term memory of LLM personal assistants is difficult because existing methods lack scalability (human annotation) or reliability (LLM-generated datasets suffer from hallucinations).

Why it matters:

Manual annotation of user-agent message histories is expensive and unscalable
Existing LLM-generated datasets often contain factual errors (ground-truth correctness is generally below 90% and falls below 40% in complex cases)
Standard LLM generation lacks diversity, tending to produce only the most plausible user profiles

Concrete Example: If a simulator generates a user message 'I am 30 years old', but then hallucinates an answer '25' to the question 'How old am I?', the evaluation dataset becomes invalid. Vanilla LLM generation frequently fails on aggregative questions like 'How many people are under 35?' due to such hallucinations.

Key Novelty

Bayesian-Causal Data Synthesis

Uses a Bayesian Relation Network (BRNet) to model probabilistic dependencies between user attributes (e.g., age influences occupation), ensuring diverse and consistent profile sampling
Separates the 'truth' generation (sampling structured hints from BRNet) from the 'text' generation (LLM rewriting), preventing the LLM from hallucinating facts during dataset creation

Architecture

The MemSim pipeline: (a) Bayesian Relation Network for profile sampling, (b) Causal Generation Mechanism for creating consistent messages and QAs, (c) The resulting MemDaily dataset structure.

Evaluation Highlights

The generated MemDaily dataset achieves 99-100% correctness on ground-truth answers, far above vanilla LLM generation (which drops below 40% on complex tasks)
Maintains high diversity in user profiles compared to vanilla LLM methods, which suffer from mode collapse
Provides a benchmark showing current LLMs (e.g., GPT-4) still struggle with aggregative and multi-hop memory retrieval tasks

Breakthrough Assessment

8/10

Addresses a critical bottleneck in agent evaluation (data reliability) with a theoretically grounded method (Bayesian networks). The resulting dataset reports 99-100% ground-truth correctness, against 36-86% for the baselines it is compared with.

⚙️ Technical Details

Problem Definition

Setting: Automatic generation of evaluation datasets (User Messages, Questions, Answers, Ground Truth Indices) for assessing memory capabilities of LLM agents

Inputs: Scenario schema (entities and attributes defined in a Bayesian Network)

Outputs: A set of trajectories ξ = (M, q, a, a', h), where M is the list of user messages, q is a question, a is its ground-truth answer, and h is the supporting evidence (the ground-truth indices of the messages needed to answer q)
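As a rough illustration, one trajectory ξ could be held in a small record type. The class name `Trajectory`, the field names, and the types below are assumptions made for this sketch, not the schema of the released dataset. The `a_prime` field only mirrors the a' slot of the tuple and assumes nothing about its meaning, and treating h as a list of message indices is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical container for one trajectory (M, q, a, a', h)."""
    messages: List[str]   # M: user messages, in order
    question: str         # q
    answer: str           # a: ground-truth answer
    a_prime: Any          # a': kept as an opaque slot in this sketch
    evidence: List[int]   # h: assumed to be indices into `messages`

    def evidence_messages(self) -> List[str]:
        """Return the messages that the evidence indices point to."""
        return [self.messages[i] for i in self.evidence]


example = Trajectory(
    messages=["I am 30 years old.", "I live in Paris."],
    question="How old am I?",
    answer="30",
    a_prime=None,
    evidence=[0],
)
```

Keeping the evidence as indices rather than copied text lets a retrieval step be scored separately from the final answer.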

Pipeline Flow

Profile Generation: BRNet Sampling → Hints Construction
Content Generation: Hints → LLM Rewriting → User Messages & QA Pairs

System Modules

Bayesian Relation Network (BRNet) (Profile Generation)

Models and samples consistent, diverse hierarchical user profiles (entities and attributes)

Model or implementation: Probabilistic Graphical Model (DAG)
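A minimal sketch of sampling a profile from a small DAG by ancestral sampling, which may help make the BRNet idea concrete. The two-node network, the attribute names, and every probability below are invented for illustration; they are not taken from the paper or from the MemSim code.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy network: age_group -> occupation.
# Each node lists its parents; the order of the keys is deliberately not topological.
PARENTS: Dict[str, List[str]] = {
    "occupation": ["age_group"],
    "age_group": [],
}

# Conditional probability tables, keyed by the tuple of parent values (made-up numbers).
CPT: Dict[str, Dict[Tuple[str, ...], Dict[str, float]]] = {
    "age_group": {(): {"18-29": 0.4, "30-59": 0.6}},
    "occupation": {
        ("18-29",): {"student": 0.7, "engineer": 0.3},
        ("30-59",): {"engineer": 0.8, "retired": 0.2},
    },
}


def topological_order(parents: Dict[str, List[str]]) -> List[str]:
    """Order nodes so that every parent comes before its children (assumes a DAG)."""
    order: List[str] = []
    seen = set()

    def visit(node: str) -> None:
        if node in seen:
            return
        for parent in parents[node]:
            visit(parent)
        seen.add(node)
        order.append(node)

    for node in parents:
        visit(node)
    return order


def sample_profile(rng: random.Random) -> Dict[str, str]:
    """Ancestral sampling: draw each node given the already-sampled parent values."""
    profile: Dict[str, str] = {}
    for node in topological_order(PARENTS):
        key = tuple(profile[p] for p in PARENTS[node])
        dist = CPT[node][key]
        values = list(dist)
        weights = [dist[v] for v in values]
        profile[node] = rng.choices(values, weights=weights, k=1)[0]
    return profile


rng = random.Random(0)
profiles = [sample_profile(rng) for _ in range(200)]
```

Because each attribute is drawn only from the table row selected by its parents, combinations with zero probability (here, a retired 18-29 year old) cannot appear, while low-probability but valid profiles still do.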

Hint Constructor (Profile Generation)

Selects specific attribute values from the profile to serve as the 'ground truth' facts

Model or implementation: Rule-based selection
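One way to picture rule-based hint selection is as a filter over sampled profiles followed by a pick of k facts. The `Hint` triple, the function `construct_hints`, and the selection rule (keep entities that have the requested attribute, then sample k of them) are assumptions for this sketch; the paper's actual rules may differ.

```python
import random
from typing import Dict, List, NamedTuple


class Hint(NamedTuple):
    """Hypothetical structured fact: (entity, attribute, value)."""
    entity: str
    attribute: str
    value: str


def construct_hints(
    profiles: Dict[str, Dict[str, str]],
    attribute: str,
    k: int,
    rng: random.Random,
) -> List[Hint]:
    """Pick up to k entities that have `attribute` and emit one hint per entity."""
    candidates = sorted(e for e, attrs in profiles.items() if attribute in attrs)
    chosen = rng.sample(candidates, min(k, len(candidates)))
    return [Hint(e, attribute, profiles[e][attribute]) for e in chosen]


profiles = {
    "Alice": {"age": "30", "city": "Paris"},
    "Bob": {"age": "41"},
    "Carol": {"city": "Paris"},
}
hints = construct_hints(profiles, "age", k=2, rng=random.Random(0))
```

Every hint copies its value straight from the profile, so the selected facts are true by construction before any language model is involved.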

Message Generator (Content Generation)

Rewrites structured hints into natural language user messages

Model or implementation: LLM (Specific model not fixed, uses prompting)

QA Constructor (Content Generation)

Generates questions and answers based on the selected hints and QA type (e.g., Multi-hop, Comparative)

Model or implementation: LLM (Specific model not fixed, uses prompting)
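For an aggregative question, the answer can be computed from the structured hints rather than generated as text. The sketch below assumes hints are `(entity, attribute, value)` tuples with integer ages and uses a fixed question template; in a fuller pipeline an LLM might only rephrase the question. The function name `aggregative_qa` is hypothetical.

```python
from typing import List, Tuple

Hint = Tuple[str, str, int]  # assumed shape: (entity, attribute, value)


def aggregative_qa(hints: List[Hint], threshold: int) -> Tuple[str, str, List[int]]:
    """Build a counting question whose answer is derived from the hints themselves."""
    question = f"How many people are under {threshold}?"
    # Indices of the hints that contribute to the count.
    evidence = [
        i for i, (_, attribute, value) in enumerate(hints)
        if attribute == "age" and value < threshold
    ]
    answer = str(len(evidence))
    return question, answer, evidence


hints = [("Alice", "age", 30), ("Bob", "age", 41), ("Carol", "age", 25)]
question, answer, evidence = aggregative_qa(hints, threshold=35)
```

Here the count comes from a deterministic pass over the facts, so the answer cannot disagree with the hints it was built from.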

Novel Architectural Elements

Causal generation mechanism: Deriving both natural language messages and ground truth QAs from a shared intermediate 'Hint' representation to guarantee consistency
Hierarchical profile modeling via BRNet to enforce dependency constraints between attributes during synthetic user creation
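The shared-hint idea can be sketched as two functions that read the same hint list: one produces the messages, the other the question, answer, and evidence. The template rewriter below is a stand-in for the LLM rewriting step, and the `value_preserved` check is an extra safeguard added for this illustration, not something the summary attributes to the paper.

```python
from typing import Callable, List, Tuple

Hint = Tuple[str, str, str]  # assumed shape: (entity, attribute, value)


def template_rewriter(hint: Hint) -> str:
    """Stand-in for LLM rewriting; a real system would prompt a model here."""
    entity, attribute, value = hint
    return f"By the way, {entity}'s {attribute} is {value}."


def generate(
    hints: List[Hint],
    rewrite: Callable[[Hint], str],
    target: int,
) -> Tuple[List[str], str, str, List[int]]:
    """Derive messages and a single-fact QA from the same hints."""
    messages = [rewrite(h) for h in hints]
    entity, attribute, value = hints[target]
    question = f"What is {entity}'s {attribute}?"
    return messages, question, value, [target]


def value_preserved(hints: List[Hint], messages: List[str]) -> bool:
    """Check that each rewritten message still contains its hint's value."""
    return all(h[2] in m for h, m in zip(hints, messages))


hints = [("Alice", "age", "30"), ("Alice", "city", "Paris")]
messages, question, answer, evidence = generate(hints, template_rewriter, target=1)
```

The answer is read from the hint, never from the rewritten message, so a paraphrase can change the wording without changing the ground truth.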

Reproducibility

Code: https://github.com/nuster1128/MemSim

The paper details the MemDaily dataset construction (11 entities, 73 attributes). The specific LLM used for generation in the experiments is not explicitly named in the main text; the 'LLM rewriting' steps presumably rely on a capable model such as GPT-4.

📊 Experiments & Results

Evaluation Setup

Benchmarking memory mechanisms on the synthetic 'MemDaily' dataset.

Benchmarks:

MemDaily (Long-term memory QA) [New]

Metrics:

Accuracy (Correctness of generated dataset)
Diversity (Distinct N-grams)
Prediction Accuracy (of memory models)
Statistical methodology: Not explicitly reported in the paper
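A distinct n-gram diversity score is commonly computed as the number of unique n-grams divided by the total number of n-grams. The sketch below follows that common definition; the lowercasing, the whitespace tokenization, and the function name `distinct_n` are assumptions, and the paper's exact formulation may differ.

```python
from typing import List, Set, Tuple


def distinct_n(texts: List[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams across all texts."""
    total = 0
    unique: Set[Tuple[str, ...]] = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


repetitive = distinct_n(["a b a b"], 2)   # bigrams: (a,b), (b,a), (a,b)
varied = distinct_n(["a b c"], 1)         # unigrams: a, b, c
```

A score near 1.0 means almost no n-gram repeats, while a low score points to repeated phrasing of the kind associated with mode collapse.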

Key Results

Analysis of the reliability (correctness) of the generated dataset itself compared to baselines:
Benchmark	Metric	Baseline	This Paper	Δ
MemDaily (Reliability)	Correctness (%)	36.0	100.0	+64.0
MemDaily (Reliability)	Correctness (%)	86.0	99.0	+13.0

Benchmarking of various memory mechanisms using the generated MemDaily dataset:
Benchmark	Metric	Baseline	This Paper	Δ
MemDaily (Simp.)	Accuracy	56.3	93.3	+37.0
MemDaily (Aggr.)	Accuracy	39.0	55.0	+16.0

Main Takeaways

Vanilla LLMs severely hallucinate when generating complex evaluation datasets (e.g., aggregative QAs), validating the need for MemSim's structured approach.
Existing memory mechanisms (including RAG and MemoryBank) struggle significantly with 'Aggregative' and 'Post-processing' tasks, often achieving <60% accuracy even with strong backbones like GPT-4.
MemSim successfully generates diverse user profiles, avoiding the 'average user' bias common in direct LLM sampling.
The dataset construction separates the 'logic' (hints) from the 'expression' (text), preventing conflicts between the message history and the ground truth answers.

📚 Prerequisite Knowledge

Prerequisites

Bayesian Networks (DAGs, ancestral sampling)
Large Language Models (hallucination, prompting)
Retrieval-Augmented Generation (RAG)

Key Terms

BRNet: Bayesian Relation Network—a directed acyclic graph modeling probabilistic dependencies between user attributes (e.g., age → income)

Ancestral Sampling: A method to sample values from a Bayesian network by traversing the graph from parent nodes to children, respecting conditional probabilities

Aggregative QA: Questions requiring the agent to retrieve multiple pieces of information and perform a calculation (e.g., 'How many friends live in Paris?')

TPM: Tokens Per Message—a metric indicating the length/verbosity of generated user messages

Hallucination: In this context, when an LLM generates a ground truth answer for a dataset that contradicts the generated user messages

Multi-hop QA: Questions that cannot be answered by a single message but require connecting information across multiple messages (e.g., bridging entities)
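
A small sketch of what bridging entities means in practice: the answer is reached by following two facts that would sit in different messages. The fact table, the relation names, and the helper `multi_hop` are hypothetical and exist only to illustrate the term.

```python
from typing import Dict, List, Tuple

# Hypothetical facts, each of which would come from a separate user message.
facts: Dict[Tuple[str, str], str] = {
    ("Alice", "boss"): "Bob",
    ("Bob", "city"): "Paris",
}


def multi_hop(
    facts: Dict[Tuple[str, str], str],
    start: str,
    relations: List[str],
) -> Tuple[str, List[Tuple[str, str]]]:
    """Follow a chain of relations and record which facts were needed."""
    current = start
    path: List[Tuple[str, str]] = []
    for relation in relations:
        path.append((current, relation))
        current = facts[(current, relation)]
    return current, path


# "In which city does Alice's boss live?" needs both facts; Bob is the bridge entity.
answer, path = multi_hop(facts, "Alice", ["boss", "city"])
```

Neither fact alone answers the question, which is why a memory mechanism that retrieves only one message tends to fail on this type.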