On memory construction and retrieval for personalized conversational agents

📝 Paper Summary

Memory recall, Memory organization

SeCom improves long-term conversational memory by segmenting history into topically coherent units and compressing them to remove redundancy before retrieval, outperforming turn-level and session-level approaches.

Core Problem

Existing retrieval-augmented generation methods use suboptimal memory granularities: turn-level retrieval misses dispersed context, while session-level retrieval introduces irrelevant noise.

Why it matters:

Turn-level retrieval often fails when keywords are missing from a specific historical turn, leading to fragmentary context
Session-level retrieval brings in extraneous topics (e.g., a chat about World War II mixed with a chat about probability), distracting the LLM
Summarization-based memory often suffers from information loss and hallucination, degrading response quality in long-term interactions

Concrete Example: User asks 'What is the answer?' referring to a probability question asked 5 turns ago. Turn-level retrieval misses the original question because the current turn lacks keywords. Session-level retrieval pulls the entire chat, including an irrelevant debate about World War II, confusing the model.

Key Novelty

Segment-Level Memory with Compression-Based Denoising (SeCom)

Partitions long conversations into 'segments'—topically coherent blocks larger than a turn but smaller than a session—using an LLM-based segmenter refined via self-reflection
Applies prompt compression (LLMLingua-2) to these segments to remove the redundancy inherent in natural language, acting as a denoising step that improves retrieval accuracy while retaining the key content of each segment

Architecture

Overview of the SeCom framework: Segmentation → Compression → Retrieval → Generation

Evaluation Highlights

SeCom outperforms turn-level baselines by +4.8% and session-level baselines by +8.2% on the LOCOMO benchmark (averaged across metrics)
Achieves higher retrieval recall (Hit@1) than raw segments when the compression rate is >50% using LLMLingua-2
Segmentation model surpasses baselines on DialSeg711 with a +5.7 improvement in WindowDiff score

Breakthrough Assessment

7/10

Strong empirical evidence that segment-level granularity is the 'sweet spot' for memory. The combination with prompt compression as a denoiser is a clever, effective insight.

⚙️ Technical Details

Problem Definition

Setting: Long-term open-domain conversation requiring retrieval of relevant history for response generation

Inputs: Current user request u* and conversation history H consisting of multiple sessions

Outputs: Generated response r* based on retrieved memory context

Pipeline Flow

Conversation Segmentation (Pre-processing)
Memory Compression (Pre-processing)
Retrieval (Inference)
Generation (Inference)
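The four stages above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the authors' implementation: `build_memory` and `respond` are hypothetical names, and the `toy_*` functions are placeholders for the LLM segmenter, LLMLingua-2, the retriever, and the generator, included only so the sketch executes.

```python
from typing import Callable, List, Sequence


def build_memory(
    sessions: Sequence[Sequence[str]],
    segment: Callable[[Sequence[str]], List[List[str]]],
    compress: Callable[[str], str],
) -> List[str]:
    """Pre-processing: segment every session, then compress every segment."""
    memory = []
    for turns in sessions:
        for seg in segment(turns):
            memory.append(compress("\n".join(seg)))
    return memory


def respond(
    query: str,
    memory: Sequence[str],
    retrieve: Callable[[str, Sequence[str], int], List[str]],
    generate: Callable[[str, List[str]], str],
    top_n: int = 3,
) -> str:
    """Inference: retrieve the top-N memory units, then generate a reply."""
    context = retrieve(query, memory, top_n)
    return generate(query, context)


# Toy stand-ins (assumptions for illustration only).
def toy_segment(turns):
    # Fixed two-turn chunks; a real segmenter would split on topic shifts.
    return [list(turns[i:i + 2]) for i in range(0, len(turns), 2)]


def toy_compress(text):
    # Only normalizes whitespace; a real compressor would drop tokens.
    return " ".join(text.split())


def toy_retrieve(query, memory, n):
    # Ranks memory units by word overlap with the query.
    q = set(query.lower().split())
    return sorted(memory, key=lambda m: -len(q & set(m.lower().split())))[:n]


def toy_generate(query, context):
    # Echoes the query; a real generator would call an LLM.
    return f"[{len(context)} segments] {query}"
```

Keeping each stage behind a plain callable makes it easy to swap a turn-level or session-level splitter into `segment` and compare granularities with everything else fixed.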

System Modules

Conversation Segmenter (Memory Construction)

Split sessions into topically coherent segments

Model or implementation: GPT-4 (with self-reflection refinement) or Mistral-7B
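One way to represent a segmenter's output is as a list of turn indices at which a new topic starts. The sketch below turns such indices into segments; the representation and the name `split_by_boundaries` are assumptions for illustration, not taken from the paper.

```python
from typing import List, Sequence


def split_by_boundaries(turns: Sequence[str], boundaries: Sequence[int]) -> List[List[str]]:
    """Split `turns` into contiguous segments.

    `boundaries` holds the indices at which a new segment starts.
    Index 0 is implicit; out-of-range indices are ignored.
    """
    if not turns:
        return []
    starts = sorted({0, *[b for b in boundaries if 0 < b < len(turns)]})
    ends = starts[1:] + [len(turns)]
    return [list(turns[s:e]) for s, e in zip(starts, ends)]
```

With no boundaries the whole session comes back as one segment (session-level memory), and with a boundary at every index each turn becomes its own segment (turn-level memory), so both baselines are special cases of this representation.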

Memory Denoiser (Memory Construction)

Remove redundancy from segments to create clean memory units

Model or implementation: LLMLingua-2
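The denoising effect can be illustrated with a toy example. LLMLingua-2 is a learned compressor; the hand-written filler list below is only a stand-in, and `FILLER`, `toy_compress`, and `cosine` are hypothetical names. The point of the sketch is that removing low-content tokens can raise the bag-of-words similarity between a query and the segment that answers it.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked filler list, NOT how LLMLingua-2 selects tokens.
FILLER = {"well", "you", "know", "i", "think", "the", "of", "is", "like", "a", "so", "um"}


def tokens(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_compress(text: str) -> str:
    """Drop filler tokens and keep the remaining words in order."""
    return " ".join(t for t in tokens(text) if t not in FILLER)


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


segment = "Well, you know, I think the probability of the coin is, like, one half"
query = "coin probability"
before = cosine(query, segment)
after = cosine(query, toy_compress(segment))
```

Here `after` exceeds `before` because the filler words inflate the segment's vector norm without contributing to the match. Dense retrievers behave differently from bag-of-words vectors, so this is an intuition aid rather than evidence.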

Retriever

Find relevant compressed segments for the current query

Model or implementation: Contriever or BM25
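For the BM25 option, ranking memory units against a query can be sketched in pure Python. This is a minimal Okapi BM25 sketch with a common smoothed IDF variant; the tokenizer, the default `k1` and `b` values, and the function name `bm25_rank` are assumptions, and the paper's actual retriever configuration may differ.

```python
import math
import re
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[int]:
    """Return document indices sorted by descending BM25 score."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n if n else 0.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in doc_tokens for term in set(toks))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Passing compressed segments as `docs` and taking the first N indices gives the retrieved memory for a query.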

Generator

Generate response using retrieved context

Model or implementation: GPT-3.5-turbo-0125 or GPT-4-turbo

Novel Architectural Elements

Pipeline structure that defines 'Memory Unit' strictly at the 'Topical Segment' level rather than turn or session
Integration of prompt compression (LLMLingua-2) specifically as a *retrieval denoiser* rather than just a context window saver

Modeling

Base Model: GPT-4 (for segmentation and generation), Mistral-7B (alternative segmenter)

Training Method: Prompt Optimization via Self-Reflection (In-context learning / Prompt Engineering)

Objective Functions:

Purpose: Optimize the segmentation prompt to minimize segmentation error.

Formally: the iterative update is $G_{m+1} = G_m - \eta \nabla \mathcal{L}(G_m)$, where the gradient is approximated by LLM reflection on examples with high segmentation error (measured by WindowDiff).
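The update rule is a gradient-descent analogy in prompt space. A rough sketch of such a loop follows; `error_fn` (for example, WindowDiff of the segmentation produced under a prompt) and `reflect_fn` (an LLM call that rewrites the prompt after seeing hard examples) are hypothetical callables, and accepting a revision only when the mean error drops is an assumption of this sketch rather than a detail confirmed by the summary.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple, TypeVar

Ex = TypeVar("Ex")


def refine_prompt(
    prompt: str,
    examples: Sequence[Ex],
    error_fn: Callable[[str, Ex], float],
    reflect_fn: Callable[[str, Sequence[Ex]], str],
    rounds: int = 3,
    top_k: int = 2,
) -> Tuple[str, float]:
    """Iteratively revise a prompt using reflection on high-error examples."""

    def avg_error(p: str) -> float:
        return mean([error_fn(p, ex) for ex in examples])

    best, best_err = prompt, avg_error(prompt)
    for _ in range(rounds):
        # The "gradient" signal: the examples the current prompt handles worst.
        hard = sorted(examples, key=lambda ex: error_fn(best, ex), reverse=True)[:top_k]
        candidate = reflect_fn(best, hard)
        cand_err = avg_error(candidate)
        # Assumption: keep a revision only if it lowers the mean error.
        if cand_err < best_err:
            best, best_err = candidate, cand_err
    return best, best_err
```

In this reading, `reflect_fn` plays the role of the gradient step and `top_k` controls how many failure cases the reflecting model sees per round.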

Training Data:

Used DialSeg711, TIAGE, and SuperDialSeg for segmentation evaluation
Used LOCOMO and Long-MT-Bench+ for RAG evaluation

Key Hyperparameters:

retrieval_budget_N: Varied (e.g., top-1, top-3)
compression_rate: Analyzed at various levels (e.g., 20% to 80%)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemoryBank: SeCom captures full topical context (segments) rather than fragmented turns
vs. Session-level: SeCom avoids retrieving irrelevant topics contained within the same session
vs. Summarization: SeCom avoids information loss inherent in summarization by using segmentation + compression
vs. MemWalker [not cited in paper]: MemWalker navigates a memory tree; SeCom uses flat segmentation and dense retrieval
vs. SILO [not cited in paper]: SILO manages long context via strict windowing; SeCom dynamically retrieves past segments

Limitations

Relies on the performance of the segmentation model; poor segmentation degrades downstream retrieval
Compression might remove subtle but necessary details if the rate is too high
Evaluation primarily focuses on QA performance, less on conversational flow or personality retention

Reproducibility

Code: https://github.com/microsoft/SeCom

The paper uses standard benchmarks (LOCOMO, DialSeg711) and commercial APIs (GPT-4), which supports relatively high reproducibility.

📊 Experiments & Results

Evaluation Setup

Long-context conversation QA and Conversation Segmentation

Benchmarks:

LOCOMO (Long-term conversation understanding (QA))
Long-MT-Bench+ (Long-term conversation chat)
DialSeg711 (Conversation Segmentation)
TIAGE (Conversation Segmentation)

Metrics:

Rouge-L
WindowDiff (WD)
Pk
Hit@1 (Retrieval Recall)
Statistical methodology: Not explicitly reported in the paper
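Hit@K, as used for retrieval recall here, can be computed as the fraction of queries for which at least one gold memory unit appears among the top K retrieved units. The sketch below assumes each query comes with a ranked list of retrieved ids and a set of gold ids; the function name and data layout are illustrative.

```python
from typing import Sequence, Set


def hit_at_k(ranked_ids: Sequence[Sequence[int]], gold_ids: Sequence[Set[int]], k: int = 1) -> float:
    """Fraction of queries whose top-k retrieved ids include a gold id."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have one entry per query")
    if not ranked_ids:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold & set(ranked[:k]))
    return hits / len(ranked_ids)
```

For two queries with rankings `[[3, 1], [2, 0]]` and gold sets `[{3}, {0}]`, the first query is a hit at K=1 and the second only at K=2.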

Key Results

SeCom demonstrates superior segmentation performance compared to existing unsupervised and supervised baselines.

Benchmark	Metric	Baseline	This Paper	Δ
DialSeg711	WindowDiff (lower is better)	0.468	0.222	-0.246
DialSeg711	WindowDiff (lower is better)	0.252	0.222	-0.030

Downstream QA performance on LOCOMO shows SeCom's advantage over other memory granularities.

Benchmark	Metric	Baseline	This Paper	Δ
LOCOMO	Rouge-L	20.1	22.3	+2.2
LOCOMO	Rouge-L	19.8	22.3	+2.5
LOCOMO	Rouge-L	20.5	22.3	+1.8

Experiment Figures

Impact of compression on retrieval recall (Recall@K) and similarity scores

Main Takeaways

Granularity matters: Segment-level memory consistently outperforms turn-level (too fragmented) and session-level (too noisy) memory in retrieval tasks.
Compression as Denoising: Removing redundancy via LLMLingua-2 improves retrieval recall (Hit@1) by sharpening the semantic match between query and document, contrary to the intuition that compression loses information.
Self-Reflection Works: The iterative self-reflection method allows the segmentation prompt to improve significantly, surpassing supervised baselines like CSeg on DialSeg711.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Prompt Compression / Token Reduction
Text Segmentation metrics (WindowDiff, Pk)

Key Terms

SeCom: The proposed method, which segments and compresses conversations to build memory

WindowDiff: A metric for evaluating segmentation accuracy; it slides a fixed-size window over the text and measures how often the number of boundaries inside the window differs between the hypothesis and the reference (lower is better)
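A minimal sketch of WindowDiff, assuming segmentations are encoded as equal-length 0/1 lists in which a 1 at index i marks a boundary after unit i. The window convention (windows of k consecutive boundary slots, unweighted) is one common formulation; the paper's exact evaluation script may differ.

```python
from typing import Sequence


def window_diff(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of length-k windows whose boundary counts differ between ref and hyp."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    if not 0 < k <= len(ref):
        raise ValueError("k must be in 1..len(ref)")
    n_windows = len(ref) - k + 1
    errors = sum(1 for i in range(n_windows) if sum(ref[i:i + k]) != sum(hyp[i:i + k]))
    return errors / n_windows


ref = [0, 1, 0, 0, 1, 0]
hyp = [0, 1, 0, 0, 0, 0]  # misses the second boundary
score = window_diff(ref, hyp, k=2)
```

With k=2 there are five windows, and the two that cover the missed boundary disagree, so `score` is 0.4. The window size k is conventionally set to about half the mean reference segment length.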

LLMLingua-2: A prompt compression method used here to remove redundant tokens from memory units before retrieval

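Pk: A segmentation evaluation metric representing the probability that two sentences k units apart are classified inconsistently by the hypothesis and the reference, i.e., placed in the same segment by one and in different segments by the other (lower is better)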
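A minimal sketch of Pk under the same kind of encoding (equal-length 0/1 lists, 1 marking a boundary after a unit). Each window only asks whether any boundary falls inside it, which is one common boundary-indicator formulation; the function name and encoding are assumptions for illustration.

```python
from typing import Sequence


def pk(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of length-k windows where ref and hyp disagree on 'any boundary inside?'."""
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    if not 0 < k <= len(ref):
        raise ValueError("k must be in 1..len(ref)")
    n_windows = len(ref) - k + 1
    errors = sum(
        1 for i in range(n_windows)
        if (sum(ref[i:i + k]) > 0) != (sum(hyp[i:i + k]) > 0)
    )
    return errors / n_windows


ref = [0, 1, 0, 0, 0, 0]
hyp = [1, 1, 0, 0, 0, 0]  # one extra boundary right next to a true one
near_miss = pk(ref, hyp, k=2)
```

Here `near_miss` is 0.0: every window that contains the extra boundary also contains the true one, so Pk does not register the error. A count-based comparison would flag the first window (two boundaries versus one), which is the usual motivation for reporting WindowDiff alongside Pk.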

RHO: Correlation coefficient used to measure agreement between metrics

Rouge-L: A metric that scores the overlap between generated text and reference text based on their longest common subsequence
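A minimal sketch of a sentence-level Rouge-L F-measure, assuming lowercased whitespace tokenization and beta = 1. Published evaluation scripts usually add their own tokenization and stemming, so scores from this sketch will not match them exactly.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """F-measure over LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


example = rouge_l("the cat sat on the mat", "the cat is on the mat")
```

In the example the longest common subsequence is "the cat on the mat" (5 tokens out of 6 on each side), so precision, recall, and the F-measure are all 5/6.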