SGMem: Sentence graph memory for long-term conversational agents

📝 Paper Summary

Memory recall, Memory organization

SGMem organizes dialogue history as a sentence-level graph to retrieve coherent context across turns, rounds, and sessions without relying on expensive LLM-based entity extraction.

Core Problem

Existing long-term memory methods suffer from fragmentation: relevant information is dispersed across coarse-grained units (turns, sessions) and generated summaries, which makes precise, coherent context hard to retrieve.

Why it matters:

Coarse retrieval (whole sessions/turns) includes irrelevant noise that distracts the LLM
Generated memories (summaries/facts) often lose fine-grained details needed for specific questions
Entity-centric graph methods are computationally expensive and discard non-entity contextual information

Concrete Example: If a user asks about a specific detail mentioned in a long conversation, a session-level retriever might pull the entire 50-turn session (too much noise), while a summary-based retriever might miss the specific detail entirely because it was compressed out.

Key Novelty

Sentence Graph Memory (SGMem)

Decomposes dialogue into sentences (atomic units) and links them via semantic similarity edges, creating a graph that connects related statements across different timeframes
Jointly indexes and retrieves raw dialogue sentences alongside generated memories (summaries, facts, insights) to combine precision with high-level understanding
Uses a lightweight graph construction (NLTK segmentation + embedding similarity) rather than expensive LLM-based entity-relation extraction

Architecture

Overview of SGMem framework, split into Construction & Management (left) and Usage (right)

Evaluation Highlights

Outperforms strong baselines (including LightRAG and MemoryScope) on LongMemEval and LoCoMo benchmarks in accuracy
Demonstrates consistent accuracy gains across single-hop, multi-hop, and temporal reasoning question types compared to turn/round/session-based retrieval
Ablation studies show that integrating all memory types (sentences + summaries + facts + insights) yields the highest performance

Breakthrough Assessment

7/10

Offers a practical, lightweight alternative to complex entity-graph memories by using sentence graphs. It is not a fundamental architectural shift in LLMs, but the paper reports consistent accuracy gains for retrieval-augmented QA over long conversations.

⚙️ Technical Details

Problem Definition

Setting: Long-term conversational question answering (QA) over a sequence of sessions containing multiple turns and rounds

Inputs: A query q and a history of sessions S = {s1, ... sU}

Outputs: A generated response ŷ conditioned on relevant retrieved context

Pipeline Flow

Group: Memory Construction: Processing Conversations → Indexing → Graph Construction → Storage
Group: Memory Usage: Retrieve Memory/Sentences → Rank Chunks via Graph → Collect Context → Personalized Generation

System Modules

Conversation Processor (Memory Construction)

Decompose sessions into rounds, turns, and sentences; generate summaries, facts, and insights via LLM

Model or implementation: NLTK (for segmentation) + LLM (for generation)
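
The decomposition step can be illustrated with a minimal sketch. It assumes a naive regex splitter as a stand-in for NLTK's sentence tokenizer, treats a session as a plain list of turn texts, and omits rounds and LLM-generated memories. The names `SentenceNode`, `split_sentences`, and `decompose_session` are hypothetical and do not come from the paper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceNode:
    """One sentence plus the ids of the chunks it came from."""
    session_id: str
    turn_id: int
    sentence_id: int
    text: str


def split_sentences(text: str) -> list[str]:
    # Assumption: stand-in for NLTK sentence segmentation. Splits on whitespace
    # that follows '.', '!' or '?'. It mishandles abbreviations such as "Dr.".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def decompose_session(session_id: str, turns: list[str]) -> list[SentenceNode]:
    """Flatten a session (a list of turn texts) into sentence nodes with parent ids."""
    nodes = []
    for turn_id, turn_text in enumerate(turns):
        for sentence_id, sentence in enumerate(split_sentences(turn_text)):
            nodes.append(SentenceNode(session_id, turn_id, sentence_id, sentence))
    return nodes


if __name__ == "__main__":
    turns = [
        "I adopted a dog last week. Her name is Luna!",
        "That is great news. What breed is she?",
    ]
    for node in decompose_session("s1", turns):
        print(node)
```

Keeping the parent ids on every sentence is what later allows a retrieved sentence to be mapped back to its turn or session.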

Indexer (Memory Construction)

Encode all memory units into vectors for similarity search

Model or implementation: Sentence-BERT
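
A rough sketch of the indexing idea follows. It assumes a bag-of-words count vector in place of Sentence-BERT embeddings and an in-memory list in place of a vector store, so similarity here reflects word overlap only. `embed`, `cosine`, and `VectorIndex` are hypothetical names used for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    # Assumption: stand-in for a Sentence-BERT encoder (sparse word counts).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorIndex:
    """In-memory index table for one type of memory unit (hypothetical helper)."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def add(self, unit_id: str, text: str) -> None:
        self.items.append((unit_id, embed(text)))

    def search(self, query: str, top_k: int = 3) -> list[tuple[str, float]]:
        """Return the top_k (unit_id, similarity) pairs, most similar first."""
        q = embed(query)
        scored = [(unit_id, cosine(q, vec)) for unit_id, vec in self.items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    sentences = VectorIndex()
    sentences.add("sent-1", "I adopted a dog named Luna")
    sentences.add("sent-2", "Paris was rainy in March")
    print(sentences.search("what is my dog called", top_k=1))
```

One such index per memory type (sentences, turns, summaries, and so on) would mirror the multiple index tables described in this summary.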

Graph Constructor (Memory Construction)

Build the sentence graph by linking chunks to sentences and creating sentence-sentence similarity edges

Model or implementation: k-Nearest Neighbor (KNN) algorithm
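
The sentence-sentence edges can be sketched as a thresholded KNN graph. This is a minimal illustration under stated assumptions: dense vectors are given as plain lists, `threshold` is taken to be a minimum cosine similarity, and edges are stored in both directions. The function name and the symmetric storage are choices made for this example, not details confirmed by the paper.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_knn_graph(
    vectors: dict[str, list[float]], k: int = 2, threshold: float = 0.5
) -> dict[str, set[str]]:
    """Link each sentence to its k most similar peers with similarity >= threshold."""
    graph: dict[str, set[str]] = {node: set() for node in vectors}
    for node, vec in vectors.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in vectors.items()
            if other != node
        ]
        scored.sort(reverse=True)  # highest similarity first
        for score, other in scored[:k]:
            if score >= threshold:
                # Assumption: edges are undirected, so store both directions.
                graph[node].add(other)
                graph[other].add(node)
    return graph


if __name__ == "__main__":
    vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(build_knn_graph(vecs, k=1, threshold=0.5))
```

This brute-force version compares every pair of sentences, which is quadratic in the number of sentences; a real system would more likely rely on an approximate nearest-neighbor index.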

Retriever & Ranker (Memory Usage)

Retrieve initial candidates via vector search, expand via graph traversal, and rank parent chunks

Model or implementation: ElasticSearch (vector) + Neo4j (graph traversal)
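
The graph-expansion step can be sketched as a bounded breadth-first search. The example assumes the seed sentences have already been returned by vector search and uses a plain adjacency dictionary where the described system uses a graph database. `expand_via_graph` and `max_hops` are hypothetical names.

```python
from collections import deque


def expand_via_graph(
    seeds: list[str], graph: dict[str, set[str]], max_hops: int = 1
) -> dict[str, int]:
    """Return every sentence within max_hops of a seed, mapped to its hop distance."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


if __name__ == "__main__":
    graph = {"s1": {"s2"}, "s2": {"s1", "s3"}, "s3": {"s2"}, "s4": set()}
    print(expand_via_graph(["s1"], graph, max_hops=2))
```

The hop distance is returned alongside each sentence so that a downstream ranker could, for example, weight directly retrieved sentences above those reached through neighbors.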

Generator (Memory Usage)

Generate the final response using the aggregated context

Model or implementation: LLM (base model not specified in the provided text)

Novel Architectural Elements

Dual-storage architecture combining 7 vector index tables with a sentence-level KNN graph for multi-hop retrieval
Hierarchical mapping mechanism that retrieves fine-grained sentences, expands via graph, and aggregates back to parent chunks (turn/round/session) for context
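
The mapping from sentences back to parent chunks can be sketched as a simple score aggregation. Summing sentence scores per chunk is an assumption made for illustration; this summary does not state the scoring rule the paper actually uses. `rank_parent_chunks`, `parent_of`, and `top_n` are hypothetical names.

```python
from collections import defaultdict


def rank_parent_chunks(
    sentence_scores: dict[str, float], parent_of: dict[str, str], top_n: int = 2
) -> list[tuple[str, float]]:
    """Sum sentence scores per parent chunk and return the top_n chunks."""
    totals: dict[str, float] = defaultdict(float)
    for sentence_id, score in sentence_scores.items():
        # Assumption: a chunk's relevance is the sum of its sentences' scores.
        totals[parent_of[sentence_id]] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    scores = {"s1": 0.9, "s2": 0.5, "s3": 0.7}
    parents = {"s1": "turn-A", "s2": "turn-A", "s3": "turn-B"}
    print(rank_parent_chunks(scores, parents, top_n=2))
```

A sum favors chunks with several matching sentences; taking the maximum instead would favor chunks with one very strong match.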

Modeling

Base Model: Sentence-BERT (for embeddings); the generation LLM is not specified in the provided text

Training Method: Inference-only RAG framework (no training of the LLM reported)

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemoryBank/LD-Agent: SGMem uses fine-grained sentence graphs rather than just hierarchical summaries, preventing detail loss
vs. LightRAG: SGMem builds graphs on sentences (semantic units) rather than extracted entities, avoiding the cost and errors of LLM-based entity extraction
vs. GraphRAG: SGMem focuses on retrieving sequential dialogue context rather than abstractive community summaries [not cited in paper]

Limitations

Dependency on the quality of the embedding model (Sentence-BERT) for graph construction
Retrieval latency may increase with graph size due to multi-hop traversal
No specific handling of conflicting information mentioned (e.g., if a new sentence contradicts an old one)
Performance depends on the accuracy of the initial generated memories (summaries/facts) produced by the LLM

Reproducibility

The paper text does not state whether code is available. The method relies on standard tools (NLTK, Sentence-BERT, ElasticSearch, Neo4j), which aids replication, but exact reproduction requires the graph-construction hyperparameters (k and the threshold gamma) and the prompts used to generate memories.

📊 Experiments & Results

Evaluation Setup

Long-term conversational QA on two benchmarks

Benchmarks:

LongMemEval: personal assistant memory skills (extraction, reasoning, updates)
LoCoMo: very long persona-grounded conversations (up to 35 sessions)

Metrics:

Accuracy (determined by LLM-as-a-Judge)
Statistical methodology: Not explicitly reported in the paper
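
The accuracy metric can be sketched as the fraction of answers a judge marks correct. In this illustration the judge is a substring check standing in for an LLM call, and the example records are invented. `judged_accuracy` and `toy_judge` are hypothetical names, and the paper's judge prompt is not reproduced here.

```python
from typing import Callable


def judged_accuracy(
    examples: list[dict], judge: Callable[[str, str, str], bool]
) -> float:
    """Fraction of examples whose prediction the judge marks correct."""
    if not examples:
        return 0.0
    correct = sum(
        judge(ex["question"], ex["reference"], ex["prediction"]) for ex in examples
    )
    return correct / len(examples)


def toy_judge(question: str, reference: str, prediction: str) -> bool:
    # Assumption: placeholder for an LLM judge. A prediction counts as correct
    # if it contains the reference answer, ignoring case.
    return reference.lower() in prediction.lower()


if __name__ == "__main__":
    examples = [
        {"question": "What is my dog's name?", "reference": "Luna",
         "prediction": "Your dog is called Luna."},
        {"question": "Where did I travel?", "reference": "Paris",
         "prediction": "You went to Rome."},
    ]
    print(judged_accuracy(examples, toy_judge))
```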

Key Results

SGMem is reported to outperform the baselines on both the LongMemEval and LoCoMo benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
LongMemEval	Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper
LoCoMo	Accuracy	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

SGMem consistently outperforms strong baselines (including LightRAG and MemoryScope) in long-term conversational QA tasks
Using fine-grained sentence retrieval combined with generated memory (summaries, facts, insights) yields better results than using coarse granularity (sessions/rounds) alone
The sentence graph structure effectively mitigates memory fragmentation by linking related concepts across different dialogue sessions

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) pipelines
Vector databases and dense retrieval
Graph data structures (nodes, edges, traversal)

Key Terms

memory fragmentation: The dispersion of relevant information across different storage formats (raw logs vs. summaries), making it hard to retrieve a complete picture

turn: A single exchange (user input + assistant response) in a dialogue

round: A grouping of turns representing a coherent exchange or topic within a session

generated memory: Information synthesized by an LLM from raw logs, such as summaries, extracted facts, or reflective insights

KNN graph: k-Nearest Neighbors graph, where nodes are connected to their k most similar peers based on vector embedding similarity

NLTK: Natural Language Toolkit—a standard Python library used here for splitting text into sentences without using a heavy neural model

Sentence-BERT: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then conditions the LLM's answer on them

LLM-as-a-Judge: An evaluation method where a strong LLM (like GPT-4) is prompted to score the accuracy of a model's response against a reference answer