EMG-RAG: Crafting Personalized Agents through RAG on Editable Memory Graphs

📝 Paper Summary

Tree/graph-based memory Layered memory Modularized RAG pipeline

EMG-RAG combines an editable memory graph with a reinforcement learning agent to adaptively select and update personal user memories for smartphone assistants.

Core Problem

Existing personalized agents struggle to handle dynamic smartphone data that requires frequent editing (insertion, deletion, replacement) and the selection of complex memory combinations for accurate retrieval.

Why it matters:

Personal data on devices is dynamic; static databases cannot handle time-sensitive deletions (e.g., expired vouchers) or updates (e.g., flight rescheduling).
Standard top-k retrieval often fails when a query requires aggregating multiple distinct memories (e.g., flight number + time + passenger), because it ranks each memory by semantic similarity alone.
Current 'Needle in a Haystack' approaches overwhelm LLM context windows with irrelevant noise, degrading performance on specific personal queries.

Concrete Example: A user asks about a 'secretary's boss's flight.' Standard similarity-based retrieval might miss the connection between the relevant memories. EMG-RAG links three of them: (1) Secretary booked flight to Amsterdam, (2) Flight is EK349, (3) EK349 departs 01:40, and retrieves them together as one combination.
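
The chain in this example can be pictured as a small linked structure. The sketch below is illustrative only: the dictionary layout, the node ids, the unrelated voucher memory, and the `collect_chain` helper are assumptions made for this summary, not the paper's data model or traversal policy.

```python
from collections import deque

# Hypothetical memory nodes; ids and the "links" field are illustrative.
memories = {
    "m1": {"text": "Secretary booked flight to Amsterdam", "links": ["m2"]},
    "m2": {"text": "Flight is EK349", "links": ["m3"]},
    "m3": {"text": "EK349 departs 01:40", "links": []},
    "m4": {"text": "Coffee voucher expires Friday", "links": []},  # unrelated memory
}

def collect_chain(start, graph):
    """Breadth-first walk from one starting node, returning the linked memory texts."""
    seen, queue, out = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        out.append(graph[node]["text"])
        for nxt in graph[node]["links"]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return out

chain = collect_chain("m1", memories)
```

Starting from the booking memory, the walk gathers the flight number and departure time while leaving the unrelated voucher memory out of the context.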

Key Novelty

Editable Memory Graph (EMG) with RL-driven Traversal

Organizes memories into a three-layer structure (Type, Subclass, Graph) that supports efficient partition-based editing (insert, delete, replace) of dynamic personal data.
Replaces standard vector retrieval with a Reinforcement Learning agent that traverses the memory graph, learning to select the optimal combination of nodes to maximize answer quality.

Architecture

The complete EMG-RAG pipeline including Graph Construction, Editing, Retrieval via MDP, and Downstream Applications.

Evaluation Highlights

Outperforms M-RAG baseline by ~10.6% in R-L score for question answering after 4 weeks of continuous memory edits.
Achieves 96.99% Exact Match in autofill forms after 4 weeks, surpassing the best baseline by ~9.5%.
Online A/B testing showed a 4.5% improvement in Question Answering quality over the previous system.

Breakthrough Assessment

7/10

Strong practical application of graph-based memory for dynamic personalization. The handling of continuous edits is a significant improvement over static RAG, though the core RL method is relatively standard.

⚙️ Technical Details

Problem Definition

Setting: Personalized agent task on smartphone data requiring Editability (insert/delete/replace) and Selectability (complex retrieval).

Inputs: User query Q and a dynamic stream of personal memories M (conversations, screenshots).

Outputs: Generated answer A grounded in selected memories.

Pipeline Flow

Data Collection & Graph Construction (GPT-4 parses raw data into EMG)
Graph Editing (Insert/Delete/Replace memories based on new data)
Graph Traversal (RL agent selects memories starting from activated nodes)
Generation (LLM produces answer using selected memories)

System Modules

Editable Memory Graph (EMG)

Stores memories in a 3-layer hierarchy (Type -> Subclass -> Graph) to facilitate partition-based editing.

Model or implementation: TransE for node/class alignment
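
A minimal sketch of partition-based editing, assuming each memory is addressed explicitly by type and subclass. In the paper the matching partition is located via TransE alignment; here the caller supplies it directly, and the class name, method names, and example type/subclass labels are all hypothetical.

```python
from collections import defaultdict

class EditableMemoryGraph:
    """Toy three-layer store: memory type -> subclass -> {memory_id: text}."""

    def __init__(self):
        self.layers = defaultdict(lambda: defaultdict(dict))

    def insert(self, mem_type, subclass, mem_id, text):
        self.layers[mem_type][subclass][mem_id] = text

    def delete(self, mem_type, subclass, mem_id):
        # Only the addressed partition is touched; returns the removed text or None.
        return self.layers[mem_type][subclass].pop(mem_id, None)

    def replace(self, mem_type, subclass, mem_id, text):
        partition = self.layers[mem_type][subclass]
        if mem_id not in partition:
            raise KeyError(mem_id)
        partition[mem_id] = text

    def partition(self, mem_type, subclass):
        """Return a copy of one partition's memories."""
        return dict(self.layers[mem_type][subclass])

emg = EditableMemoryGraph()
emg.insert("event", "flight", "f1", "EK349 departs 01:40")
emg.insert("asset", "voucher", "v1", "Coffee voucher")
emg.replace("event", "flight", "f1", "EK349 departs 03:10")  # flight rescheduled
emg.delete("asset", "voucher", "v1")                          # voucher expired
```

Because every edit is scoped to one (type, subclass) partition, an update to a flight never has to scan voucher memories, which is the intuition behind partition-based editing.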

Node Activator (Retrieval & Selection)

Identifies starting points for graph traversal to avoid searching the entire graph.

Model or implementation: CPT-Text embeddings
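
Node activation can be sketched as a top-K similarity search over node embeddings. The snippet assumes cosine similarity and uses tiny hand-written vectors in place of real CPT-Text embeddings; `activate_nodes` and the node ids are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def activate_nodes(query_vec, node_vecs, k=3):
    """Return the ids of the k nodes most similar to the query embedding."""
    ranked = sorted(node_vecs,
                    key=lambda nid: cosine(query_vec, node_vecs[nid]),
                    reverse=True)
    return ranked[:k]

# Stand-in 2-d embeddings; a real system would use the embedding model's output.
node_vecs = {
    "flight": [0.9, 0.1],
    "voucher": [0.1, 0.9],
    "hotel": [0.7, 0.3],
    "gym": [0.0, 1.0],
}
activated = activate_nodes([1.0, 0.0], node_vecs, k=2)
```

The activated nodes serve only as entry points; the selection agent decides which neighboring memories to keep.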

RL Selection Agent (Retrieval & Selection)

Traverses the graph from activated nodes to select the optimal set of memories.

Model or implementation: 2-layer NN (Hidden=20, tanh)
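
A minimal sketch of a network with this shape, written in plain Python. The hidden size of 20 and the tanh activation come from the summary above; the sigmoid output, the binary include/skip action, the weight initialization, and the 4-dimensional state are assumptions for illustration.

```python
import math
import random

class SelectionPolicy:
    """Two-layer network: state -> 20 tanh units -> sigmoid include-probability."""

    def __init__(self, input_dim, hidden_dim=20, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def include_prob(self, state):
        """Probability of selecting the node described by `state`."""
        hidden = [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))

    def act(self, state, rng):
        """Sample 1 (include the node) or 0 (skip it)."""
        return 1 if rng.random() < self.include_prob(state) else 0

# Hypothetical state features, e.g. similarities between query and node.
state = [0.2, 0.5, 0.1, 0.9]
policy = SelectionPolicy(input_dim=len(state))
p = policy.include_prob(state)
action = policy.act(state, random.Random(1))
```

Sampling the action, rather than thresholding the probability, is what lets a policy-gradient method explore different memory combinations during training.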

Generator

Generates the final response using the query and selected memories.

Model or implementation: GPT-4 / ChatGLM3-6B / PanGu-38B

Novel Architectural Elements

Three-layer hierarchy (Memory Type, Memory Subclass, Memory Graph) specifically designed for partitioned editing.
Integration of RL-based graph traversal directly on top of an editable personal memory graph.

Modeling

Base Model: GPT-4 (primary), ChatGLM3-6B, PanGu-38B

Training Method: Reinforcement Learning (REINFORCE algorithm)

Objective Functions:

Purpose: Pre-train the selection agent to recognize relevant memories.

Formally: Binary cross-entropy L_WS = -[y*log(P) + (1-y)*log(1-P)].

Purpose: Optimize the policy to maximize the quality of the final generated answer.

Formally: L_PG = -R_N * ln(pi_theta(a|s)), where R_N is the cumulative reward based on metric improvement (Delta ROUGE/BLEU).
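
The two objectives can be evaluated numerically as follows. This is a sketch under stated assumptions: the per-node form of the warm-start loss, the clipping constant `eps`, and the way the discount is applied to step-wise metric improvements in `cumulative_reward` are illustrative choices, not details confirmed by the paper.

```python
import math

def warm_start_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one node: y is the 0/1 label, p the predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def cumulative_reward(metric_scores, discount=0.99):
    """Discounted sum of step-to-step metric improvements (assumed reward shape)."""
    deltas = [b - a for a, b in zip(metric_scores, metric_scores[1:])]
    return sum((discount ** t) * d for t, d in enumerate(deltas))

def policy_gradient_loss(reward, action_prob):
    """REINFORCE surrogate for one action: -R_N * ln(pi_theta(a|s))."""
    return -reward * math.log(action_prob)

ws = warm_start_loss(1, 0.5)                               # equals ln(2)
r_n = cumulative_reward([0.5, 0.6, 0.8], discount=1.0)     # 0.1 + 0.2
pg = policy_gradient_loss(r_n, 0.5)
```

A positive reward makes the loss fall as the probability of the taken action rises, so gradient descent reinforces selections that improved the answer metric.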

Trainable Parameters: RL agent parameters (2-layer MLP)

Training Data:

2,000 users for training, 500 for testing (sampled from 11.35 billion raw logs)
GPT-4 generated QA pairs and 'required memories' labels for supervision

Key Hyperparameters:

learning_rate: 0.001
reward_discount: 0.99
K_activated_nodes: 3
warm_start_episodes: 1000
policy_gradient_episodes: 100
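
For anyone re-implementing the training loop, the listed values could be collected in a single configuration object. The `TrainingConfig` class and its field names are hypothetical; only the numeric values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the summary; field names are illustrative."""
    learning_rate: float = 0.001
    reward_discount: float = 0.99
    k_activated_nodes: int = 3
    warm_start_episodes: int = 1000
    policy_gradient_episodes: int = 100

config = TrainingConfig()
```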

Compute: Inference times for different K reported (1.35s to 3.32s). Training hardware not explicitly reported.

Comparison to Prior Work

vs. NiaH: EMG-RAG selects specific graph nodes rather than flooding context, reducing noise.
vs. M-RAG: EMG-RAG supports graph-based structural traversal and explicit editing operations (Insert/Delete/Replace), whereas M-RAG focuses on static database partitioning.
vs. Keqing: EMG-RAG is designed for personal memory editing and RL-based selection, rather than static KG decomposition.

Limitations

Training efficiency is low because it requires querying the LLM to calculate rewards during the RL process.
Relies on GPT-4 for data generation, introducing a cold-start distribution shift when moving to real user queries.
Evaluation is performed on proprietary business data, limiting direct reproducibility.
Inference time increases linearly with the parameter K (activated nodes).

Reproducibility

Code: https://github.com/zilliztech/GPTCache

Code for GPTCache is linked, but the specific EMG-RAG model code is not provided. The dataset is proprietary business data (11.35 billion raw logs) from a real AI assistant product and is not released. GPT-4 was used for data generation.

📊 Experiments & Results

Evaluation Setup

Real-world business dataset from an AI assistant (2000 train/500 test users). Tasks: QA, Autofill Forms, User Services.

Benchmarks:

Personalized QA (Proprietary) (Question Answering based on memory) [New]
Autofill Forms (Proprietary) (Entity extraction for form filling) [New]
User Services (Proprietary) (Reminder and Travel navigation services) [New]

Metrics:

ROUGE-1/2/L
BLEU
Exact Match (EM)
Statistical methodology: T-test with p < 0.05 reported for significance.
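
For reference, Exact Match and ROUGE-L can be computed as below. This is a simplified sketch: whitespace tokenization, lowercase normalization for Exact Match, and the F1 form of ROUGE-L are assumptions, and the paper's exact scoring configuration may differ.

```python
def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

em = exact_match("EK349 ", "ek349")
score = rouge_l_f1("EK349 departs at 01:40", "EK349 departs 01:40")
```

In the second call the prediction has one extra token, so recall is 1.0 while precision is 0.75, which pulls the F1 score below 1.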

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
EMG-RAG consistently outperforms baselines on standard QA tasks using GPT-4.
Personalized QA	BLEU	64.16	75.99	+11.83
Personalized QA	ROUGE-L	84.74	88.06	+3.32
In scenarios with continuous memory updates (weeks 1-4), EMG-RAG shows superior robustness due to its editable graph structure.
Personalized QA	ROUGE-L	86.39	96.93	+10.54
Autofill Forms	Exact Match	88.89	95.83	+6.94
Ablation studies confirm the necessity of both the Warm Start (WS) and Policy Gradient (PG) training stages.
Personalized QA	BLEU	65.65	75.99	+10.34
Personalized QA	BLEU	65.07	75.99	+10.92

Main Takeaways

Graph-based memory management handles dynamic data (insert/delete/replace) significantly better than static database approaches (M-RAG), maintaining high performance over weeks of edits.
RL-based selection effectively filters noise compared to 'Needle in a Haystack' approaches, which suffer from context window overload.
The approach generalizes well across different LLM backbones (GPT-4, ChatGLM, PanGu), consistently beating baselines.
Online learning (A/B test) further boosts performance by ~3-4%, mitigating the distribution shift between synthetic training data and real user queries.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Knowledge Graphs (Entity/Relation extraction)
Reinforcement Learning (MDP, Policy Gradient)
TransE embeddings

Key Terms

EMG: Editable Memory Graph—a three-layer data structure (Type, Subclass, Graph) designed to organize and efficiently edit personal user memories.

MDP: Markov Decision Process—a mathematical framework used here to model the agent's step-by-step traversal of the memory graph to select relevant nodes.

TransE: A method for embedding knowledge graphs that models relationships as translations in vector space, used here to link memory subclasses to entity nodes.

Top-K: A retrieval strategy selecting the K most similar items; here, it is used only to identify starting nodes (activated nodes) for the graph traversal.

Cold-start: The problem where a system lacks sufficient data to perform well initially; addressed here by pre-training on GPT-4 generated data before online fine-tuning.

REINFORCE: A specific policy gradient algorithm used to optimize the RL agent's memory selection policy based on the quality of the final generated answer.