Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics

📝 Paper Summary

Memory organization, Memory recall, Layered memory

This paper establishes a unified taxonomy for agent memory, categorizing it by representation (parametric vs. contextual) and defining six core operations, grouped into three stages (encoding, evolving, and adapting), to organize current research and benchmarks.

Core Problem

Existing surveys on LLM agent memory focus on high-level applications (like personalization) or specific subtopics (like long-context modeling) without a unified framework defining atomic operations or structural foundations.

Why it matters:

Lack of a unified framework fragments research, making it difficult to understand how different memory mechanisms (e.g., KV cache eviction vs. knowledge graph storage) relate or interact
Current literature overlooks the complete memory lifecycle, focusing often on retrieval while neglecting critical operations like consolidation, forgetting, and condensation
Developers lack structured guidance on selecting appropriate memory types (parametric vs. contextual) and operations for building robust, long-term capable agents

Concrete Example: Current benchmarks reveal a disconnect: models achieve >90 Recall@5 on retrieval tasks (e.g., 2Wiki) but lag by >30 points in generation metrics (F1), indicating that high retrievability does not guarantee effective memory utilization due to poor condensation or reasoning.
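To make the two sides of this gap concrete, here is a minimal sketch of how Recall@k and a token-overlap F1 are commonly computed. The function names, passage IDs, and example strings are hypothetical, and the benchmarks cited in the survey may apply extra answer normalization that is not shown here.

```python
from collections import Counter


def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: both supporting passages are retrieved,
# yet the verbose generated answer only partly matches the reference.
retrieved = ["p7", "p2", "p9", "p4", "p1"]
relevant = ["p2", "p4"]
recall = recall_at_k(retrieved, relevant, k=5)
f1 = token_f1("the film was directed by Jane Doe", "Jane Doe")
```

In this toy case retrieval is perfect while the answer-level F1 is well below 1, which mirrors the kind of mismatch the example above describes.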

Key Novelty

Operational Taxonomy for Agent Memory

Formalizes memory into two representations: Parametric (implicit in weights) and Contextual (explicit external data), bridging the gap between model fine-tuning and RAG
Defines six atomic operations governing the memory lifecycle: Consolidation (writing), Indexing (organizing), Updating (modifying), Forgetting (removing), Retrieval (accessing), and Condensation (compressing)
Introduces the Relative Citation Index (RCI) to analyze research trends, normalizing citation counts by publication age to identify emerging high-impact topics like KV cache optimization
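The six operations listed above can be pictured as methods on a single store. The sketch below is only an illustration for contextual memory: the class `ToyContextualMemory`, its method names, and the tag-based index are assumptions made for this example and do not come from the survey. Condensation is reduced to naive truncation, where a real system would more likely summarize or compress.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    key: str
    text: str
    tags: set = field(default_factory=set)


class ToyContextualMemory:
    """Hypothetical in-process store; each method mirrors one operation."""

    def __init__(self):
        self.items = {}   # key -> MemoryItem
        self.index = {}   # tag -> set of keys

    def consolidate(self, key, text):
        """Consolidation: write a new observation into persistent storage."""
        self.items[key] = MemoryItem(key, text)

    def index_item(self, key, tags):
        """Indexing: file a stored item under tags for later lookup."""
        item = self.items[key]
        item.tags |= set(tags)
        for tag in tags:
            self.index.setdefault(tag, set()).add(key)

    def update(self, key, new_text):
        """Updating: modify the content of an existing item."""
        self.items[key].text = new_text

    def forget(self, key):
        """Forgetting: remove an item together with its index entries."""
        item = self.items.pop(key)
        for tag in item.tags:
            self.index[tag].discard(key)

    def retrieve(self, tag):
        """Retrieval: access all items filed under a tag."""
        return [self.items[k] for k in sorted(self.index.get(tag, set()))]

    def condense(self, tag, max_chars=80):
        """Condensation: squeeze retrieved items into a short context string."""
        joined = " ".join(item.text for item in self.retrieve(tag))
        return joined[:max_chars]


memory = ToyContextualMemory()
memory.consolidate("m1", "User prefers vegetarian food.")
memory.index_item("m1", ["diet"])
memory.consolidate("m2", "User is allergic to peanuts.")
memory.index_item("m2", ["diet"])
memory.update("m1", "User prefers vegan food.")
memory.forget("m2")
context = memory.condense("diet")
```

Keeping the operations separate makes it easier to see which ones a given benchmark actually exercises; a static QA benchmark, for instance, would only touch `retrieve` and `condense` in this sketch.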

Architecture

A unified framework of memory in LLM-based agents, mapping Taxonomy (Structured/Unstructured, Parametric/Contextual), Operations (Encoding, Evolving, Adapting), and High-Impact Topics.

Evaluation Highlights

Benchmark analysis in the survey, which draws on a corpus of >30,000 papers, reveals a gap between retrieval and generation: retrieval recall is often >90% while generation F1 scores drop to ~60% on benchmarks like LoCoMo and MemoryBank
Identifies that long-term memory benchmarks (e.g., LoCoMo) span 20-30 turns but largely ignore dynamic operations like forgetting or updating, focusing instead on static QA
Demonstrates via RCI analysis that 'KV cache eviction' and 'context compression' are rapidly growing high-impact topics within the long-context memory domain

Breakthrough Assessment

9/10

Provides a highly necessary, comprehensive framework that unifies disparate memory research (RAG, long-context, model editing) into a single operational taxonomy, significantly clarifying the field.

⚙️ Technical Details

Problem Definition

Setting: Systematic survey and taxonomy construction for Memory in LLM-based Agents

Inputs: Corpus of over 30,000 papers from top venues (NeurIPS, ICLR, ACL, etc.) published between 2022-2025

Outputs: A unified taxonomy of memory types, functional categories, and six core operations, plus a curated set of 3,923 high-relevance papers

Pipeline Flow

Memory Representation (Parametric vs. Contextual)
Memory Operations (Encoding → Evolving → Adapting)
High-Impact Topics Application

System Modules

Memory Encoding (Operations)

Transform information into storable representations

Model or implementation: N/A (Conceptual Framework)

Memory Evolving (Operations)

Dynamically change stored information over time

Model or implementation: N/A (Conceptual Framework)

Memory Adapting (Operations)

Access and utilize memory during inference

Model or implementation: N/A (Conceptual Framework)

Novel Architectural Elements

Unified operational framework defining 'Condensation' and 'Forgetting' as first-class citizens alongside Retrieval and Storage
Formal classification of KV Cache Eviction as a specific sub-type of Contextual Memory operations

Comparison to Prior Work

vs. Zhang et al. [367]: Defines atomic operations (Indexing, Forgetting) missing in prior work and splits memory into parametric/contextual types
vs. RAG Surveys: Integrates 'Parametric Memory Modification' (editing/unlearning) as a parallel to RAG, offering a holistic view of agent knowledge
vs. Long-Context Surveys: Explicitly categorizes KV cache management as a short-term memory operation within a larger agentic framework

Limitations

The survey's empirical analysis is limited to the provided datasets and may not cover every proprietary industrial system
The definition of 'forgetting' in agents is still nascent and lacks standardized benchmarks compared to retrieval
Evaluation of memory operations like consolidation and updating is sparse in current benchmarks, which heavily favor static retrieval QA

Reproducibility

Code: https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI

Publicly available: The authors released the list of papers, datasets, and tools at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI. The survey methodology (RCI metric, GPT-4o-mini filtering) is described in detail.

📊 Experiments & Results

Evaluation Setup

Meta-analysis of existing literature and benchmarks rather than a single model evaluation

Benchmarks:

LoCoMo (Long-Context/Long-Term Memory Dialogue)
MemoryBank (Memory Updating and Retrieval)
LongBench (Long-Context Understanding)

Metrics:

Relative Citation Index (RCI)
Recall@k (Retrieval)
F1 / BLEU / ROUGE-L (Generation)
Statistical methodology: Log-log regression model used for RCI calculation with R^2=0.97 fit
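One plausible reading of an age-normalized citation index with a log-log fit is sketched below: regress log citations on log publication age, then divide each paper's citations by the value the fit predicts for its age. The exact formula used in the survey is not given in this summary, so the `+ 1` smoothing, the age unit, and both function names are assumptions, and the data are synthetic.

```python
import math


def fit_log_log(ages, citations):
    """Least-squares fit of log(citations + 1) = intercept + slope * log(age).

    Ages must be strictly positive (for example, years since publication).
    """
    xs = [math.log(age) for age in ages]
    ys = [math.log(c + 1) for c in citations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def relative_citation_index(age, citations, intercept, slope):
    """Ratio of observed to age-expected (smoothed) citation count."""
    expected = math.exp(intercept + slope * math.log(age))
    return (citations + 1) / expected


# Synthetic corpus in which citations + 1 equals exactly 10 * age.
ages = [1, 2, 4, 8]
citations = [9, 19, 39, 79]
intercept, slope = fit_log_log(ages, citations)

# A hypothetical 2-year-old paper with 39 citations sits at twice the trend.
rci = relative_citation_index(2, 39, intercept, slope)
```

Under this reading, a value above 1 marks a paper cited more than is typical for its age, which is what lets recent work be compared with older work.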

Key Results

Analysis of current benchmarks reveals a significant performance gap between retrieving relevant memories and effectively generating answers based on them.

Benchmark	Metric	Retrieval (Recall@5)	Generation (F1)	Δ
2Wiki / MemoryBank	Recall@5 vs F1 Gap	90	60	-30

RCI analysis highlights the growing importance of efficient context processing topics.

Benchmark	Metric	Baseline	This Paper	Δ
Publication Corpus	RCI (Impact)	N/A	High RCI	N/A

Experiment Figures

Detailed mapping of specific papers and methods to the taxonomy of memory operations and research topics.

Main Takeaways

High retrievability does not guarantee effective generation; the 'retrieval-generation gap' is a major bottleneck caused by poor condensation and temporal reasoning.
Current benchmarks are static: they test QA accuracy but largely ignore the dynamic lifecycle of memory (updating, forgetting, consolidating) over long horizons.
Memory is not just storage; 'Parametric Memory Modification' (editing weights) and 'Contextual Memory' (RAG/Cache) are converging into hybrid systems.
Personalization remains difficult due to the tension between specializing a model (via memory/adapters) and retaining general pre-trained capabilities.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and their inference process
Familiarity with Retrieval-Augmented Generation (RAG) workflows
Basic knowledge of cognitive science memory concepts (episodic, semantic, procedural)

Key Terms

Parametric Memory: Knowledge implicitly stored within the model's neural network weights, acquired during training

Contextual Memory: Explicit external information (text, databases, KV cache) provided to the model during inference

KV Cache: Key-Value cache—temporary storage of intermediate token representations during inference to speed up generation (short-term memory)

Consolidation: The process of transforming short-term experiences/observations into persistent long-term storage

RCI: Relative Citation Index—a metric used in this survey that normalizes citation counts by publication age to compare impact across different years

RAG: Retrieval-Augmented Generation; systems that first retrieve relevant documents and then condition the generated answer on them

Episodic Memory: Storage of temporally anchored experiences, such as dialogue histories and event sequences

Semantic Memory: Storage of facts and general knowledge, often in knowledge graphs or model parameters

Procedural Memory: Memory of how to perform tasks or use tools, often implicit in trained weights or explicit in stored trajectories

Working Memory: A dynamic control mechanism integrating short-term caches and activated long-term knowledge for real-time reasoning

KV Cache Eviction: Techniques to selectively remove less important tokens from the KV cache to manage memory footprint in long-context tasks
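As a rough illustration of the eviction idea, the sketch below keeps the most recent tokens plus the older tokens with the highest importance scores, up to a fixed budget. This is one common heuristic rather than a method attributed to the survey; the importance scores (for instance, accumulated attention mass), the function names, and the list-based cache are all hypothetical.

```python
def select_positions_to_keep(importance, budget, recent=2):
    """Return the sorted cache positions to retain.

    importance: one hypothetical importance score per cached token.
    budget:     maximum number of positions to keep.
    recent:     number of most recent positions that are always kept.
    """
    n = len(importance)
    if n <= budget:
        return list(range(n))
    recent = min(recent, budget)
    recent_positions = list(range(n - recent, n))
    older = range(n - recent)
    # Among older tokens, keep those with the highest importance.
    ranked = sorted(older, key=lambda i: importance[i], reverse=True)
    top_older = ranked[: budget - recent]
    return sorted(top_older + recent_positions)


def evict(keys, values, importance, budget, recent=2):
    """Drop cache entries that are not selected, preserving token order."""
    keep = select_positions_to_keep(importance, budget, recent)
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy cache of six tokens, squeezed to a budget of four entries.
keys = ["k0", "k1", "k2", "k3", "k4", "k5"]
values = ["v0", "v1", "v2", "v3", "v4", "v5"]
importance = [0.9, 0.1, 0.4, 0.05, 0.2, 0.3]
kept_keys, kept_values, kept_positions = evict(keys, values, importance, budget=4)
```

Always retaining a recent window is a design choice made here so that the tokens nearest the generation point are never evicted, whatever their score.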