📖 What are Memory-Augmented LLMs?
Memory in LLMs addresses how AI systems store, organize, retrieve, and evolve information across interactions to enable personalization, long-term coherence, and learning from experience.
💡 Why it Matters
As LLMs evolve from stateless single-turn tools to persistent agents managing long-running tasks, effective memory becomes critical for maintaining coherence across sessions, personalizing responses to individual users, and enabling agents to learn from accumulated experience without costly retraining.
🎯 Key Paradigms
Research on structuring and storing past interactions and knowledge for LLM agents, covering linear buffers with learned eviction, multi-layered hierarchies inspired by human cognition, tree/graph-based associative structures, parameter internalization via LoRA adapters, and compression through summarization and consolidation.
Methods for retrieving information from past interactions and logs to answer users' recall questions and help them remember multi-modal memories, spanning sparse and dense memory QA, conversational memory retrieval, multi-modal recall, and temporal-episodic reconstruction.
Memory systems designed specifically for LLM-based agents, enabling persistent state across sessions, experience replay and reflection for continuous improvement, memory-augmented planning, coordinated memory sharing across multi-agent teams, and formal evaluation of agent memory effectiveness.
📚 Related Fields
- Agentic AI
- Retrieval-Augmented Generation (RAG)
📅 Field Evolution Timeline
Foundational memory-augmented architectures, OS-inspired KV cache management, and early cognitive memory models
- End-to-End Memory Networks (MemN2N, 2015) introduced fully differentiable multi-hop attention over external memory, establishing the paradigm of differentiable memory access that influenced all subsequent memory-augmented architectures.
- Transformer Feed-Forward as Key-Value Memory (FFN-as-KV, 2021) reinterpreted two-thirds of Transformer parameters as key-value memory stores, providing the theoretical foundation for knowledge editing and memory manipulation research.
- PagedAttention (vLLM, 2023) revolutionized KV cache management by applying OS virtual memory paging concepts to GPU memory, achieving 2–4× throughput improvement and becoming the standard for production LLM serving.
- MemGPT (MemGPT, 2023) pioneered treating the LLM context window as RAM with external storage as disk, enabling self-directed memory paging and inspiring the OS-inspired memory paradigm adopted by subsequent systems.
Systematic taxonomies for agent memory, neuroscience-inspired retrieval, memory-efficient training breakthroughs, and first long-term benchmarks
- The first comprehensive agent memory survey (Memory Survey, 2024) established a unified taxonomy deconstructing memory into Sources, Forms, and Operations, providing shared vocabulary for the fragmented field.
- HippoRAG (HippoRAG, 2024) introduced neurobiologically-inspired graph retrieval using knowledge graphs with Personalized PageRank, achieving 20% gains on multi-hop QA at 10–20× lower cost than iterative methods.
- GaLore (GaLore, 2024) democratized LLM pre-training by projecting gradients into low-rank subspaces, reducing optimizer memory by 65% and enabling 7B model training on a single 24GB consumer GPU.
- LoCoMo (LoCoMo, 2024) established the first very long-term conversational memory benchmark with 300+ turn dialogues, revealing that even frontier LLMs lag behind humans by 56–73% on memory tasks.
Reinforcement learning emerges as the dominant training paradigm for memory management, with breakthroughs in experience-based self-evolution, implicit personalization, and cross-framework knowledge sharing
- MACLA (MACLA, 2025) demonstrated that external procedural memory built in 56 seconds can outperform models 10× larger, achieving 78.1% average across four benchmarks through contrastive refinement of success/failure pairs.
- Memory-R1 (Memory-R1, 2025) was the first to apply GRPO-based RL to memory ADD/UPDATE/DELETE operations, achieving +28.5% F1 improvement on LoCoMo with only 152 training examples.
- PersonaMem-v2 (PersonaMem-v2, 2025) showed that RL-trained agentic memory enables a 4B model to outperform GPT-5 on implicit personalization while using 16× fewer tokens, establishing a new paradigm for efficient user modeling.
- AGENTKB (AGENTKB, 2025) created a universal cross-framework memory layer enabling knowledge transfer across incompatible agent architectures, with +18.7pp improvement on GAIA and +17.0pp on SWE-bench Lite.
Models learn to actively manage their own context, formal memory governance emerges, and evaluation reveals critical gaps between static recall and active memory-guided decisions
- StateLM (StateLM, 2026) introduced the Pensieve paradigm where models self-manage context via read-note-delete cycles, achieving 52% on deep research tasks where standard LLMs score only 5%.
- Pichay (Pichay, 2026) applied OS demand-paging principles to LLM context, reducing context consumption by 93% in production with only 0.025% page fault rate.
- MemoryArena (MemoryArena, 2026) revealed that agents with near-perfect static memory scores fail dramatically on interdependent multi-session tasks, fundamentally redefining how memory should be evaluated.
- Context Channel Capacity (CCC, 2026) proved an Impossibility Triangle: zero forgetting, online learning, and finite parameters cannot coexist for sequential state-based learners.
Memory Organization
What: Research on structuring, storing, and retrieving past interactions and knowledge for LLM-based agents, covering both inference-level memory management (KV cache) and cognitive-level memory design for recall QA and personalized conversation.
Why: As LLMs evolve from stateless single-turn tools to persistent agents with long-term interactions, effective memory organization becomes critical for maintaining coherence, personalization, and the ability to learn from experience without retraining.
Baseline: Conventional approaches either feed the entire conversation history into the prompt (computationally expensive and unscalable beyond context limits) or use flat vector similarity search over stored text chunks (shallow retrieval that misses implicit preferences and dispersed context).
- Deciding what to store, update, or delete without explicit supervision—most user preferences are expressed implicitly across many sessions
- Scaling retrieval precision as memory grows, since larger memory banks introduce more noise and irrelevant matches
- Balancing memory persistence with adaptability—agents must retain useful knowledge while overwriting outdated information when circumstances change
- Bridging the gap between low-level inference memory (KV cache management) and high-level cognitive memory (experiences, preferences, procedures)
🧪 Running Example
Baseline: A standard RAG system performs vector similarity search for 'restaurant anniversary' and retrieves recent mentions of restaurants from unrelated conversations (e.g., a lunch recommendation), missing the actual preference buried in a weekend-planning session from weeks ago.
Challenge: The preference was stated implicitly ('that new Italian place looked amazing for a special occasion'), never labeled as a preference, and surrounded by unrelated discussion topics. Simple similarity search lacks the depth to reconstruct this episodic memory.
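To make this failure mode concrete, the following is a minimal sketch of flat similarity retrieval. A bag-of-words cosine score stands in for a learned embedding model, and the stored snippets, the query wording, and the helper names (`embed`, `cosine`, `retrieve`) are illustrative assumptions rather than part of any cited system.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase bag-of-words counts (assumed stand-in for a dense encoder)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, memory, k=1):
    """Flat retrieval: rank every stored chunk by similarity to the query."""
    q = embed(query)
    return sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

# Hypothetical memory bank: one explicit restaurant mention, one implicit preference.
memory = [
    "Any restaurant recommendation for lunch near the office today?",
    "That new Italian place looked amazing for a special occasion.",
    "Remind me to renew my gym membership next week.",
]
top = retrieve("restaurant for our anniversary", memory, k=1)
print(top[0])  # the lunch request outranks the implicit preference
```

The lunch snippet wins because it shares the surface word "restaurant" with the query, while the implicit preference shares almost no vocabulary with it. A dense encoder narrows this gap but, as the challenge above notes, does not reliably close it.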
📈 Overall Progress
Memory has evolved from a passive storage problem (KV cache paging) to an active cognitive capability where agents learn to manage their own memory through reinforcement learning and self-context engineering.
📂 Sub-topics
KV Cache & Inference Memory Management
10 papers
Methods for efficiently allocating, compressing, and retrieving Key-Value cache data during LLM inference, including paged memory, hierarchical indexing, and learned eviction policies.
Agent Memory Architecture & Taxonomies
8 papers
Surveys, taxonomies, and systematic evaluations of memory structures for LLM-based agents, including classification by form, function, and dynamics.
Experience-Driven Memory & Self-Evolution
12 papers
Methods that enable agents to accumulate, refine, and reuse experience across tasks through external memory, including RL-optimized memory management, procedural memory learning, and active context engineering.
Personalized & Conversational Memory
8 papers
Memory systems designed for long-term dialogue and user personalization, including dual-process retrieval, segment-level compression, persona management, and structured data selection.
Embodied & Domain-Specific Memory
8 papers
Memory architectures tailored for robotics, vision-language navigation, video generation, and clinical AI, where memory must encode spatial, temporal, or procedural information beyond text.
Parametric & Scalable Memory
5 papers
Approaches that embed memory directly into model parameters or use learned memory layers, including product-key memory, memory distillation, and continuous abstraction mechanisms.
💡 Key Insights
💡 Memory management is shifting from a passive storage problem to an active cognitive skill that agents can learn through reinforcement learning.
💡 Positional encoding matters more than semantic content for KV cache importance scoring—where a token appears outweighs what it contains.
💡 RL-trained memory policies generalize dramatically: models trained on 30k tokens perform well at 400k+ tokens without degradation.
💡 External procedural memory built in seconds can outperform models 10x larger by decoupling reasoning from adaptation.
💡 Segment-level memory granularity consistently outperforms both turn-level and session-level retrieval for long conversations.
💡 The field is converging on memory as the locus of agent identity—models are replaceable vessels, but memory persists and defines the self.
📅 Timeline
Early work (2020-2023) focused on efficient memory allocation for inference. 2024 brought systematic taxonomies and learned compression methods. 2025 saw the rise of RL-optimized memory management and experience-based self-evolution. By early 2026, the field converges on agents that actively curate their own context, with deterministic retrieval replacing fuzzy search and identity-preserving architectures emerging.
- (PagedAttention, 2023) revolutionized KV cache management by applying OS virtual memory concepts to GPU memory, achieving 2-4x throughput improvement and near-zero waste
- (UMGR, 2020) pioneered memory graph reasoning for conversational recommendation, unifying offline user history and online dialog state in a single heterogeneous graph
- (Coop, 2023) co-optimized tensor allocation and rematerialization, reducing memory fragmentation to under 5% for large model training
- (InRecAgent, 2023) introduced a shared Candidate Bus memory for recommendation agents, replacing expensive in-context item lists with external storage
- (Memory Survey, 2024) established a unified taxonomy deconstructing memory into Sources, Forms, and Operations
- vAttention (vAttention, 2024) replaced PagedAttention's non-contiguous design with GPU virtual memory mapping, improving throughput by up to 1.99x over vLLM
- (NAMMs, 2024) evolved neural memory models that outperformed full-context Llama-3-8B by +11% on LongBench while reducing cache size
- (Memory Layers, 2024) proved parametric memory viable at 128 billion parameters, doubling factual accuracy over dense baselines
- (MemoRAG, 2024) introduced dual-system global memory augmented retrieval, outperforming GPT-4-128k on summarization tasks by a large margin
- (MACLA, 2025) demonstrated that external procedural memory built in 56 seconds can outperform models 10x larger, achieving 78.1% average across four benchmarks
- (DC, 2025) enabled test-time learning where GPT-4o improved from 10% to 99% on Game of 24 by curating a persistent memory buffer
- Memory-R1 (Memory-R1, 2025) applied GRPO-based RL to memory operations, achieving +28.5% F1 improvement on LoCoMo with only 152 training examples
- PersonaMem-v2 (PersonaMem-v2, 2025) showed that RL-trained agentic memory enables a 4B model to outperform GPT-5 on implicit personalization using 16x fewer tokens
- (SeCom, 2025) established segment-level memory as superior to both turn-level and session-level retrieval for long conversations
- SAM2Act (SAM2Act, 2025) achieved 94.3% on memory-dependent robotic tasks by integrating explicit memory banks into visual-motor policies
- (StateLM, 2026) introduced the Pensieve paradigm where models self-manage context via read-note-delete cycles, achieving 52% on deep research tasks versus 5% for standard LLMs
- (PRECEPT, 2026) replaced fuzzy natural language retrieval with deterministic exact-match rule lookup and Bayesian conflict resolution, gaining +41pp over Reflexion on hard tasks
- (Arbiter, 2026) revealed that agent system prompts contain critical memory-related bugs (including data loss in Gemini CLI) detectable via formal analysis for just $0.27
- (DxEvolve, 2026) demonstrated self-evolving clinical diagnosis surpassing human experts (90.4% vs 88.8%) by accumulating diagnostic cognition primitives as memory
- (LycheeCluster, 2026) achieved 3.6x inference speedup through structure-aware hierarchical KV indexing with mathematical safety guarantees
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Paged & Virtual Memory for KV Cache | Treat GPU KV cache like OS virtual memory—allocate physical pages on demand and map them to contiguous virtual addresses, eliminating fragmentation without modifying attention kernels. | Static contiguous allocation that wastes 60-80% of KV cache memory due to over-provisioning for unknown sequence lengths | Efficient Memory Management for Large... (2023), vAttention: Dynamic Memory Management for... (2024), MemServe (2024) |
| Learned & Hierarchical KV Cache Compression | Replace hand-designed cache eviction heuristics with learned models or hierarchical indices that mathematically bound attention scores to safely prune irrelevant cache entries. | Fixed-window or attention-score-based heuristics (like H2O or SnapKV) that miss tokens critical for future decoding steps | LycheeCluster (2026), Where Matters More Than What:... (2026), An Evolved Universal Transformer Memory (2024) |
| RL-Optimized Memory Management | Optimize memory management decisions directly against final answer correctness using RL, letting the agent learn what to store and when to update without explicit supervision. | Static heuristic-based or prompt-instructed memory management that fails to adapt to diverse interaction patterns | Memory-R1 (2025), Mem-α: Training LLMs to Manage... (2025), PersonaMem-v2 (2025) |
| Active Context Engineering | Give the model a deleteContext tool so it can actively curate its working memory, distilling raw input into notes and freeing space for new information. | Standard LLMs that monotonically accumulate context until hitting length limits, then either truncate or fail | StateLM (2026), Distilling Feedback into Memory-as-a-Tool (2026) |
| External Procedural Memory with Contrastive Refinement | Decouple reasoning from learning by storing reusable procedures externally and refining them via contrastive analysis of success/failure pairs. | Parameter fine-tuning approaches that are expensive, entangle reasoning with adaptation, and risk catastrophic forgetting | MACLA (2025), Dynamic Cheatsheet (2025), RetroAgent (2026) |
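The on-demand page allocation in the first row of the table can be sketched as a toy block-table allocator. This is a simplified illustration in the spirit of paged KV caching, not the vLLM or vAttention implementation: the class name `PagedKVAllocator`, its methods, and the page sizes are all assumptions, and the sketch tracks only page bookkeeping, not actual key/value tensors.

```python
class PagedKVAllocator:
    """Toy block-table allocator: fixed-size pages are handed out only when needed."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.block_tables = {}                    # seq_id -> list of physical pages
        self.lengths = {}                         # seq_id -> number of tokens written

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a page only if the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:          # no pages yet, or current page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop(0))
        self.lengths[seq_id] = length + 1
        # Return (physical page, offset within page) for the new token.
        return table[length // self.page_size], length % self.page_size

    def free(self, seq_id):
        """Return all pages of a finished sequence to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=4, page_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]  # sequence "a" spans two pages
slots_b = [alloc.append_token("b") for _ in range(2)]  # sequence "b" fits in one page
print(slots_a, slots_b)
```

Because pages are claimed one at a time, a sequence never reserves more than one partially filled page, which is the property that avoids the over-provisioning described in the "Improves On" column.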
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | F1 Score | +28.5% F1 over MemoryOS baseline | Memory-R1 (2025) |
| ALFWorld | Success Rate | 90.3% | MACLA (2025) |
| Needle-in-a-Haystack (NIAH) | Accuracy | 99.5% with 3% KV budget (256 tokens) | Where Matters More Than What:... (2026) |
⚠️ Known Limitations (5)
- Evaluation fragmentation: benchmarks are scattered across conversational QA, agentic tasks, and inference efficiency with no unified evaluation framework, making cross-method comparison difficult and rewarding methods optimized for narrow metrics. (affects: RL-Optimized Memory Management, External Procedural Memory with Contrastive Refinement, Segment-Level Memory with Compression)
Potential fix: Unified evaluation suites like Evo-Memory and ATOD that test memory across multiple dimensions (retention, rewriting, generalization) in a single framework.
- Memory rewriting remains underexplored: most systems excel at retaining information but struggle to selectively overwrite outdated content when circumstances change, leading to stale or contradictory memory states. (affects: RL-Optimized Memory Management, Dual-Process Adaptive Retrieval, Segment-Level Memory with Compression)
Potential fix: Diagnostic benchmarks like Endless T-Maze that explicitly test overwrite capabilities, and Bayesian conflict detection mechanisms (as in PRECEPT) to identify and resolve stale knowledge.
- Scalability of learned memory policies: RL-based memory management requires significant training infrastructure and may not transfer across different LLM architectures or domains without retraining. (affects: RL-Optimized Memory Management, Learned & Hierarchical KV Cache Compression)
Potential fix: Universal memory models like NAMMs that transfer zero-shot across architectures by operating on attention patterns rather than token embeddings.
- Privacy and governance gaps: as memory systems become more persistent and personal, there are no established mechanisms for memory governance, user consent, data minimization, or secure inheritance when models are upgraded. (affects: Dual-Process Adaptive Retrieval, RL-Optimized Memory Management, Segment-Level Memory with Compression)
Potential fix: Constitutional memory architectures with immutable identity layers and formal inheritance protocols, combined with matroid-based data minimization for provable privacy guarantees.
- Implicit preference capture: most memory systems are designed for explicit fact storage but struggle to identify and extract user preferences expressed indirectly through behavior patterns rather than explicit statements. (affects: Segment-Level Memory with Compression, Active Context Engineering (Pensieve Paradigm))
Potential fix: RL-trained memory creation (PersonaMem-v2) that learns to detect and store implicit preferences, and adaptive retrieval (RF-Mem) that uses iterative reconstruction for ambiguous queries.
📚 View major papers in this topic (10)
- Efficient Memory Management for Large Language Model Serving with PagedAttention (2023-09) 9
- StateLM: To the Rescue of Long-Horizon Reasoning with Recursive Memory (2026-02) 9
- MACLA: Memory-Augmented Contrastive Learning Agent (2025-01) 9
- PRECEPT: Planning Resilience via Experience, Context Engineering & Probing Trajectories (2026-03) 9
- PersonaMem-v2: Implicit Personas (2025-12) 9
- Emulating Clinician Cognition via Self-Evolving Deep Clinical Research (2026-03) 9
- SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation (2025-01) 9
- Memory in the Age of AI Agents: A Survey (2026-03) 9
- Arbiter: Detecting Interference in LLM Agent System Prompts (2026-03) 9
- Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning (2025-08) 8
💡 Having established the broad landscape of memory organization challenges, we begin with the most fundamental approach: Linear Memory, which maintains a sequential, bounded buffer of experiences and learns when to discard or update entries to keep the agent's working memory compact and relevant.
Linear Memory
What: Linear memory organizes an agent's accumulated experience as a sequential, bounded buffer where entries are created, updated, or discarded over time, rather than stored in graphs or unbounded logs.
Why: As LLM agents tackle long-horizon, multi-turn tasks, unbounded context growth causes quadratic compute costs, performance degradation beyond training lengths, and hallucination from irrelevant history. Linear memory provides a principled way to maintain a compact, relevant working memory.
Baseline: The standard approach is full-context prompting, which appends all prior turns and observations to the prompt. This works for short interactions but fails as history grows, causing context overflow, increased latency, and degraded reasoning quality.
- Deciding what to discard: identifying which memories are no longer relevant without losing information needed for future reasoning
- Balancing compression and fidelity: summarizing or overwriting memory inevitably loses detail, risking the loss of critical facts
- Training memory policies: supervising memory management decisions is difficult because ground-truth labels for 'what to remember' rarely exist
- Generalization across tasks: memory strategies learned on one task distribution often fail to transfer to new domains or longer horizons
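The difference between a static eviction heuristic and a policy that decides what to discard can be sketched with a bounded buffer that accepts a pluggable victim-selection function. Everything here is an illustrative assumption: the `LinearMemory` class, the `choose_victim` hook, and the hand-written `keep_flagged` rule, which only stands in for a learned policy and marks important entries with a leading `!`.

```python
class LinearMemory:
    """Bounded sequential buffer; `choose_victim` picks which entry to discard when full."""

    def __init__(self, capacity, choose_victim=None):
        self.capacity = capacity
        # Default policy is FIFO: always evict the oldest entry (index 0).
        self.choose_victim = choose_victim or (lambda entries: 0)
        self.entries = []

    def write(self, entry):
        if len(self.entries) >= self.capacity:
            self.entries.pop(self.choose_victim(self.entries))
        self.entries.append(entry)

def keep_flagged(entries):
    """Hand-written stand-in for a learned policy: evict the oldest entry not marked '!'."""
    for i, entry in enumerate(entries):
        if not entry.startswith("!"):
            return i
    return 0  # everything is flagged, so fall back to FIFO

turns = ["!budget is $5k", "weather chat", "lunch chat", "!deadline is Friday", "sports chat"]
fifo = LinearMemory(capacity=3)
scored = LinearMemory(capacity=3, choose_victim=keep_flagged)
for turn in turns:
    fifo.write(turn)
    scored.write(turn)

print(fifo.entries)    # the budget fact has been evicted
print(scored.entries)  # both flagged facts survive
```

In the RL-trained systems discussed in this section, the role of `choose_victim` would be played by a policy optimized against task outcomes rather than a fixed rule, which is what removes the need for ground-truth labels on what to remember.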
🧪 Running Example
Baseline: A full-context system attempts to include all 200+ turns in the prompt. The context exceeds the model's window, so early turns (including Monday's discussion) are truncated. The agent either hallucinates an answer or fails to connect the two discussions.
Challenge: The relevant information spans two specific turns separated by hundreds of irrelevant exchanges. The agent must have retained both pieces of information despite processing many unrelated turns in between, and must be able to retrieve and synthesize them on demand.
📈 Overall Progress
The field shifted from static heuristic memory (append or FIFO eviction) to RL-trained policies that autonomously learn what to remember and forget, achieving roughly 400x context extrapolation (from 8K-token training contexts to 3.5M-token tasks in MemAgent).
📂 Sub-topics
RL-Driven Memory Policies
7 papers
Using reinforcement learning to train agents to autonomously decide when to create, update, or discard memory entries, treating memory management as a sequential decision-making problem optimized through outcome rewards.
Memory Compression and Consolidation
5 papers
Techniques that compress accumulated interaction history into bounded, information-dense representations through summarization, thought extraction, or learned consolidation, maintaining constant memory usage regardless of input length.
Experience-Based Self-Improvement
5 papers
Systems that mine past agent execution trajectories to extract reusable procedural knowledge (strategies, recovery tips, rules), building a growing memory bank of actionable lessons from experience.
Neural Memory Architectures
6 papers
Architecture-level designs that augment Transformer models with explicit, differentiable memory modules using gated read/write mechanisms inspired by LSTMs or Hadamard operations.
💡 Key Insights
💡 RL-trained memory policies consistently outperform static heuristics, enabling models to learn what to forget purely from task outcomes.
💡 Fixed-size memory buffers with learned overwrite policies can extrapolate from 8K training contexts to millions of tokens.
💡 Jointly optimizing memory extraction and management prevents noise accumulation that degrades performance over time.
💡 Storing pre-computed conclusions rather than raw text eliminates redundant re-reasoning and reduces retrieval costs.
💡 Dual-memory architectures (fast episodic + slow parametric) combine rapid adaptation with long-term generalization.
💡 Decomposing memory into atomic CRUD operations provides a flexible, learnable framework that scales to unseen context lengths.
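As a rough illustration of the last insight, memory management can be framed as a small action space of create, read, update, and delete operations over a store. The sketch below is an assumption-laden toy, not the interface of AtomMem or any other cited system: `MemoryStore`, `apply_ops`, and the keyword-based `read` are hypothetical, and the operation sequence is hard-coded where a trained policy would emit it.

```python
class MemoryStore:
    """Minimal store exposing atomic create/read/update/delete operations."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def create(self, text):
        entry_id = self._next_id
        self._entries[entry_id] = text
        self._next_id += 1
        return entry_id

    def read(self, keyword):
        """Return entries containing the keyword (case-insensitive substring match)."""
        return {i: t for i, t in self._entries.items() if keyword.lower() in t.lower()}

    def update(self, entry_id, text):
        if entry_id not in self._entries:
            raise KeyError(entry_id)
        self._entries[entry_id] = text

    def delete(self, entry_id):
        self._entries.pop(entry_id, None)

def apply_ops(store, ops):
    """Apply a sequence of (operation, *args) actions, e.g. as emitted by a policy."""
    handlers = {"create": store.create, "update": store.update, "delete": store.delete}
    for name, *args in ops:
        handlers[name](*args)

store = MemoryStore()
apply_ops(store, [
    ("create", "User lives in Berlin"),
    ("create", "User likes jazz"),
    ("update", 0, "User lives in Munich"),  # overwrite a stale fact
    ("delete", 1),                          # drop an entry judged irrelevant
])
print(store.read("munich"))
```

Keeping each operation atomic means the same small action space applies regardless of how many documents or turns have been processed, which is one plausible reading of why such policies scale to unseen context lengths.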
📅 Timeline
Early work (2023) established dual-memory architectures and thought-based storage as alternatives to raw-text memory. By 2025, reinforcement learning emerged as the dominant training paradigm for memory management, enabling models to learn discard/update policies purely from task-outcome rewards. The latest work (2026) focuses on unifying memory extraction and management into jointly optimized frameworks with atomic operations.
- (Two-Memory, 2023) introduced a dual-system architecture combining fast episodic control with slow parametric RL, demonstrating that data sharing between the two systems accelerates learning
- (TAI, 2023) developed a teachable AI framework that learns user preferences from cold start through multi-turn seeker-provider interactions, achieving 97.4% turn-level accuracy
- (TiM, 2023) pioneered storing pre-computed thoughts instead of raw text, with insert/forget/merge operations for memory maintenance
- Talk2Drive (Talk2Drive, 2023) demonstrated the first LLM-based personalization memory in a real-world autonomous vehicle, reducing driver takeover by 75.9%
- (HMF, 2024) introduced element-wise Hadamard products for numerically stable memory updates, achieving O(log t) processing via parallel prefix scan
- (P-RAG, 2024) demonstrated progressive self-improvement in embodied tasks by building a dynamic experience database from the agent's own interaction history
- LM2 (LM2, 2025) introduced a dual-stream memory Transformer with gated updates, outperforming RMT by 37.1% on BABILong while improving general reasoning on MMLU by 5.0%
- MEM1 (MEM1, 2025) unified reasoning and memory consolidation into a single RL-trained step, achieving 3.5x performance gain with 3.7x memory reduction
- (MemAgent, 2025) achieved the highest breakthrough score in this topic by extrapolating from 8K training context to 3.5M-token tasks with <5% loss using RL-trained memory overwrite
- (SUPO, 2025) made summarization a learnable action within the RL training loop, achieving +14% success rate on BrowseComp-Plus with test-time scaling to 23 summaries
- (GSW, 2025) introduced a neuro-inspired generative semantic workspace that models memory as evolving probabilistic state spaces, outperforming HippoRAG2 by up to 20% in recall
- (AtomMem, 2026) decomposed memory management into atomic CRUD operations optimized via GRPO, scaling robustly to 800 documents (4x training size)
- (UMEM, 2026) jointly optimized memory extraction and management using Semantic Neighborhood Modeling, achieving 82.84% Success Rate on ALFWorld with monotonic performance growth
- (MemPO, 2026) introduced dual-reward RL for self-memory policy optimization, gaining +25.98% F1 while cutting token usage by 67.58%
- (Trajectory-Informed, 2026) achieved 149% relative improvement on AppWorld by extracting typed procedural knowledge from execution logs
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| RL-Optimized Memory Overwrite | Treat memory management as a sequential decision-making problem where an RL-trained policy learns to overwrite a fixed-size buffer, retaining only task-critical information. | Full-context prompting and static memory heuristics (e.g., FIFO eviction or fixed summarization intervals) | MemAgent (2025), AtomMem (2026), MemPO (2026), MEM1 (2025) |
| Summarization-Based Context Management | Periodically compress interaction history into learnable summaries that preserve critical state information while keeping working context within model limits. | Naive context truncation (which loses early information) and full-context prompting (which exceeds window limits) | Scaling LLM Multi-turn RL with... (2025), Think-in-Memory (2023) |
| Trajectory Mining for Procedural Memory | Parse agent execution trajectories to extract typed procedural knowledge (strategies, error recoveries, optimizations) that is reused in similar future situations. | Stateless LLM agents that repeat the same errors and cannot reuse successful strategies across sessions | Trajectory-Informed (2026), Progressive Retrieval Augmented Generation for... (2024), CMMR-VLN (2026) |
| Gated Neural Memory Modules | Add an explicit memory matrix to the Transformer with learnable input/forget/output gates that control what information persists across long sequences. | Standard Transformer attention (which degrades over long contexts) and prior memory-augmented models like Recurrent Memory Transformer (RMT) | LM2 (2025), Stable Hadamard Memory (2024), Tell Me What To Learn:... (2026) |
| Unified Memory Extraction and Management | Jointly train memory extraction and management using semantic neighborhood modeling, ensuring each memory generalizes across similar future queries rather than overfitting to one instance. | Static memory extraction pipelines that treat summarization as a fixed preprocessing step, leading to noise accumulation | UMEM (2026), Enabling On-Device Large Language Model... (2023) |
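The gated read/write mechanism in the "Gated Neural Memory Modules" row can be illustrated with an element-wise update of the form m_new = f * m_old + i * candidate, plus an output gate applied on read. This is a generic LSTM-style sketch under stated assumptions, not the exact update rule of LM2 or Stable Hadamard Memory: the gate logits are fixed by hand here, whereas in a real model they would be produced by learned projections of the input and memory.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_update(memory, candidate, forget_logits, input_logits):
    """Element-wise write: m_new = sigmoid(f) * m_old + sigmoid(i) * candidate."""
    return [
        sigmoid(f) * m + sigmoid(i) * c
        for m, c, f, i in zip(memory, candidate, forget_logits, input_logits)
    ]

def gated_read(memory, output_logits):
    """Element-wise read: the output gate controls how much of each slot is exposed."""
    return [sigmoid(o) * m for m, o in zip(memory, output_logits)]

memory = [1.0, 1.0]
candidate = [5.0, 5.0]
# Slot 0: keep the old content and ignore the candidate.
# Slot 1: forget the old content and write the candidate.
forget_logits = [10.0, -10.0]
input_logits = [-10.0, 10.0]

updated = gated_update(memory, candidate, forget_logits, input_logits)
exposed = gated_read(updated, output_logits=[10.0, -10.0])  # expose slot 0, hide slot 1
print(updated, exposed)
```

With saturated gates, slot 0 stays close to its old value and slot 1 moves close to the candidate, showing how the forget and input gates jointly decide what persists across a long sequence.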
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| RULER (512K tokens) | Accuracy | >95% | MemAgent (2025) |
| BABILong | Average Accuracy | +37.1% over RMT baseline | LM2 (2025) |
| ALFWorld | Success Rate | 82.84% | UMEM (2026) |
⚠️ Known Limitations (5)
- Information loss from compression is irreversible — once a memory entry is overwritten or discarded, the original detail cannot be recovered, which can be catastrophic when the discarded information turns out to be relevant later. (affects: RL-Optimized Memory Overwrite, Summarization-Based Context Management, One-Step Memory Consolidation)
Potential fix: Hierarchical memory with multiple compression levels (recent detail + older summaries) or reversible memory operations like SoLA's key-deletion approach
- RL training for memory policies requires expensive rollouts over long sequences. The reward signal is typically sparse (only final task outcome), making credit assignment to individual memory decisions difficult. (affects: RL-Optimized Memory Overwrite, Atomic Memory Operations via GRPO, Self-Memory Policy Optimization)
Potential fix: MemPO's dense step-level reward (measuring how much memory increases probability of correct answer) and SUPO's sub-trajectory gradient splitting both address sparse reward issues
- Memory strategies learned on one task distribution often fail to generalize to new domains or significantly different context lengths, requiring retraining or domain-specific tuning. (affects: RL-Optimized Memory Overwrite, Trajectory Mining for Procedural Memory)
Potential fix: UMEM's Semantic Neighborhood Modeling enforces generalization by evaluating memory quality across clusters of similar queries; AtomMem demonstrates robust scaling to 4x training context lengths
- Most approaches are evaluated in single-agent, single-task settings. Scaling linear memory to multi-agent systems or concurrent tasks where memory must be shared or partitioned remains unexplored. (affects: RL-Optimized Memory Overwrite, Gated Neural Memory Modules, Feedback-Driven Personalization Memory)
Potential fix: Chow-Liu ordering optimizes shared memory access order in multi-agent chains; future work could extend atomic memory operations to support concurrent read/write from multiple agents
- Evaluation benchmarks for memory quality are limited: most papers evaluate end-task performance rather than directly measuring whether the memory content is optimal, making it hard to diagnose memory failures. (affects: RL-Optimized Memory Overwrite, Unified Memory Extraction and Management, Summarization-Based Context Management)
Potential fix: MemPO's step-level memory quality reward provides a proxy for memory evaluation; future work could develop dedicated memory quality benchmarks
📚 View major papers in this topic (10)
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2025-07) 9
- AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation (2026-01) 8
- UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory (2026-02) 8
- MemPO: Self-Memory Policy Optimization for Long-Horizon Agents (2026-02) 8
- MEM1: Memory-Efficient Mechanism via learning 1-step integrated reasoning and consolidation (2025-06) 8
- Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management (2025-10) 8
- LM2: Large Memory Models (2025-02) 8
- Trajectory-Informed Memory Generation for Self-Improving Agent Systems (2026-03) 8
- Personalized Autonomous Driving with Large Language Models: Field Experiments (2023-12) 8
- The Generative Semantic Workspace: A Neuro-inspired Framework for Episodic Memory in LLMs (2025-12) 8
💡 While linear buffers provide a clean starting point, their single-tier design forces all memories to compete for the same limited capacity, motivating Layered Memory architectures that separate information into distinct tiers—such as core, semantic, and episodic—each with its own retention and retrieval granularity.
Layered Memory
What: Layered Memory research explores multi-tiered memory architectures for LLM agents that separate information into distinct layers—such as core/profile, semantic/factual, and episodic/temporal—with retrieval operating at session, turn, or topic granularity.
Why: Without structured memory layers, LLM agents lose coherence across extended interactions, cannot personalize responses to individual users, and fail to distinguish between recent context and long-term knowledge.
Baseline: The conventional approach uses a flat retrieval-augmented generation (RAG) pipeline that stores all past interactions as undifferentiated text chunks and retrieves them via semantic similarity search, regardless of memory type or temporal context.
- Balancing memory granularity: too fine-grained (turn-level) fragments semantic topics, while too coarse (session-level) loses important details
- Preventing memory degradation over time: older but important memories get buried by newer, less relevant ones without active consolidation and forgetting mechanisms
- Integrating episodic (event-specific, temporal) and semantic (factual, stable) memories in a unified system that supports cross-type reasoning
- Scaling memory systems efficiently: maintaining low latency and token costs as conversation history grows to hundreds of sessions
🧪 Running Example
Baseline: A flat RAG system retrieves the top-k chunks most similar to 'Alex move Seattle,' but misses the wife's job mention because it appeared in a different session with different keywords. It also cannot reason about 'last month' because all chunks lack temporal indexing.
Challenge: The answer spans two separate conversation sessions (Alex's relocation plans and his wife's career update), requires temporal filtering ('last month'), and demands connecting two semantically distinct but narratively linked memories through the shared entity 'Alex.'
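The two requirements named in the challenge, temporal filtering and linking memories through a shared entity, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the design of any system surveyed here: the `LayeredMemory` and `Episode` classes, the three tier names, and the session contents and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class Episode:
    """One episodic memory: what was said, when, and which entities it mentions."""
    text: str
    when: date
    entities: frozenset


@dataclass
class LayeredMemory:
    """Three hypothetical tiers; only the episodic tier is exercised below."""
    core: dict = field(default_factory=dict)      # stable profile facts
    semantic: dict = field(default_factory=dict)  # entity -> set of stable facts
    episodic: list = field(default_factory=list)  # time-stamped events

    def add_episode(self, text, when, entities):
        self.episodic.append(Episode(text, when, frozenset(entities)))

    def recall(self, entity, since, until):
        """Episodes mentioning `entity` within [since, until], oldest first."""
        hits = [e for e in self.episodic
                if entity in e.entities and since <= e.when <= until]
        return sorted(hits, key=lambda e: e.when)


# Made-up sessions loosely following the running example.
memory = LayeredMemory()
memory.add_episode("Alex is planning a move to Seattle", date(2025, 5, 20), {"Alex", "Seattle"})
memory.add_episode("Alex's wife accepted a new job offer", date(2025, 5, 28), {"Alex", "wife"})
memory.add_episode("Alex adopted a dog", date(2024, 11, 2), {"Alex", "dog"})

today = date(2025, 6, 15)
recent = memory.recall("Alex", since=today - timedelta(days=45), until=today)
for episode in recent:
    print(episode.when, episode.text)
```

Because the lookup keys on the entity and the date range rather than on keyword overlap with the query, both sessions are returned even though they share little vocabulary, while the older episode is filtered out.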
📈 Overall Progress
The field evolved from treating memory as flat text retrieval to structured, multi-layered cognitive architectures with active forgetting, RL-driven evolution, and graph-based associative reasoning.
📂 Sub-topics
Episodic-Semantic Memory Integration
12 papers
Systems that explicitly separate and coordinate episodic memory (event-specific, temporally grounded) and semantic memory (factual, stable knowledge), enabling cross-type reasoning.
OS & Hierarchy-Inspired Memory Management
14 papers
Memory architectures modeled after operating system concepts (RAM/disk paging, segmented memory) or explicit multi-tier hierarchies (sensory, short-term, long-term).
Cognitively-Inspired Memory Architectures
12 papers
Memory systems drawing from cognitive science theories—hippocampal indexing, Ebbinghaus forgetting curves, constructivist learning, and event segmentation—to design biologically plausible agent memory.
Self-Evolving & Agentic Memory
10 papers
Memory systems where agents actively manage, curate, and improve their own memory through reinforcement learning, self-reflection, or experience-based optimization.
Memory Retrieval Optimization
10 papers
Methods that improve how memories are accessed—through tool-augmented retrieval, associative graphs, adaptive reranking, or just-in-time synthesis—moving beyond static top-k similarity search.
💡 Key Insights
💡 Layered memory architectures consistently outperform flat retrieval by separating stable knowledge from temporal events.
💡 Active forgetting mechanisms (Ebbinghaus-inspired decay) are essential to prevent memory pollution from outdated information.
💡 Graph-based associative retrieval discovers connections that vector similarity misses, especially for multi-hop reasoning.
💡 RL-optimized memory curation outperforms static rules, enabling agents to learn what to remember without LLM fine-tuning.
💡 Efficiency gains are dramatic: sensory filtering and sleep-time updates reduce token usage by 38-100x with comparable accuracy.
💡 Comprehensive benchmarks reveal 30-60% accuracy gaps between current systems and oracle retrieval, indicating substantial room for improvement.
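As a rough illustration of Ebbinghaus-inspired decay, the sketch below uses the common exponential form R = exp(-t / S), where t is the time since last access and S is a memory strength that grows on each recall. The fixed boost of 1.0 per recall, the pruning threshold, and the dictionary layout are assumptions made for this example, not parameters taken from any cited paper.

```python
import math


def retention(elapsed_hours, strength):
    """Exponential forgetting curve R = exp(-t / S)."""
    return math.exp(-elapsed_hours / strength)


def recall(memory, now):
    """Accessing a memory resets its clock and raises its strength."""
    memory["last_access"] = now
    memory["strength"] += 1.0  # assumed fixed boost per recall


def prune(memories, now, threshold=0.05):
    """Keep only memories whose retention is still at or above the threshold."""
    return [m for m in memories
            if retention(now - m["last_access"], m["strength"]) >= threshold]


memories = [
    {"text": "prefers window seats", "last_access": 0.0, "strength": 24.0},
    {"text": "asked about the weather", "last_access": 0.0, "strength": 2.0},
]
recall(memories[0], now=12.0)  # reinforced: strength becomes 25.0, clock resets
kept = prune(memories, now=24.0)
print([m["text"] for m in kept])
```

The weakly encoded, never-revisited item decays below the threshold and is dropped, while the reinforced preference survives.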
📅 Timeline
Research shifted from simply extending context windows (2023) to designing biologically inspired memory hierarchies with built-in consolidation and forgetting (2024), and most recently toward efficient, self-evolving systems that learn to manage their own memory through reinforcement learning and agentic retrieval (2025-2026).
- (MemGPT, 2023) introduced the OS-inspired virtual context management paradigm, achieving +60.4% accuracy on deep memory retrieval by treating context as RAM and databases as disk
- (LongMem, 2023) proposed a decoupled frozen-backbone + SideNet architecture for stable long-term memory retrieval, achieving state-of-the-art on ChapterBreak
- (MemoryBank, 2023) pioneered Ebbinghaus forgetting curve-based memory decay for natural memory attenuation in LLMs
- (TiM, 2023) introduced storing pre-computed 'thoughts' instead of raw text, decoupling reasoning from retrieval
- (ReadAgent, 2024) demonstrated human-inspired gist memory extending effective context by 3.5-20x while outperforming full-context baselines
- (HippoRAG, 2024) mapped hippocampal pattern completion to knowledge graph retrieval, outperforming single-step methods by up to 20% on multi-hop QA while being 10-20x cheaper
- (xLSTM, 2024) revived the LSTM architecture with matrix memory and exponential gating, outperforming Mamba and Llama at 400M parameters
- (LongMemEval, 2024) established the first comprehensive benchmark for long-term chat memory, revealing 30-60% accuracy gaps in state-of-the-art commercial systems
- (Talk2Drive, 2024) demonstrated layered memory for personalized autonomous driving, reducing driver takeover rates by 75.9% in real-world field experiments
- (Memento, 2025) formalized memory-augmented MDPs with neural case selection, achieving top-1 on the GAIA benchmark with 87.88% Pass@3
- (RMM, 2025) introduced bidirectional reflection—prospective topic decomposition and retrospective citation-based reranker training—improving LongMemEval accuracy by 10%
- Two major surveys (Memory in the Age of AI, 2025; Operational Taxonomy, 2025) unified fragmented terminology with Forms-Functions-Dynamics and six atomic operations frameworks
- (G-Memory, 2025) introduced three-tier hierarchical graph memory for multi-agent systems, improving ALFWorld success rate by 20.89%
- (PersonaAgent, 2025) enabled test-time persona optimization through textual gradient loops, improving personalization by 5.7% on LaMP benchmarks
- (Synapse, 2026) unified episodic-semantic memory with spreading activation and lateral inhibition, reducing token consumption by 95% while achieving 40.5 F1 on LoCoMo
- (MM-Mem, 2026) applied Fuzzy-Trace Theory to create pyramidal multimodal memory, achieving state-of-the-art 63.8% on EgoSchema while outperforming Gemini 1.5 Pro
- (UMEM, 2026) jointly optimized memory extraction and management using semantic neighborhood modeling, achieving 82.84% on ALFWorld
- (TA-Mem, 2026) transformed retrieval into an agentic task with multi-indexed tool selection, improving temporal QA by +7.02 F1 over Mem0
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Virtual Context Management | Apply operating system memory management principles (paging, segmentation, eviction) to manage LLM context as a virtual address space. | Fixed context window approaches that truncate or summarize old conversations | MemGPT (2023), MemoryOS (2025), LightMem (2025) |
| Neurocognitive Memory Architectures | Map specific brain regions and cognitive theories onto computational memory components to achieve biologically plausible knowledge management. | Flat vector stores with no forgetting or consolidation mechanisms | HippoRAG (2024), Nemori (2025), MemoryBank (2023), A Miniature Brain Transformer (2026) |
| Graph-Based Associative Retrieval | Replace static similarity search with dynamic graph traversal that discovers associative connections between memories. | Standard top-k vector retrieval that misses structurally linked but semantically distant memories | Synapse (2026), AssoMem (2025), The Generative Semantic Workspace: A... (2025) |
| Pyramidal Multi-Resolution Memory | Store information at multiple abstraction levels and retrieve top-down, expanding details on demand rather than processing everything upfront. | Single-resolution memory that either stores raw data (expensive) or summaries (lossy) | From Verbatim to Gist: Distilling... (2026), A Human-Inspired Reading Agent with... (2024), Enhancing Web Agents with a... (2026) |
| Self-Evolving Memory with Reinforcement Learning | Train a memory retrieval and curation policy via RL rewards, allowing the agent to learn from experience without fine-tuning the LLM. | Static memory management rules and fixed retrieval heuristics that cannot adapt | Memento (2025), UMEM (2026), G-Memory (2025) |
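
The Virtual Context Management row above treats the context window like RAM and external storage like disk. A toy version of that idea is sketched below. It is not MemGPT's actual interface: the `VirtualContext` class, the whitespace token count, first-in-first-out eviction, and keyword search over the archive are all simplifying assumptions.

```python
from collections import deque


class VirtualContext:
    """Toy pager: a bounded in-context queue backed by an unbounded archive."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.main = deque()  # messages currently in context ("RAM")
        self.archive = []    # evicted messages ("disk")

    @staticmethod
    def count_tokens(text):
        # Crude stand-in for a real tokenizer (assumption).
        return len(text.split())

    def used(self):
        return sum(self.count_tokens(m) for m in self.main)

    def append(self, message):
        """Add a message, evicting the oldest ones once the budget is exceeded."""
        self.main.append(message)
        while self.used() > self.max_tokens and len(self.main) > 1:
            self.archive.append(self.main.popleft())

    def page_in(self, keyword, limit=1):
        """Fetch archived messages containing the keyword (case-insensitive)."""
        hits = [m for m in self.archive if keyword.lower() in m.lower()]
        return hits[:limit]


ctx = VirtualContext(max_tokens=6)
ctx.append("my dog is named Rex")
ctx.append("I live in Oslo")
ctx.append("what is my dog called")
print(list(ctx.main), ctx.page_in("rex"))
```

Real systems replace the eviction rule and the archive lookup with learned or LLM-driven policies; the point here is only the separation between a small working set and a larger searchable store.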
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | F1 / BLEU-1 | +49.11% F1 improvement over baselines | MemoryOS (2025) |
| LongMemEval | Accuracy | 70.4% accuracy | In Prospect and Retrospect: Reflective... (2025) |
| GAIA (General AI Assistants) | Pass@3 / Accuracy | 87.88% Pass@3 (validation), 79.40% (test) | Memento (2025) |
⚠️ Known Limitations (5)
- Benchmark coverage gaps: Most benchmarks focus on factual recall and multi-hop QA, neglecting dynamic memory operations like updating, forgetting, and conflict resolution that are critical for real-world use. (affects: Virtual Context Management, Graph-Based Associative Retrieval, Neurocognitive Memory Architectures)
Potential fix: Design benchmarks that explicitly test memory update, conflict resolution, and forgetting operations over extended timelines, as proposed by StructMemEval and LongMemEval.
- Memory poisoning from bad experiences: Agents that naively store all past interactions accumulate incorrect examples that degrade future performance through 'experience-following' behavior where agents blindly copy retrieved outputs. (affects: Self-Evolving Memory with Reinforcement Learning, Virtual Context Management)
Potential fix: Use strict trajectory evaluators or fine-tuned quality judges to gate memory additions, as demonstrated by regulated memory management achieving +32.4% improvement over add-all baselines.
- Scalability vs. precision trade-off: Graph-based and hierarchical memory systems provide better retrieval quality but introduce higher construction and maintenance costs as memory grows to hundreds of thousands of interactions. (affects: Graph-Based Associative Retrieval, Neurocognitive Memory Architectures, Pyramidal Multi-Resolution Memory)
Potential fix: Hybrid approaches combining lightweight online updates with offline consolidation (sleep-time processing), as implemented by LightMem and Memory Bear.
- Inability to spontaneously recognize needed memory structures: LLMs struggle to identify when and how to organize memory hierarchically without explicit hints, even when provided with memory tools. (affects: Virtual Context Management, Self-Evolving Memory with Reinforcement Learning)
Potential fix: Provide memory organization hints or use meta-cognitive prompting to guide agents in recognizing structural requirements before task execution.
- Fragmented evaluation: No standardized comparison framework exists across different memory architectures, making it difficult to compare approaches that use different benchmarks, metrics, and LLM backends. (affects: Virtual Context Management, Graph-Based Associative Retrieval, Neurocognitive Memory Architectures, Self-Evolving Memory with Reinforcement Learning)
Potential fix: Adopt unified evaluation protocols with standardized benchmarks (LoCoMo, LongMemEval) and controlled LLM backends for fair cross-method comparison.
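The memory-poisoning limitation above suggests gating writes behind a quality check rather than storing every trajectory. A minimal sketch of such a gate follows; the `success_rate` scorer, the trajectory layout, and the 0.8 threshold are hypothetical stand-ins for the trajectory evaluators or fine-tuned judges the text mentions.

```python
def success_rate(trajectory):
    """Hypothetical quality score: fraction of steps that succeeded."""
    steps = trajectory["steps"]
    return sum(ok for _, ok in steps) / len(steps) if steps else 0.0


def gated_add(store, trajectory, score_fn=success_rate, threshold=0.8):
    """Append the trajectory only if its score clears the threshold."""
    if score_fn(trajectory) >= threshold:
        store.append(trajectory)
        return True
    return False


store = []
good = {"task": "book flight", "steps": [("search", True), ("pay", True)]}
bad = {"task": "book hotel", "steps": [("search", True), ("pay", False)]}
print(gated_add(store, good), gated_add(store, bad), len(store))
```

Swapping `score_fn` for a learned judge changes the quality signal without touching the storage path, which keeps the gate easy to audit.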
📚 Major papers in this topic (10)
- MemGPT: Towards LLMs as Operating Systems (2023-10) 9
- Memento: A Novel Learning Paradigm for Adaptive LLM Agents without Fine-tuning (2025-09) 9
- Memory in the Age of AI Agents: A Survey (2025-12) 9
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck (2026-03) 9
- xLSTM: Extended Long Short-Term Memory (2024-05) 9
- HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models (2024-05) 8
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (2024-10) 8
- Synapse: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation (2026-01) 8
- LightMem: Lightweight and Efficient Memory-Augmented Generation (2025-10) 8
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts (2024-02) 8
💡 Layered architectures organize memories by type, yet still retrieve within each layer via flat similarity search, which misses the associative links between related memories—a limitation addressed by Tree/Graph-based Memory, which connects knowledge through entity, temporal, and causal edges to enable multi-hop reasoning.
Tree/Graph-based Memory
What: Tree/graph-based memory systems organize an agent's long-term knowledge as interconnected nodes in graphs or hierarchical trees, enabling associative retrieval that mirrors how humans hop between related memories by topic, entity, time, or causality.
Why: Standard vector-store retrieval treats memories as isolated items ranked by semantic similarity, missing structural relationships such as causal chains, temporal sequences, and entity connections that are critical for complex reasoning tasks.
Baseline: The conventional approach is flat Retrieval-Augmented Generation (RAG), which encodes text passages as dense vectors and retrieves the top-k most semantically similar chunks to augment an LLM's context window.
- Similarity saturation: as memory grows, many items become semantically close, making top-k retrieval unreliable for distinguishing truly relevant memories
- Cross-document integration: relevant evidence is often scattered across multiple documents or sessions, requiring multi-hop reasoning that flat retrieval cannot perform in a single step
- Memory evolution: real-world knowledge changes over time, so memory structures must support efficient insertion, update, and deletion without full reconstruction
- Balancing structure and cost: building and maintaining rich graph structures (entity extraction, relation inference) is expensive and must justify its overhead over simpler approaches
🧪 Running Example
Baseline: Flat RAG retrieves the trip-booking conversation (highest semantic similarity to 'trip to Tokyo') but misses the family emergency message because it uses different vocabulary. The assistant cannot explain the reason for cancellation.
Challenge: The three memories are semantically distinct (travel planning, personal crisis, administrative action) but causally linked. Connecting them requires traversing entity links (user → trip → cancellation) and temporal ordering, which pure embedding similarity cannot capture.
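To make the traversal idea concrete, the sketch below runs Personalized PageRank by power iteration over a tiny hand-built memory graph. The node names, the edges, and the choice of "Tokyo trip" as the seed entity are invented for illustration; the damping factor and iteration count are conventional defaults rather than values from any cited system, and every seed is assumed to be a node in the graph.

```python
def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where the random walk restarts at the seed nodes."""
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbours in graph.items():
            share = damping * score[node] / len(neighbours)
            for other in neighbours:
                nxt[other] += share
        score = nxt
    return score


# Hypothetical memory graph: memories linked through shared entities and causes.
edges = [
    ("trip booking", "Tokyo trip"),
    ("cancellation", "Tokyo trip"),
    ("cancellation", "family emergency"),  # causal link
    ("gym chat", "gym membership"),        # unrelated memories
]
graph = build_graph(edges)
scores = personalized_pagerank(graph, seeds={"Tokyo trip"})
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:4])
```

The family-emergency memory receives a non-zero score through its link to the cancellation even though it shares no vocabulary with the query, whereas the disconnected gym memories stay at zero.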
📈 Overall Progress
The field evolved from flat vector retrieval to richly structured, multi-layered memory graphs with biologically-inspired dynamics like spreading activation, energy minimization, and self-evolving organization.
📂 Sub-topics
Knowledge Graph Memory with Graph Traversal
8 papers
Systems that extract entities and relations from text into knowledge graphs and use graph algorithms (PageRank, beam search, RL traversal) to retrieve interconnected memories for multi-hop reasoning.
Hierarchical and Tree-Structured Memory
5 papers
Systems that organize memories into hierarchical trees via clustering or dependency analysis, enabling top-down retrieval from abstract summaries to specific details.
Associative Memory Theory and Models
7 papers
Theoretical work connecting transformers, diffusion models, and in-context learning to classical associative memory frameworks like Hopfield networks, along with novel architectures built on these principles.
Cognitive-Inspired Multi-Component Memory Architectures
6 papers
Systems inspired by cognitive science that decompose memory into multiple specialized stores (episodic, semantic, procedural) connected via graph structures, with biologically motivated mechanisms like sleep consolidation and spreading activation.
💡 Key Insights
💡 Graph structure enables multi-hop retrieval in a single pass, replacing expensive iterative chain-of-thought retrieval pipelines.
💡 Disentangling memory into semantic, temporal, causal, and entity layers dramatically improves intent-aligned retrieval for different query types.
💡 Self-evolving memory that merges, prunes, and rewrites entries outperforms static append-only stores, especially for long-horizon agents.
💡 Spreading activation on memory graphs surfaces structurally relevant but semantically distant memories that vector similarity misses.
💡 Associative memory theory (Hopfield networks) provides principled foundations for understanding and improving in-context learning and continual adaptation.
💡 Cognitive science frameworks (constructivism, ACT-R, hippocampal indexing) consistently inspire the most effective memory architectures.
📅 Timeline
Early work established theoretical connections between attention and associative memory (2023-2024), which catalyzed practical graph-based retrieval systems like HippoRAG. By 2025, the field shifted toward cognitive architectures with multiple specialized memory components and self-evolution capabilities. The latest work (2026) focuses on disentangling relationship types into parallel graph layers and applying neuroscience-inspired dynamics for more principled retrieval.
- (ICL-Hopfield, 2023) established the theoretical equivalence between self-attention and Hopfield network associative memory, providing a principled framework for understanding in-context learning
- (VideoAgent, 2024) demonstrated structured dual-memory (temporal events + object tracking) for video understanding, achieving 26% accuracy improvement on long-form reasoning
- (HippoRAG, 2024) introduced neurobiologically-inspired graph retrieval using a knowledge graph index with Personalized PageRank, achieving up to 20% improvement on multi-hop QA while being 10-20x cheaper than iterative methods
- (Memory Mosaics, 2024) proposed networks of associative memories as a transparent alternative to transformers, matching perplexity while offering interpretable compositional capabilities
- (Embodied-RAG, 2024) introduced the Semantic Forest memory structure for robots, building hierarchical trees 7x faster than GraphRAG and handling kilometer-level environments
- (EMG, 2024) pioneered editable memory graphs with RL-driven traversal for personalized smartphones, supporting dynamic insertion, deletion, and replacement of memories
- (EHAM, 2024) extended entropic associative memory to hetero-associative tasks, achieving perfect recall of 40,000 associations in a single memory instance
- (A-Mem, 2025) introduced Zettelkasten-inspired self-evolving memory with LLM-generated inter-note links, improving multi-hop reasoning by 192% over MemGPT
- (Mem0, 2025) proposed dynamic memory management with graph enhancements for multi-session dialogue, reducing latency by 91% while improving personalization by 26%
- (Memory Mosaics v2, 2025) scaled associative memory networks to 10B parameters, outperforming transformers by 12-15% on multi-document QA tasks while matching performance on standard benchmarks
- (MIRIX, 2025) deployed a six-component multi-agent memory architecture achieving 35% higher accuracy than RAG baselines while reducing storage by 99.9%
- (CAM, 2025) applied Piaget's constructivist theory to agent memory with assimilation/accommodation mechanisms, running 4x faster than offline clustering baselines
- (AssoMem, 2025) fused graph-based importance, semantic relevance, and temporal alignment via adaptive mutual information weighting, outperforming baselines by 24.93%
- (GSW, 2025) modeled neocortical-hippocampal memory loops for episodic reasoning, outperforming HippoRAG2 by 20% in recall while reducing context tokens by 51%
- (MAGMA, 2026) introduced four disentangled graph layers (semantic, temporal, causal, entity) with intent-aware traversal, outperforming MemoRAG and Hi-Mem on long-context benchmarks
- (Synapse, 2026) unified episodic-semantic memory with spreading activation and lateral inhibition, reducing token consumption by 95% while achieving state-of-the-art on LoCoMo
- (Panini, 2026) replaced chunk-based retrieval with Generative Semantic Workspaces of atomic QA pairs and beam-search reasoning chains, achieving 5-7% gains over GraphRAG and HippoRAG
- (HyMEM, 2026) introduced self-evolving hybrid memory with a VLM Judge for GUI agents, enabling a 7B model to surpass proprietary systems like Gemini-2.5-Pro
- (Routing without Forgetting, 2026) applied Hopfield Pooling for energy-based associative routing in online continual learning, achieving 74.09% accuracy on Split-ImageNet-R with only 2.1% additional parameters
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Knowledge Graph Indexing with PageRank Retrieval | Build a knowledge graph as a hippocampal index and use activation-spreading algorithms to retrieve structurally connected memories that pure semantic similarity would miss. | Standard single-step dense retrieval (RAG) and expensive iterative retrieval methods (IRCoT) | HippoRAG (2024), AssoMem (2025), Panini (2026) |
| Multi-Graph Memory Architecture | Disentangle memory relationships into specialized graph layers so that retrieval can be steered by query intent rather than relying on a single similarity metric. | Monolithic knowledge graphs and single-vector-store memory systems | MAGMA (2026), Mem0 (2025), SGMem (2025) |
| Hierarchical Tree Memory with Top-Down Retrieval | Cluster memories into a navigable tree hierarchy so retrieval can start broad and zoom into relevant details, mimicking how humans organize knowledge from general to specific. | Flat retrieval over large memory pools and brute-force similarity search | Embodied-RAG (2024), CAM (2025), Chow–Liu Ordering for Long-Context Reasoning... (2026) |
| Self-Evolving Structured Memory | Let the LLM actively curate and evolve the memory graph rather than passively appending new entries, so the structure improves as the agent gains experience. | Append-only memory stores and static knowledge graphs that require manual curation | Agentic Memory (2025), Hybrid Self-evolving Structured Memory for... (2026), Crafting Personalized Agents through Retrieval-Augmented... (2024) |
| Spreading Activation on Episodic-Semantic Graphs | Replace static similarity ranking with dynamic energy propagation through a memory graph, so relevance is determined by structural connectivity rather than just vector distance. | Pure embedding-based retrieval and static graph-based retrieval without activation dynamics | Synapse (2026), The Generative Semantic Workspace: A... (2025) |
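
The Hierarchical Tree Memory row above describes retrieval that starts broad and zooms in. A minimal sketch of that descent is given below, assuming a hand-built tree, a word-overlap score standing in for embedding similarity, and greedy single-branch descent (a real system might keep several branches alive). All names and summaries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    summary: str
    children: list = field(default_factory=list)


def overlap(query, text):
    """Shared-word count; a crude stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(root, query):
    """Greedy top-down descent: follow the best-matching child at each level."""
    node = root
    path = [node.summary]
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node.summary)
    return path


root = Node("all memories", [
    Node("travel plans flight and hotel bookings", [
        Node("booked flight to tokyo in march"),
        Node("cancelled hotel in kyoto"),
    ]),
    Node("work projects and deadlines", [
        Node("quarterly report due friday"),
    ]),
])
path = retrieve(root, "when is my flight to tokyo")
print(" -> ".join(path))
```

Only one child per level is scored against the query, so the number of comparisons grows with the depth and branching factor of the tree rather than with the total number of stored memories.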
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | Weighted F1 / Accuracy | 40.5 Weighted F1 | Synapse (2026) |
| Multi-hop QA (MuSiQue / 2WikiMultiHopQA) | Recall@5 / F1 | Up to 20% improvement in R@5 | HippoRAG (2024) |
| MeetingQA / Similarity-Dense QA | Accuracy | 57.3% Accuracy | AssoMem (2025) |
⚠️ Known Limitations (5)
- Graph construction overhead: extracting entities and relations via LLM calls is expensive (both in latency and cost), making it impractical for real-time or resource-constrained applications. (affects: Knowledge Graph Indexing with PageRank Retrieval, Multi-Graph Memory Architecture, Self-Evolving Structured Memory)
Potential fix: MAGMA's dual-stream approach (fast path for immediate ingestion, slow path for asynchronous LLM-based densification) partially addresses this by decoupling latency-critical operations from expensive graph enrichment.
- Scalability of graph traversal: as memory graphs grow to hundreds of thousands of nodes, traversal algorithms like Personalized PageRank or beam search become computationally expensive and may return noisy results. (affects: Knowledge Graph Indexing with PageRank Retrieval, Spreading Activation on Episodic-Semantic Graphs)
Potential fix: Synapse uses lateral inhibition to suppress hub nodes, and region-based pruning (as in MAGMA) can limit traversal scope. Hierarchical tree approaches naturally reduce search space via top-down navigation.
- Evaluation fragmentation: there is no unified benchmark for tree/graph-based memory, making it difficult to compare methods fairly. Different papers report on different subsets of benchmarks with varying metrics. (affects: Knowledge Graph Indexing with PageRank Retrieval, Hierarchical Tree Memory with Top-Down Retrieval, Spreading Activation on Episodic-Semantic Graphs)
Potential fix: AssoMem introduced MeetingQA for similarity-dense scenarios, and LoCoMo has emerged as a common benchmark. Standardization around multi-dimensional evaluation (single-hop, multi-hop, temporal reasoning) would help.
- Error propagation in knowledge extraction: LLM-based entity and relation extraction is imperfect, and errors in the graph structure (missing edges, wrong relations) cascade into retrieval failures that are hard to diagnose. (affects: Knowledge Graph Indexing with PageRank Retrieval, Self-Evolving Structured Memory, Multi-Graph Memory Architecture)
Potential fix: HyMEM's VLM Judge approach (deciding add/merge/replace based on information gain) and Memory Bear's sleep-based consolidation offer self-correction mechanisms, but robust error detection in memory graphs remains an open problem.
- Domain specificity: most systems are validated on text-based QA or dialogue tasks. Transfer to multimodal domains (video, robotics, GUI interaction) requires substantial architectural adaptation. (affects: Hierarchical Tree Memory with Top-Down Retrieval, Associative Memory Networks (Hopfield-Inspired))
Potential fix: Embodied-RAG and HyMEM demonstrate that hybrid spatial-semantic clustering and visual embedding integration can bridge this gap, but general-purpose multimodal memory architectures remain underexplored.
📚 Major papers in this topic (10)
- HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models (2024-05) 8
- Hybrid Self-evolving Structured Memory for GUI Agents (2026-03) 9
- Synapse: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation (2026-01) 8
- MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents (2026-01) 8
- Memory Mosaics at scale (2025-07) 8
- CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension (2025-10) 8
- AssoMem: An Associative Memory Framework for Context-Aware Memory Recall (2025-12) 8
- Panini: Continual Learning in Token Space via Structured Memory (2026-02) 8
- The Generative Semantic Workspace: A Neuro-inspired Framework for Episodic Memory in LLMs (2025-12) 8
- Memorization to Generalization: Emergence of Diffusion Models from Associative Memory (2025-05) 8
💡 Where graph-based approaches store knowledge externally and retrieve it on demand, Memory Internalization takes a fundamentally different path by encoding personal memories and domain facts directly into model parameters through LoRA adapters or embedding injection, trading retrieval flexibility for faster inference and implicit pattern capture.
Memory Internalization
What: Memory internalization refers to techniques that store knowledge—personal memories, domain facts, or experiential data—directly into a language model's parameters, rather than relying on external retrieval or extended context windows.
Why: Parametric memory enables faster inference without retrieval latency, preserves user privacy by keeping data within local model weights, and can capture nuanced behavioral patterns that retrieval-based methods miss when context is noisy or irrelevant.
Baseline: The conventional approach uses Retrieval-Augmented Generation (RAG), which stores information externally and injects relevant passages into the prompt at inference time, or relies on large context windows to process user history directly.
- Catastrophic forgetting: injecting new memories into model parameters often overwrites previously stored knowledge
- Scalability: maintaining separate adapters or memory modules for each user or memory unit becomes resource-intensive as the number of memories grows
- Random access: language models can reproduce memorized information sequentially but struggle to access specific facts from arbitrary positions in stored memory
- Evaluation: measuring whether memories are truly internalized versus superficially memorized, and whether general capabilities are preserved after internalization
🧪 Running Example
Baseline: A standard LLM has no record of past conversations and cannot answer. A basic RAG system must search through all stored conversation logs, potentially retrieving irrelevant restaurant mentions or failing when the user's phrasing differs from the stored text.
Challenge: The assistant must recall a specific personal detail mentioned once in a past conversation, distinguish it from other restaurant mentions, and associate it with the temporal context of 'last month' and 'Rome trip'—requiring both precise memory storage and flexible retrieval from parameters.
📈 Overall Progress
The field evolved from static retrieval interpolation (kNN-LM, 2020) to dynamic, architecture-integrated memory systems that can continuously internalize, route, and even reverse knowledge updates without forgetting.
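For readers unfamiliar with the "retrieval interpolation" starting point, the kNN-LM idea can be written as a mixture of two next-token distributions. The notation below is chosen for this summary and is an assumption: $\lambda$ is the interpolation weight, $\mathcal{N}$ the set of retrieved (key, value) neighbours, $f(x)$ the context representation, and $d$ a distance function.

```latex
\begin{align}
p(y \mid x) &= \lambda \, p_{\mathrm{kNN}}(y \mid x) + (1 - \lambda) \, p_{\mathrm{LM}}(y \mid x), \\
p_{\mathrm{kNN}}(y \mid x) &\propto \sum_{(k_i, v_i) \in \mathcal{N}} \mathbb{1}[v_i = y] \, \exp\bigl(-d(k_i, f(x))\bigr)
\end{align}
```

As a worked example with made-up numbers: if $\lambda = 0.25$, $p_{\mathrm{kNN}}(y \mid x) = 0.8$, and $p_{\mathrm{LM}}(y \mid x) = 0.2$, then $p(y \mid x) = 0.25 \times 0.8 + 0.75 \times 0.2 = 0.35$. The model weights are never updated; memory enters only through the retrieved neighbours, which is the property later internalization methods give up in exchange for retrieval-free inference.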
📂 Sub-topics
LoRA-Based Personal Memory
8 papers
Uses Low-Rank Adaptation (LoRA) adapters to store user-specific or memory-specific knowledge directly in model parameters, enabling personalized responses without modifying the base model.
Latent-Space Memory Architectures
6 papers
Embeds memory directly into the transformer's latent space as trainable vectors or generated tokens, allowing the model to read and write memories through its own attention mechanism.
Retrieval-to-Parameter Knowledge Transfer
5 papers
Converts external retrieval-based knowledge (datastores, document collections) into model parameters through distillation or fine-tuning, combining the precision of retrieval with the speed of parametric inference.
Lifelong Model Editing and Continual Memory
5 papers
Focuses on sequentially updating model parameters with new knowledge while minimizing catastrophic forgetting, enabling models to accumulate memories over long lifetimes of operation.
💡 Key Insights
💡 Per-memory isolation (one frozen LoRA per fact) dramatically reduces catastrophic forgetting compared to shared-parameter fine-tuning.
💡 Combining parametric memory (LoRA) with non-parametric retrieval (RAG) consistently outperforms either approach used alone.
💡 Latent-space memory pools can self-update through attention but hit capacity limits requiring hierarchical CPU offloading solutions.
💡 Language models access memorized information sequentially; random access to specific stored facts remains a fundamental bottleneck.
💡 Generative memory that reconstructs context on demand outperforms static retrieval by producing task-specific cognitive context.
💡 Distilling retrieval into small parametric decoders enables plug-and-play domain adaptation across model scales with minimal latency.
📅 Timeline
Research shifted from treating memory as external retrieval (2020) to embedding it within model parameters via LoRA and latent pools (2024), and most recently toward scalable, forgetting-resistant architectures with gated routing, generative memory, and sparse residual approaches (2025–2026).
- kNN-LM (kNN-LM, 2020) established that interpolating nearest-neighbor retrieval with model outputs achieves state-of-the-art perplexity, demonstrating the power of explicit memory access without parameter updates.
- (DPeM, 2023) introduced dual-process memory with LoRA for medical assistant personalization, combining biologically-inspired memory tiers (working, short-term, long-term) with parameter-efficient fine-tuning.
- Talk2Drive (Talk2Drive, 2023) demonstrated the first end-to-end LLM-based personalization system in real-world autonomous driving, using a memory module that reduced driver takeover rates by 75.9%.
- (MemoryLLM, 2024) pioneered embedding a 1B-parameter memory pool directly within transformer layers, enabling self-updating knowledge injection with +13.6% accuracy on model editing benchmarks.
- (OPPU, 2024) formalized the one-PEFT-per-user paradigm, achieving state-of-the-art across all 7 LaMP personalization tasks by combining parametric and non-parametric knowledge.
- Memory3 (Memory3, 2024) introduced a three-tier memory hierarchy (text → sparse KV pairs → parameters) that enabled a 2.4B model to outperform Baichuan2-7B while being 1.66x faster than RAG.
- (Random Access, 2024) revealed a fundamental limitation: LMs can reproduce memorized content sequentially but fail at random access, identifying a critical bottleneck for parametric memory.
- (MemLLM, 2024) pioneered training LLMs to generate explicit read/write API calls to structured memory, making memory operations interpretable.
- M+ (M+, 2025) extended MemoryLLM's retention from 20k to 160k+ tokens by offloading evicted memory to CPU with a co-trained retriever, solving the capacity bottleneck of latent-space memory.
- (RAG-Tuned-LLM, 2025) demonstrated that GraphRAG-derived synthetic data can internalize document knowledge into a 7B model, achieving a 77.2% win rate over vanilla RAG on global queries.
- (MEGa, 2025) introduced per-memory LoRA with gated activation, maintaining >90% recall after 50 sequential memory injections while standard baselines collapsed to <10%.
- (MemGen, 2025) proposed generative latent memory with a metacognitive trigger, achieving +31.7% improvement on ALFWorld with strong cross-domain transfer from math to science and code.
- (MEMOIR, 2025) introduced sparse residual memory with TopHash retrieval, sustaining reliable editing through 15,000 sequential updates where all prior methods degraded.
- (Memory Decoder, 2025) distilled kNN retrieval into a plug-and-play 0.5B decoder that adapts LLMs up to 72B parameters with only 1.28x latency overhead.
- (SoLA, 2026) introduced semantic routing over frozen LoRA modules, enabling fully reversible model edits by simply deleting routing keys without affecting other stored knowledge.
- (LCA, 2026) solved the classifier-backbone mismatch problem in continual learning by aligning classifiers to merged PEFT modules using synthetic Gaussian samples, leading on 7 benchmarks.
- (RfR, 2026) formalized reflective MDPs where agents internalize experience through self-generated linguistic feedback and preference-based fine-tuning, outperforming both RL and prompt-based memory agents.
- (Autoencoder Memory, 2026) showed that autoencoder-trained embeddings achieve >99% memory reconstruction accuracy, vastly outperforming causal model embeddings (20-60%) for information retention.
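The kNN-LM entry at the start of this timeline describes interpolating nearest-neighbor retrieval with the model's own output distribution. The sketch below illustrates that idea on toy data, assuming a datastore of (key vector, next token) pairs and a fixed mixing weight. The names `knn_distribution`, `interpolate`, `lam`, and `temperature` are illustrative choices, not taken from the paper's code.

```python
import math

def knn_distribution(query, datastore, k=2, temperature=1.0):
    """Turn the k nearest (key, next_token) pairs into a token distribution."""
    # Squared Euclidean distance from the query to every stored key.
    scored = sorted(
        (sum((q - x) ** 2 for q, x in zip(query, key)), token)
        for key, token in datastore
    )[:k]
    # Closer neighbors receive exponentially larger weight.
    weights = [(math.exp(-dist / temperature), token) for dist, token in scored]
    total = sum(w for w, _ in weights)
    dist_out = {}
    for w, token in weights:
        dist_out[token] = dist_out.get(token, 0.0) + w / total
    return dist_out

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

# Toy example: the datastore remembers that this context was followed by "rome".
datastore = [([0.0, 0.0], "rome"), ([1.0, 1.0], "paris"), ([5.0, 5.0], "tokyo")]
p_lm = {"rome": 0.2, "paris": 0.5, "tokyo": 0.3}
p_knn = knn_distribution([0.0, 0.0], datastore, k=2)
p_final = interpolate(p_lm, p_knn, lam=0.25)
```

No parameters are updated here: the retrieved neighbors only shift probability mass toward tokens that followed similar contexts in the datastore.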
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Per-User LoRA Adapters | Give every user their own tiny set of trainable parameters that personalize a shared frozen model, making personalization modular and privacy-preserving. | Retrieval-augmented personalization (RAG), which fails when retrieved history is noisy or irrelevant, and profile-based prompting, which is limited by context window size. | Democratizing Large Language Models via... (2024), LLM-based (2023), On the Way to LLM... (2024), Parameterized Memory-injected LLM Personalization (2024) |
| Per-Memory LoRA with Gated Routing | Isolate each memory in its own frozen LoRA module and use query-driven routing to selectively activate relevant memories, which sharply reduces catastrophic forgetting across sequential updates. | Standard LoRA fine-tuning, which overwrites previous knowledge when trained on new data sequentially, and shared adapter methods that suffer from semantic drift. | MEGa (2025), Reversible Lifelong Model Editing via... (2026) |
| Latent-Space Memory Pools | Add a bank of trainable vectors inside the transformer that the model can read and update through attention, creating a self-contained memory system within the architecture. | External retrieval systems (RAG) that require separate infrastructure, and context-window approaches that are limited by fixed input length. | MemoryLLM (2024), M+: Extending MemoryLLM with Scalable... (2025), Adaptive Loops and Memory in... (2026) |
| Generative Latent Memory | Generate memory tokens dynamically through a separate module that reconstructs relevant context only when the reasoning process needs it, mimicking human recall as active reconstruction. | Static retrieval-based memory (which returns fixed passages) and direct parameter updates (which cause forgetting), by dynamically synthesizing task-relevant memories. | MemGen (2025) |
| Retrieval-to-Parameter Distillation | Compress the knowledge in a retrieval datastore into model weights so the model can produce retrieval-quality answers without actually performing retrieval at inference time. | Standard RAG, which incurs latency from nearest-neighbor search and requires maintaining large external datastores at inference time. | Generalization Through Memorization (2020), Memory Decoder (2025), Tuning LLMs by RAG Principles:... (2025), Memory3 (2024) |
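The per-memory LoRA row in the table above can be made concrete with a small sketch. It assumes each memory has a routing key and a frozen low-rank update, and that a memory is activated when the cosine similarity between the query embedding and its key passes a threshold. `MemoryAdapter`, `GatedMemoryBank`, and the threshold value are hypothetical names and settings for illustration, not the MEGa or SoLA implementations.

```python
import math
from dataclasses import dataclass

def cosine(u, v):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

@dataclass(frozen=True)
class MemoryAdapter:
    key: tuple   # routing key compared against the query embedding
    down: tuple  # low-rank "A" matrix, shape (rank, d_in)
    up: tuple    # low-rank "B" matrix, shape (d_out, rank)

    def delta(self, x):
        """Low-rank update B(Ax) contributed by this memory."""
        hidden = [sum(a * xi for a, xi in zip(row, x)) for row in self.down]
        return [sum(b * h for b, h in zip(row, hidden)) for row in self.up]

class GatedMemoryBank:
    def __init__(self, threshold=0.8):
        self.adapters = {}
        self.threshold = threshold

    def add(self, name, adapter):
        self.adapters[name] = adapter

    def remove(self, name):
        # Deleting the entry reverses the edit without touching other memories.
        self.adapters.pop(name, None)

    def route(self, query_emb):
        """Names of memories whose key is similar enough to the query."""
        return [n for n, a in self.adapters.items()
                if cosine(query_emb, a.key) >= self.threshold]

    def forward(self, base_out, x, query_emb):
        """Add the deltas of the activated memories to the frozen base output."""
        out = list(base_out)
        for name in self.route(query_emb):
            out = [o + d for o, d in zip(out, self.adapters[name].delta(x))]
        return out
```

Because every adapter is frozen after creation, adding a new memory never rewrites an old one. The cost is one adapter and one key per memory, which is the linear storage growth noted in the limitations.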
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LaMP Personalization Benchmark | Various (MAE, Accuracy, ROUGE-1) | +17.38% relative MAE improvement on LaMP-3 | Democratizing Large Language Models via... (2024) |
| Sequential Knowledge Retention | Recall Cosine Similarity / QA Accuracy | >90% recall cosine similarity after 50 sequential tasks | MEGa (2025) |
| Lifelong Model Editing (zsRE / SCOTUS) | Edit Reliability Rate (ERR) / Edit Success | High reliability maintained after 15,000 sequential edits | MEMOIR (2025) |
⚠️ Known Limitations (5)
- Catastrophic forgetting persists in sequential updates: even with LoRA, injecting many memories sequentially causes older memories to degrade unless each memory is fully isolated, which increases storage costs linearly. (affects: Per-User LoRA Adapters, Latent-Space Memory Pools)
Potential fix: Per-memory isolation (MEGa, SoLA) and sparse residual updates (MEMOIR) mitigate forgetting but at the cost of linear storage growth per memory unit.
- Scalability of per-memory modules: approaches creating a separate LoRA or memory entry per fact become resource-intensive as memory counts grow to thousands or millions, limiting real-world deployment. (affects: Per-Memory LoRA with Gated Routing, Explicit Structured Memory with API Access)
Potential fix: Memory compression, hierarchical routing, and shared adapter pools could reduce per-memory overhead while maintaining isolation benefits.
- Random access bottleneck: models can reproduce memorized content from the beginning but struggle to access specific facts at arbitrary positions, limiting the utility of parametric memory for precise fact lookup. (affects: Per-User LoRA Adapters, Latent-Space Memory Pools, Retrieval-to-Parameter Distillation)
Potential fix: Training with permuted sentence order and recitation-based inference partially address this, but a general solution for arbitrary random access remains open.
- Evaluation gaps: most work evaluates on synthetic or narrow benchmarks (fictional characters, curated QA pairs), and it remains unclear how well methods generalize to realistic, open-ended personalization at scale. (affects: Per-User LoRA Adapters, Per-Memory LoRA with Gated Routing, Latent-Space Memory Pools)
Potential fix: Development of comprehensive personalization benchmarks with realistic multi-turn conversations and long-term memory requirements over months of interaction.
- General capability degradation: fine-tuning for memory internalization can reduce performance on general NLP tasks (MMLU, commonsense reasoning), creating a memorization-generalization trade-off. (affects: Per-User LoRA Adapters, Retrieval-to-Parameter Distillation, Self-Reflective Parameter Updates)
Potential fix: Freezing base model weights and using modular adapters (OPPU, MEGa) or residual memory layers (MEMOIR) helps preserve general capabilities while adding new knowledge.
📚 View major papers in this topic (10)
- Generalization Through Memorization: Nearest Neighbor Language Models (2020-11) 9
- MemGen: Weaving Generative Latent Memory for Self-Evolving Agents (2025-09) 9
- MemoryLLM: Towards Self-Updatable Large Language Models (2024-02) 8
- M+: Extending MemoryLLM with Scalable Long-Term Memory (2025-02) 8
- Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models (2025-08) 8
- Memory3: Language Modeling with Explicit Memory (2024-07) 8
- MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention (2025-12) 8
- LCA: Local Classifier Alignment for Continual Learning (2026-03) 8
- Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning (2024-02) 7
- MEGa: Memory Embedded in Gated LLMs (2025-04) 7
💡 Regardless of whether memories are stored in external structures or internalized into parameters, unbounded accumulation eventually degrades performance, which is why Memory Consolidation and Compression develops methods for summarizing, deduplicating, and distilling memories into compact representations that preserve essential information.
Memory Consolidation and Compression
What: Memory consolidation and compression encompasses methods for summarizing, compressing, deduplicating, and distilling memories so that LLM-based agents can maintain compact, relevant memory stores across long-horizon interactions without exceeding context window limits.
Why: As LLM agents engage in extended conversations, multi-step reasoning, and document analysis, raw interaction history grows unboundedly, degrading performance and increasing cost. Effective memory consolidation enables agents to retain critical information while discarding redundancy, making long-term personalized AI assistants practical.
Baseline: The conventional approach simply appends all past interaction turns to the prompt (full-context prompting), which causes linear memory growth, increased latency, and performance degradation when context length exceeds training limits. Alternatively, naive retrieval-augmented generation (RAG) treats memory as a flat vector store with no lifecycle management.
- Deciding what to keep and what to discard: compression inevitably loses some information, and the system must learn which details are critical for future tasks
- Maintaining coherence across compression steps: repeated summarization can cause semantic drift, hallucination, or loss of causal and temporal relationships
- Balancing latency and quality: online compression must be fast enough for real-time interaction, while thorough reorganization requires expensive offline processing
- Scaling to diverse memory types: systems must handle episodic conversations, factual knowledge, procedural skills, and multimodal data under a unified compression framework
🧪 Running Example
Baseline: A standard LLM with a fixed context window only retains the most recent few sessions. The conversation from two weeks ago has been truncated, so the model responds 'I don't have information about that conversation.' Full-context prompting with all 50 past sessions would require ~100K tokens, exceeding limits and slowing inference.
Challenge: The relevant detail ('Sarah mentioned her new marketing role at Acme Corp') is buried in one of dozens of past sessions. Simple keyword retrieval may miss it if the user never said 'job' explicitly, and storing all raw transcripts is infeasible. The system must have compressed past sessions intelligently enough to retain this personal detail while discarding small talk.
📈 Overall Progress
Memory consolidation has shifted from static heuristic-based compression to learned, RL-optimized policies that jointly train task performance and memory management as a unified objective.
📂 Sub-topics
Gist and Summarization-Based Compression
5 papers
Methods that compress verbose conversation history or document content into compact natural-language summaries (gists), preserving key semantic content while dramatically reducing token count.
OS-Inspired Hierarchical Memory Management
4 papers
Architectures that borrow operating system concepts (virtual memory, paging, context switching, lifecycle management) to manage LLM memory across fast and slow storage tiers.
RL-Trained Memory Consolidation
3 papers
Approaches that use reinforcement learning to teach models what information to retain, compress, or discard, optimizing memory management as a learned policy rather than a fixed heuristic.
Representation-Level Compression
3 papers
Methods that compress memory at the representation level, including dynamic KV-cache merging, matrix-form memory cells, and visual rendering of code to reduce token counts.
Bio-Inspired and Cognitive Memory Architectures
3 papers
Systems that draw from neuroscience and cognitive science (hippocampal consolidation, episodic-semantic memory separation, Ebbinghaus forgetting curves) to design memory consolidation mechanisms.
💡 Key Insights
💡 RL-trained memory consolidation consistently outperforms fixed heuristics, with models learning task-specific compression strategies that balance recall and efficiency.
💡 Sleep-time offline consolidation (asynchronous reorganization between sessions) is a recurring pattern that dramatically reduces online latency.
💡 The OS memory hierarchy analogy (RAM/disk paging) has become a foundational paradigm, adopted by MemGPT, AIOS, MemOS, and IronEngine.
💡 Gist-based compression can paradoxically improve accuracy over full context by filtering distracting information that degrades attention.
💡 Multi-graph memory structures that disentangle temporal, causal, and semantic relationships enable intent-aware retrieval far superior to flat vector stores.
💡 Bio-inspired forgetting mechanisms (Ebbinghaus decay, active pruning) are essential for preventing unbounded memory growth in long-lived agents.
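As an illustration of the forgetting insight above, the sketch below uses the classic exponential form of the Ebbinghaus curve, R = exp(-t / S), where t is time since last access and S is a stability term. The systems summarized here do not state their exact formulas, so the `strength` default, the `boost` factor, and the pruning `threshold` are assumptions chosen for readability.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    last_access: float      # time of last access, in hours
    strength: float = 24.0  # hypothetical stability S, in hours

def retention(item, now):
    """Ebbinghaus-style retention R = exp(-t / S)."""
    return math.exp(-(now - item.last_access) / item.strength)

def touch(item, now, boost=2.0):
    """Accessing a memory resets its clock and makes it decay more slowly."""
    item.last_access = now
    item.strength *= boost

def prune(items, now, threshold=0.2):
    """Keep only memories whose retention is still above the threshold."""
    return [m for m in items if retention(m, now) >= threshold]

# Two memories stored at hour 0; only the first is accessed again at hour 24.
job = MemoryItem("Sarah started a marketing role at Acme Corp", last_access=0.0)
chat = MemoryItem("Small talk about the weather", last_access=0.0)
touch(job, now=24.0)
kept = prune([job, chat], now=48.0)
```

With these settings the unvisited memory falls below the threshold after two days, while the revisited one survives, so memory size tracks usage rather than growing without bound.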
📅 Timeline
The field evolved from MemGPT's foundational OS-inspired memory paging (2023) through production-grade cognitive architectures with sleep-time consolidation (2025), converging in 2026 toward RL-trained memory policies, structured multi-graph memory, and bio-inspired architectures that unify compression with reasoning.
- (MemGPT, 2023) pioneered virtual context management, treating the LLM context window as RAM and external storage as disk, achieving +60.4% accuracy on deep memory retrieval
- (ReadAgent, 2024) introduced human-inspired gist memory that extends effective context by up to 20x while outperforming retrieval baselines by 31.98% ROUGE-L on NarrativeQA
- (DMC, 2024) introduced dynamic KV-cache compression with per-head learned merge decisions, achieving 350-390% throughput gains on H100 GPUs
- (AIOS, 2024) extended the OS paradigm to multi-agent systems with syscall-based memory access and 2.1x throughput improvement
- xLSTM (xLSTM, 2024) revived LSTMs with exponential gating and matrix memory, outperforming Transformers on 99.5% of text domains in PALOMA
- (MemOS, 2025) formalized memory as a first-class OS resource with MemCube containers and automatic format transitions
- MEM1 (MEM1, 2025) demonstrated that RL can train models to maintain a single evolving internal state, improving performance 3.5x while reducing memory 3.7x
- (SUPO, 2025) jointly optimized task-solving and summarization via RL, achieving +14.0% success rate on BrowseComp-Plus with test-time scaling to 23 summary steps
- (LightMem, 2025) introduced sensory filtering and sleep-time consolidation, reducing token usage by 38x while improving accuracy by 29.3% on LoCoMo
- (Memory Bear, 2025) implemented active forgetting via Ebbinghaus decay curves and offline sleep-based memory reorganization
- (MAGMA, 2026) introduced multi-graph memory with four parallel relationship graphs and intent-aware retrieval, outperforming MemoRAG and Hi-Mem on LoCoMo
- (LongCodeOCR, 2026) replaced textual code compression with visual rendering, improving CompScore by 36.85 points while reducing compression latency from hours to minutes
- (MemPO, 2026) achieved +25.98% F1 gain using dual-reward RL that measures memory quality by its impact on answer correctness
- (MM-Mem, 2026) achieved state-of-the-art 63.8% on EgoSchema with pyramidal multimodal memory using fuzzy-trace theory and entropy-driven retrieval
- (AutoAgent, 2026) unified evolving cognition with elastic memory, compressing history into episodic abstractions and reusable skills
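The DMC entry in this timeline refers to a per-token decision between appending to the KV cache and merging into an existing slot. The sketch below shows one plausible merge rule, a running weighted average over the last slot. In the actual method the decision and the importance weight are learned per head. Here they are passed in by hand, and `update_cache`, `merge`, and `importance` are illustrative names.

```python
def update_cache(cache, weights, new_vec, merge, importance=1.0):
    """Append new_vec as a new slot, or merge it into the last slot.

    cache   -- list of slot vectors (stand-ins for keys or values)
    weights -- accumulated importance per slot, same length as cache
    merge   -- True to merge into the last slot, False to append
    """
    if merge and cache:
        z = weights[-1]
        # Weighted running average of everything merged into this slot so far.
        cache[-1] = [(z * old + importance * new) / (z + importance)
                     for old, new in zip(cache[-1], new_vec)]
        weights[-1] = z + importance
    else:
        cache.append(list(new_vec))
        weights.append(importance)

# Four tokens end up in two slots, a 2x compression for this toy sequence.
cache, weights = [], []
decisions = [False, True, False, True]
tokens = [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0], [0.0, 6.0]]
for vec, merge in zip(tokens, decisions):
    update_cache(cache, weights, vec, merge)
```

The compression ratio is simply the number of tokens divided by the number of slots, so it depends on how often the merge decision fires.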
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Gist-Based Memory Summarization | Replace raw text with compact, human-readable summaries at multiple abstraction levels, enabling 3-20x effective context expansion while preserving decision-critical information. | Full-context prompting (appending all raw history) and naive truncation (dropping oldest turns) | A Human-Inspired Reading Agent with... (2024), From Verbatim to Gist: Distilling... (2026), LightMem (2025), Memory Bear (2025) |
| Virtual Context Management | Apply OS concepts like virtual memory, paging, and context switching to give LLMs the illusion of unlimited memory within a fixed context window. | Fixed context windows with no memory management and stateless session-based interactions | MemGPT (2023), AIOS (2024), MemOS (2025), IronEngine (2026) |
| RL-Optimized Memory Consolidation | Train models via reinforcement learning to proactively manage their own memory, treating 'what to remember' as a learnable decision optimized for task success. | External memory modules with fixed heuristics and prompt-based summarization without task-aligned optimization | MEM1 (2025), Scaling LLM Multi-turn RL with... (2025), MemPO (2026) |
| Dynamic KV-Cache Compression | Replace the fixed 'always append' KV-cache update with a learned decision to either append or merge new tokens into existing cache slots, achieving 4-8x compression with minimal quality loss. | Standard KV-cache that grows linearly with sequence length and grouped query attention (GQA) | Dynamic Memory Compression (2024), xLSTM: Extended Long Short-Term Memory (2024) |
| Multi-Graph Structured Memory | Disentangle memory relationships into separate typed graphs so that retrieval can prioritize the right relationship type (temporal, causal, or semantic) based on query intent. | Monolithic vector stores that rely solely on semantic similarity for retrieval | MAGMA (2026), AutoAgent (2026) |
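The Virtual Context Management row above treats the context window as RAM and external storage as disk. A minimal sketch of that paging behavior follows, assuming whitespace word counts as a stand-in for tokens, oldest-first eviction, and keyword overlap as the recall function. `VirtualContext`, `budget_tokens`, and `recall` are hypothetical names, not the MemGPT API.

```python
from collections import deque

class VirtualContext:
    def __init__(self, budget_tokens):
        self.budget = budget_tokens
        self.context = deque()  # "RAM": messages currently in the prompt
        self.archive = []       # "disk": evicted messages kept for recall

    @staticmethod
    def _tokens(text):
        # Crude token estimate: whitespace-separated words.
        return len(text.split())

    def _used(self):
        return sum(self._tokens(m) for m in self.context)

    def append(self, message):
        """Add a message, paging out the oldest ones when over budget."""
        self.context.append(message)
        while self._used() > self.budget and len(self.context) > 1:
            self.archive.append(self.context.popleft())

    def recall(self, query, top_k=1):
        """Return archived messages that share the most words with the query."""
        q = set(query.lower().split())

        def overlap(message):
            return len(q & set(message.lower().split()))

        ranked = sorted(self.archive, key=overlap, reverse=True)
        return [m for m in ranked[:top_k] if overlap(m) > 0]
```

A short usage example: with `vc = VirtualContext(budget_tokens=8)`, appending a seven-word message and then a five-word one evicts the first to the archive, and `vc.recall("marketing role")` can page it back in if it mentions those words. Real systems would replace the keyword overlap with embedding search or model-issued function calls.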
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | Accuracy | 29.3% improvement over baselines | LightMem (2025) |
| NarrativeQA (Gutenberg) | ROUGE-L | 31.98% ROUGE-L improvement over retrieval baselines | A Human-Inspired Reading Agent with... (2024) |
| EgoSchema | Accuracy | 63.8% | From Verbatim to Gist: Distilling... (2026) |
⚠️ Known Limitations (5)
- Information loss during compression is difficult to predict: critical details may be discarded during summarization, and there is no reliable way to know what was lost until the information is needed later. (affects: Gist-Based Memory Summarization, RL-Optimized Memory Consolidation, Dynamic KV-Cache Compression)
Potential fix: Hierarchical gist systems (like ReadAgent and MM-Mem) mitigate this by allowing drill-down from summaries to raw text, and RL-based approaches learn to retain task-relevant details.
- Evaluation is fragmented across different benchmarks and domains, making it hard to compare methods fairly. Most papers evaluate on different tasks with different metrics, and no universal memory consolidation benchmark exists. (affects: Gist-Based Memory Summarization, Virtual Context Management, RL-Optimized Memory Consolidation)
Potential fix: Standardized benchmarks like LoCoMo and LongMemEval are emerging but still limited in scope; the field needs unified evaluation protocols covering conversation, reasoning, and multimodal memory tasks.
- Scalability to truly long-lived agents (months or years of interactions) remains unproven. Most evaluations cover hours to days of interaction, and it is unclear if compression strategies degrade gracefully over much longer horizons. (affects: Virtual Context Management, Gist-Based Memory Summarization, Multi-Graph Structured Memory)
Potential fix: Active forgetting mechanisms (Memory Bear's Ebbinghaus decay) and automatic memory format transitions (MemOS) are early steps toward long-horizon memory lifecycle management.
- Hallucination risk during consolidation: when models generate summaries or compress memories, they may introduce fabricated details or subtly alter facts, especially under aggressive compression ratios. (affects: Gist-Based Memory Summarization, RL-Optimized Memory Consolidation)
Potential fix: MM-Mem's entropy-driven retrieval drills down to raw data when uncertainty is high; SUPO's joint optimization teaches the model to write faithful summaries by directly penalizing downstream task failures.
- Computational overhead of memory management itself can be significant: maintaining multiple graphs, running offline consolidation, or performing RL training adds complexity that may negate throughput gains for smaller deployments. (affects: Multi-Graph Structured Memory, Bio-Inspired Memory Architecture, RL-Optimized Memory Consolidation)
Potential fix: LightMem's sensory filtering reduces overhead by 100x at test time, and IronEngine's hash-based deduplication provides a lightweight alternative to full model-based consolidation.
📚 View major papers in this topic (10)
- MemGPT: Towards LLMs as Operating Systems (2023-10) 9
- xLSTM: Extended Long Short-Term Memory (2024-05) 9
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents (2026-03) 9
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (2024-03) 8
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts (2024-02) 8
- LightMem: Lightweight and Efficient Memory-Augmented Generation (2025-10) 8
- MEM1: Memory-Efficient Mechanism via learning 1-step integrated reasoning and consolidation (2025-06) 8
- MemPO: Self-Memory Policy Optimization for Long-Horizon Agents (2026-02) 8
- MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents (2026-01) 8
- AIOS: LLM Agent Operating System (2024-03) 8
💡 While Memory Organization tackles the fundamental question of how to structure stored knowledge for efficient access, the real test of any memory architecture is whether it enables accurate recall when users need it—which is precisely what Memory Recall research evaluates through increasingly sophisticated benchmarks spanning conversational QA, multi-modal retrieval, and temporal reasoning.
Memory Recall
What: This topic covers methods for retrieving, managing, and utilizing information from past interactions, stored contexts, and long-term memory in LLM-based systems to answer user recall questions and support personalized, context-aware generation.
Why: As LLM agents handle increasingly complex, multi-session tasks, effective memory recall is essential for maintaining coherence, personalizing responses, and enabling users to retrieve information from their interaction histories.
Baseline: Conventional approaches either stuff the entire conversation history into the context window (which is limited and expensive) or use simple vector-similarity retrieval over stored interactions, which often misses nuanced or structurally complex queries.
- Scaling memory across long interaction histories without exceeding context window limits or losing critical details
- Retrieving the right information when queries require reasoning over multiple past events rather than simple keyword matching
- Balancing memory fidelity with privacy, as stored interactions may contain sensitive personal information
- Evaluating memory capabilities reliably, since existing benchmarks often test only shallow retrieval rather than complex recall
🧪 Running Example
Baseline: A standard LLM with vector-similarity retrieval searches for 'Thai restaurant Portland' and returns the most semantically similar stored interaction. It might retrieve a conversation about Thai food in general but miss the specific Portland trip context, or fail entirely if the conversation was months ago and the memory has been evicted.
Challenge: This query requires multi-hop recall (linking a trip, a restaurant recommendation, and a sentiment), temporal reasoning ('last summer'), and the ability to search across a potentially large history of interactions spanning months.
📈 Overall Progress
Memory recall has evolved from static test-time retrieval to structured, value-aware agent memory systems with dedicated evaluation frameworks.
📂 Sub-topics
Memory-Augmented Model Architectures
5 papers
Methods that integrate explicit memory mechanisms directly into transformer architectures to extend effective context and improve recall during generation.
Agent Memory Systems
4 papers
Long-term memory architectures for autonomous LLM agents that store, organize, and retrieve information from past agent-environment interactions.
Memory-Driven Personalization
4 papers
Methods that leverage stored user interaction histories and profiles to deliver personalized outputs, recommendations, and emotionally consistent responses.
Context Compression & Efficient Caching
4 papers
Techniques for reducing the computational and memory costs of processing long contexts by compressing, caching, or selectively attending to stored information.
Memory Evaluation & Benchmarks
4 papers
Frameworks, benchmarks, and simulators for systematically evaluating how well LLM systems can store, retrieve, and reason over information in memory.
Privacy & Safety in Memory
2 papers
Research on privacy risks arising from LLM memory systems and tools for auditing what personal information models can recall or infer.
💡 Key Insights
💡 Flat vector-similarity retrieval fails for agent memory; structured representations like causality graphs are needed for complex recall.
💡 Even frontier models like GPT-4o struggle with composite memory tasks requiring state tracking and multi-hop reasoning.
💡 Attention patterns remain stable within semantic spans, enabling dramatic speedups through slow-fast decoding strategies.
💡 Emotional and contextual signals significantly improve memory retrieval beyond pure semantic similarity.
💡 Proactive cache population during idle time outperforms reactive caching for mobile and latency-sensitive applications.
💡 Joint training of memory representations with the language model substantially outperforms test-time-only memory injection.
📅 Timeline
Early work focused on integrating memory into model training (TRIME) and establishing personalization benchmarks (LaMP). The field then shifted toward context efficiency (compression, caching) and rigorous evaluation frameworks, before converging on structured agent memory systems that preserve causal relationships and adapt retrieval strategies based on task utility.
- (TRIME, 2022) introduced joint training of language models with in-batch memory, reducing WikiText-103 perplexity from 18.70 to 15.37 and outperforming test-time-only approaches like kNN-LM
- (LaMP, 2023) established the first comprehensive personalization benchmark for LLMs with 7 diverse tasks, showing retrieval augmentation improves output quality by +23.5% over non-personalized baselines
- (ARM-RAG, 2023) proposed storing successful reasoning chains as retrievable memory with obfuscation-based retrieval, improving math problem solving by +4.2% on GSM8K
- (Entropy-Aware, 2024) identified attention entropy as the root cause of parallel context encoding failures and proposed shared attention sinks to restore recall accuracy from ~0% to near 100%
- (Selective Compression, 2024) preserved key information as raw tokens while compressing tool documentation at up to 16× ratio without performance loss
- (MemSim, 2024) introduced Bayesian-causal data synthesis for reliable memory evaluation, achieving >99% ground truth correctness while revealing that GPT-4 still struggles with aggregative and multi-hop recall
- (Emotional RAG, 2024) incorporated mood-congruent retrieval into role-playing agents, improving MBTI personality accuracy from 59.74% to 67.53%
- (PerCache, 2025) introduced predictive hierarchical caching for mobile RAG, reducing end-to-end latency by 34.4% through proactive query generation during idle time
- (Memory Framework, 2025) decomposed memory into atomic capabilities, revealing that even GPT-4o drops to ~45% accuracy on composite recall tasks like Theory of Mind
- (APC, 2025) shifted from query-level to task-level plan template caching, cutting agent costs by 50.31% and latency by 27.28% while preserving 96.6% performance
- (CAT, 2025) matched dense transformer quality while being 1.4–3× faster and 2–9× more memory efficient via parallel chunk compression with test-time adaptivity
- (AMA-Bench, 2026) revealed that existing memory systems significantly underperform on agentic tasks, with its AMA-Agent outperforming the strongest baselines by 11.16% via causality graph retrieval
- (EvoKernel, 2026) used Q-value-driven memory retrieval to boost NPU kernel correctness from 11% to 83%, demonstrating emergent cross-task memory transfer
- Prism-Δ (Prism-Δ, 2026) introduced dual-channel differential subspace steering for prompt highlighting, achieving +10.6% relative gain over the best prior baseline (SEKA)
- (SFI, 2026) achieved up to 14.4× throughput improvement via slow-fast decoding that refreshes sparse caches only at semantic boundaries
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Memory-Augmented Language Modeling | Train language models to jointly optimize context representations with external memory lookups, rather than adding memory only at test time. | Standard transformer language models limited to fixed context windows, and test-time-only memory methods like kNN-LM that suffer from representation misalignment. | Training Language Models with Memory... (2022), Compress & Attend Transformer (2025), Addressing Hallucinations in LLMs with... (2024) |
| Sparse Attention & Slow-Fast Decoding | Decouple generation into frequent low-cost steps using a fixed sparse cache and rare dense steps that refresh memory at natural semantic boundaries. | Full-KV attention baselines that redundantly recompute attention over the entire growing history at every decoding step. | Slow-Fast Inference (2026), Attention Entropy is a Key... (2024) |
| Structured Agent Memory | Replace flat similarity-based memory retrieval in agents with structured representations (graphs, templates, value functions) that capture causal and logical dependencies. | Standard vector-similarity retrieval (RAG) and semantic caching, which lose causal structure and fail on machine-generated, symbol-heavy agent logs. | AMA-Bench (2026), Agentic Plan Caching (2025), Towards Cold-Start Drafting and Continual... (2026) |
| Retrieval-Augmented Personalization | Personalize LLM outputs by retrieving and contextualizing relevant memories from a user's history, using richer signals than simple text similarity. | One-size-fits-all LLM generation that ignores individual user histories, and basic RAG systems that use only semantic similarity for retrieval. | LaMP (2023), Emotional RAG (2024), ARAG (2025) |
| Context Compression & Selective Caching | Preserve key information (names, parameters, critical spans) in raw form while aggressively compressing descriptive or redundant content into compact representations. | Full-context baselines that waste compute on redundant information, and naive compression methods that lose critical details like parameter names. | Concise and Precise Context Compression... (2024), PerCache (2025), Prism-Δ: Differential Subspace Steering for... (2026) |
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| WikiText-103 (Language Modeling) | Perplexity (lower is better) | 15.37 | Training Language Models with Memory... (2022) |
| AMA-Bench (Agent Memory) | Average Accuracy | 57.22% | AMA-Bench (2026) |
| LaMP (Personalization) | Relative Average Improvement over non-personalized baselines | +23.5% | LaMP (2023) |
⚠️ Known Limitations (5)
- Most memory evaluation benchmarks focus on text-only recall, leaving multi-modal memory (images, audio, video from past interactions) largely untested, which limits our understanding of how well systems handle visual or auditory recall. (affects: Memory Evaluation Frameworks, Retrieval-Augmented Personalization)
Potential fix: Extending evaluation frameworks to incorporate multi-modal interaction logs and cross-modal retrieval tasks, as hinted by Context-as-Memory's video-based approach.
- Privacy risks scale with memory capability: more effective memory systems store and can leak more sensitive personal information, creating a fundamental tension between utility and privacy. (affects: Structured Agent Memory, Retrieval-Augmented Personalization)
Potential fix: Differential privacy mechanisms for memory storage, user-controlled memory deletion, and periodic privacy audits using tools like LMP2.
- Structured memory approaches (causality graphs, plan templates) require domain-specific schema design, making them difficult to generalize across diverse agent applications without significant engineering effort. (affects: Structured Agent Memory, Context Compression & Selective Caching)
Potential fix: Automated schema induction from interaction logs, or hybrid approaches that combine structured and unstructured memory as in AMA-Agent's tool-augmented retrieval.
- Context compression methods trade off fidelity for efficiency, and the optimal compression ratio varies significantly across tasks, requiring careful tuning that may not transfer between applications. (affects: Context Compression & Selective Caching, Sparse Attention & Slow-Fast Decoding)
Potential fix: Adaptive compression that dynamically adjusts ratios based on task requirements, as demonstrated by CAT's test-time chunk size adaptivity.
- Over-reliance on AI memory systems may cause cognitive atrophy in users, reducing their own ability to recall and reason about information they have offloaded to the AI. (affects: Retrieval-Augmented Personalization, Memory-Augmented Language Modeling)
Potential fix: Designing memory interfaces that encourage active user engagement rather than passive consumption, such as prompting users to recall before revealing stored information.
📚 View major papers in this topic (10)
- Compress & Attend Transformer (2025-12) 9
- LaMP: When Large Language Models Meet Personalization (2023-04) 9
- Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis (2026-03) 9
- Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task (2025-06) 9
- AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications (2026-02) 8
- Training Language Models with Memory Augmentation (2022-11) 8
- Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability (2026-03) 8
- MemSim: A Bayesian Simulator for Automatic Evaluation of Memory in LLM-based Personal Assistants (2024-09) 8
- How Effectively Can AI Assistants Utilize Their Memory? A Framework for Extensive Evaluation (2025-03) 8
- Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents (2025-06) 8
💡 With the memory recall problem framed, we first examine Sparse Memory QA, where the core challenge is locating and aggregating relevant information that is thinly scattered across a large memory store, often requiring multi-hop reasoning over fragmentary evidence.
Sparse Memory QA
What: Sparse Memory QA addresses the challenge of answering questions when relevant information is distributed thinly across stored memories or knowledge representations, requiring selective retrieval and aggregation of scattered evidence.
Why: Real-world knowledge is inherently fragmented—personal memories, entity facts, and contextual details are spread across many sources, making it critical for systems to locate and combine sparse signals accurately.
Baseline: Standard language models encode knowledge in dense parameters and retrieve it implicitly during generation, often hallucinating when the needed fact is rare or absent; simple retrieval-augmented approaches concatenate retrieved passages but struggle when evidence must be assembled from multiple sparse sources.
- Locating the few relevant memory entries among a large pool of stored information, especially when queries use vague temporal or spatial cues
- Aggregating evidence across multiple sparse memory fragments to compose a coherent answer, rather than relying on a single retrieved passage
- Scaling memory access efficiently so that adding more stored knowledge does not proportionally increase computation cost
🧪 Running Example
Baseline: A standard retrieval-augmented LLM embeds the query and retrieves the top-k most semantically similar memory entries. It may return memories about Italian food or conferences in general but miss the specific visit because no single memory explicitly states all details together, and it cannot resolve 'last Tuesday' to a concrete date.
Challenge: The answer depends on combining at least two sparse memories—a photo taken at a restaurant (with an OCR-readable sign) and a calendar entry showing a conference on that date—while correctly interpreting the vague time reference 'last Tuesday.'
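The challenge above can be made concrete with a minimal sketch. It assumes that 'last Tuesday' means the most recent Tuesday strictly before the day the question is asked, and it uses a hypothetical in-memory store. The `last_weekday` and `memories_on` helpers, the record fields, and the sample entries are illustrative and are not taken from any of the cited systems.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbering: Monday=0 ... Sunday=6


def last_weekday(reference: date, weekday: int) -> date:
    """Most recent `weekday` strictly before `reference` (one assumed reading of 'last <day>')."""
    days_back = (reference.weekday() - weekday) % 7
    if days_back == 0:
        days_back = 7  # asked on that weekday itself: go back a full week
    return reference - timedelta(days=days_back)


def memories_on(day: date, store: list) -> list:
    """Collect every memory recorded on `day`, whatever its modality."""
    return [m for m in store if m["date"] == day]


# Hypothetical memory store; no single entry contains the full answer.
store = [
    {"date": date(2025, 1, 14), "kind": "photo", "text": "restaurant sign (OCR): Trattoria Roma"},
    {"date": date(2025, 1, 14), "kind": "calendar", "text": "conference, day 2"},
    {"date": date(2025, 1, 10), "kind": "photo", "text": "homemade pasta"},
]

asked_on = date(2025, 1, 15)  # a Wednesday
target_day = last_weekday(asked_on, TUESDAY)
evidence = memories_on(target_day, store)
kinds = sorted(m["kind"] for m in evidence)
```

Resolving the date first turns a vague cue into an exact filter, so the photo and the calendar entry are returned together even though neither is the closest semantic match on its own.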
📈 Overall Progress
Research evolved from supervised memory reading to fully differentiable sparse memory architectures that scale to millions of entities and personal multimodal memories.
💡 Key Insights
💡 Multi-hop attention over memory enables iterative reasoning that single-pass retrieval cannot achieve.
💡 Sparse entity-specific memory access can match or outperform models 10x larger in parameter count.
💡 Fixed vocabulary-based routing outperforms learned dynamic routing for knowledge-intensive tasks.
💡 Multimodal memory QA benefits greatly from offline metadata augmentation and multi-signal retrieval.
💡 Reducing supervision requirements (from labeled supporting facts to end-to-end training) dramatically broadens applicability.
📖 Show full analysis (timeline, methods, benchmarks)
📅 Timeline
Early work focused on making memory access differentiable (MemN2N); subsequent research embedded entity-specific knowledge into sparse model components (EAE, MoWE); the latest work extends sparse memory QA to multimodal personal settings with structured retrieval signals (Pensieve).
- MemN2N (MemN2N, 2015) introduced continuous multi-hop attention over external memory, eliminating the need for strong supervision and achieving 3.2% mean error on bAbI QA tasks
- (EAE, 2020) replaced dense parameter lookup with sparse entity-specific memory slots, achieving 43.2% EM on TriviaQA with 10x fewer parameters than T5-3B
- (Survey, 2023) provided a unified taxonomy of retrieval (sparse vs. dense) and generation (concatenation vs. fusion) strategies for memory-augmented models
- (MoWE, 2024) introduced fixed vocabulary-based routing to turn MoE experts into semantic memory slots, outperforming T5-XL on TriviaQA at 8.6x fewer FLOPs
- (Pensieve, 2025) combined offline memory augmentation with multi-signal retrieval (time, location, semantics), improving QA accuracy by up to 14% over standard multimodal RAG
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| End-to-End Memory Networks | Multiple soft-attention hops over memory allow iterative refinement of which sparse facts are relevant, without requiring supervision on which memories to read. | Original Memory Networks (Weston et al., 2015), which required explicit supervision labels for supporting facts at each layer | End-To-End (2015) |
| Entities as Experts | Replace dense parameter lookup with sparse, entity-specific memory slots that are activated only when the corresponding entity is mentioned. | Dense Transformer models (e.g., T5) that store all knowledge in shared parameters, requiring massive parameter counts to recall rare entity facts | Entities as Experts (2020) |
| Mixture-of-Word-Experts | Assign each word or entity to a dedicated FFN expert using a fixed vocabulary-based routing, turning MoE experts into semantic memory slots. | Standard MoE models with learned routing (e.g., GShard Top-2) that lack semantic specialization, and dense models (e.g., T5-XL/XXL) that require proportionally more FLOPs to scale | Memory Augmented Language Models through... (2024) |
| Pensieve | Pre-augment multimodal memories with structured metadata and retrieve using multiple explicit signals (time, location, semantics) rather than relying on a single embedding similarity. | Standard multimodal RAG pipelines that rely solely on semantic embedding similarity and cannot handle vague temporal or spatial references | Memory-QA (2025) |
| Taxonomy of Memory-Augmented LLMs | Organize the landscape of memory-augmented language models by their retrieval strategy (sparse vs. dense) and generation strategy (concatenation vs. fusion). | Ad-hoc descriptions of individual retrieval-augmented systems, which lacked a unifying categorization | Memory-Augmented (2023) |
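The multi-hop reading described for End-to-End Memory Networks can be illustrated with a small sketch. It is a simplification under stated assumptions: memory keys and values are given directly as toy vectors rather than learned embeddings, and the function name `multi_hop_read` is hypothetical. Each hop computes soft attention over all slots, reads a weighted sum of values, and adds it to the query state before the next hop.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def multi_hop_read(query, keys, values, hops):
    """Run `hops` rounds of soft attention; return the final state and per-hop weights."""
    state = list(query)
    trace = []
    for _ in range(hops):
        weights = softmax([dot(state, k) for k in keys])
        read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(state))]
        state = [s + r for s, r in zip(state, read)]  # updated query for the next hop
        trace.append(weights)
    return state, trace


# Toy memory: slot 0 links concept A to B, slot 1 links B to C, slot 2 is a distractor.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
values = [[0.0, 8.0, 0.0], [0.0, 0.0, 8.0], [0.0, 0.0, 0.0]]
state, trace = multi_hop_read([4.0, 0.0, 0.0], keys, values, hops=2)
focus = [w.index(max(w)) for w in trace]  # most-attended slot at each hop
```

In this toy setup the first hop concentrates on slot 0 and the second on slot 1, which is the chained lookup that a single retrieval pass would miss.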
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| TriviaQA (Open-Domain) | Exact Match (EM) | 44.8% | Memory Augmented Language Models through... (2024) |
| bAbI QA Tasks (10k training) | Mean Error Rate | 3.2% | End-To-End (2015) |
| MemoryQA | QA Accuracy | Up to 14% improvement over SOTA MM-RAG | Memory-QA (2025) |
⚠️ Known Limitations (4)
- Entity coverage dependency: Sparse entity memory methods (EAE, MoWE) require pre-defined entity vocabularies, so they cannot handle novel or unseen entities that emerge after training. (affects: Entities as Experts (EAE), Mixture-of-Word-Experts (MoWE))
Potential fix: Dynamic entity discovery and embedding expansion during inference, or periodic vocabulary updates with incremental training.
- Scalability of memory hops: Multi-hop memory reading (MemN2N) increases computational cost linearly with hop count, and determining the optimal number of hops for a given query remains an open problem. (affects: End-to-End Memory Networks (MemN2N))
Potential fix: Adaptive hop termination mechanisms that dynamically decide when sufficient evidence has been gathered.
- Offline augmentation bottleneck: Pensieve's approach requires pre-processing all memories with OCR, captioning, and metadata extraction, which may not scale to continuously growing personal memory stores in real time. (affects: Pensieve (Task-Oriented Memory Augmentation and Retrieval))
Potential fix: Incremental and streaming augmentation pipelines that process new memories as they arrive rather than in batch.
- Evaluation on narrow benchmarks: Most methods are evaluated on a small number of QA benchmarks (TriviaQA, bAbI, MemoryQA), making it unclear how well sparse memory approaches generalize to open-ended or conversational settings. (affects: Entities as Experts (EAE), End-to-End Memory Networks (MemN2N), Mixture-of-Word-Experts (MoWE), Pensieve (Task-Oriented Memory Augmentation and Retrieval))
Potential fix: Development of diverse, multi-domain evaluation suites that test sparse memory QA across conversational, long-form, and multi-turn settings.
📚 View major papers in this topic (5)
- End-To-End Memory Networks (2015-03) 9
- Entities as Experts: Sparse Memory Access with Entity Supervision (2020-11) 8
- Memory-QA: Answering Recall Questions Based on Multimodal Memories (2025-11) 7
- Memory Augmented Language Models through Mixture of Word Experts (2024-07) 7
- Memory-Augmented Large Language Models for Knowledge-Intensive Tasks (2023-09) 4
💡 When memory stores grow from sparse collections to dense, continuously captured streams of daily photos, calendar entries, and activity logs, the retrieval challenge inverts—Dense Memory QA must distinguish the correct memory from a sea of near-duplicates rather than hunting for scattered fragments.
Dense Memory QA
What: Dense Memory QA addresses question answering over large, highly similar personal memory stores—such as daily photos, calendar entries, and activity logs—where memories overlap significantly and contain many near-duplicates.
Why: As personal devices continuously capture vast streams of multimodal data, users need accurate answers to recall questions (e.g., 'What did I eat last Tuesday?'), but the sheer density and similarity of stored memories makes retrieval and reasoning extremely challenging.
Baseline: Standard multimodal RAG systems retrieve memories by semantic similarity alone and feed them to a language model, but they fail to exploit temporal/spatial signals and cannot handle noise from highly similar irrelevant memories or perform aggregation across heterogeneous data types.
- Retrieving the right memories when many entries are near-duplicates with only subtle temporal or spatial differences
- Leveraging vague temporal and location anchors (e.g., 'last week', 'at the mall') that require specialized parsing beyond semantic similarity
- Aggregating information across multiple heterogeneous memory sources (images, tables, text logs) that may exceed context limits
- Filtering out retrieval noise—irrelevant but semantically similar memories—without discarding genuinely useful context
🧪 Running Example
Baseline: A standard RAG system retrieves the top-k memories by semantic similarity to 'coffee Starbucks'. Because the user visits Starbucks daily, many near-identical receipts and photos are returned, often exceeding context limits or including irrelevant tea orders. The model cannot reliably count distinct events or resolve 'last month' to exact dates, producing an inaccurate answer.
Challenge: The user's memory store contains hundreds of highly similar Starbucks entries (receipts, photos, location check-ins). Many are near-duplicates from different days. The system must resolve 'last month' to a precise date range, deduplicate across modalities, and perform an aggregation (counting) that pure retrieval-then-generate pipelines struggle with.
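The three steps named in the challenge (resolving 'last month', deduplicating across modalities, and counting) can be sketched as follows. This is an illustration under loud assumptions: it treats all matching entries on the same calendar day as one visit, and the record fields, sample data, and helper names `last_month_range` and `count_distinct_days` are hypothetical.

```python
from datetime import date, timedelta


def last_month_range(today: date):
    """Return (first_day, last_day) of the calendar month before `today`."""
    first_of_this_month = today.replace(day=1)
    last_of_prev = first_of_this_month - timedelta(days=1)
    return last_of_prev.replace(day=1), last_of_prev


def count_distinct_days(entries, start, end, place, item):
    """Count days with at least one matching entry; duplicates on a day collapse."""
    days = {
        e["date"]
        for e in entries
        if start <= e["date"] <= end and e["place"] == place and e["item"] == item
    }
    return len(days)


# Hypothetical dense store: a receipt and a photo describe the same visit.
entries = [
    {"date": date(2025, 2, 3), "source": "receipt", "place": "Starbucks", "item": "coffee"},
    {"date": date(2025, 2, 3), "source": "photo", "place": "Starbucks", "item": "coffee"},
    {"date": date(2025, 2, 4), "source": "receipt", "place": "Starbucks", "item": "tea"},
    {"date": date(2025, 2, 10), "source": "check-in", "place": "Starbucks", "item": "coffee"},
    {"date": date(2025, 1, 30), "source": "receipt", "place": "Starbucks", "item": "coffee"},
    {"date": date(2025, 3, 1), "source": "receipt", "place": "Starbucks", "item": "coffee"},
]

start, end = last_month_range(date(2025, 3, 10))
visits = count_distinct_days(entries, start, end, "Starbucks", "coffee")
```

The count is computed by an explicit aggregation step rather than by the language model, which is why near-duplicate entries and out-of-range dates do not inflate it.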
📈 Overall Progress
Research has shifted from generic semantic retrieval to specialized pipelines that exploit temporal/spatial signals and recursive decomposition for dense personal memory QA.
💡 Key Insights
💡 Semantic similarity alone is insufficient for dense memory retrieval; temporal and spatial signals are essential.
💡 Offline metadata augmentation enables text-based reasoning that matches expensive vision-language model performance.
💡 Noise-injected training makes answer generators robust to irrelevant but semantically similar retrieved memories.
💡 Recursive question decomposition bridges the gap between structured SQL queries and unstructured text retrieval.
💡 Distillation to small on-device models enables private personal QA without sending data to cloud services.
💡 Dense personal data demands hybrid operators that combine retrieval, extraction, and aggregation in a unified pipeline.
📅 Timeline
In 2025, two complementary directions emerged: multi-signal retrieval with noise-robust generation for multimodal recall questions (Pensieve), and recursive analytical decomposition for complex queries over massive heterogeneous personal data (ReQAP). Both approaches recognize that standard RAG is insufficient for dense personal memory stores.
- (ReQAP, 2025) introduced recursive question decomposition with hybrid RETRIEVE and EXTRACT operators, enabling complex aggregation queries over 100K+ token heterogeneous personal data archives while supporting distillation to small on-device models
- (Pensieve, 2025) proposed task-oriented memory augmentation and multi-signal retrieval combining temporal, spatial, and semantic scoring, achieving up to 14% accuracy improvement over standard MM-RAG on the MemoryQA benchmark
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Task-Oriented Memory Augmentation and Multi-Signal Retrieval | Enrich memories with structured text metadata offline and retrieve using multiple explicit signals (time, location, semantics) rather than semantic similarity alone. | Standard multimodal RAG that relies solely on embedding-based semantic retrieval and expensive VLMs for visual reasoning | Memory-QA (2025) |
| Recursive Question Decomposition over Heterogeneous Data | Recursively break complex questions into a tree of retrieval, extraction, and aggregation operators that jointly handle structured and unstructured personal data. | Standard Text-to-SQL (which cannot handle unstructured text) and standard RAG (which cannot perform aggregations or handle 100K+ token archives) | Recursive Question Understanding for Complex... (2025) |
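Multi-signal retrieval, as described in the first row of the table, can be sketched as a weighted sum of separate time, location, and semantic scores. Everything specific here is an assumption made for illustration: the weights, the binary time and location scores, and the use of word-overlap (Jaccard) similarity as a stand-in for embedding similarity. It is not Pensieve's actual scoring function.

```python
from datetime import date


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used here as a stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def multi_signal_score(memory, query_text, date_range, location, weights=(0.4, 0.3, 0.3)):
    """Combine time, location, and semantic signals with assumed weights."""
    w_time, w_loc, w_sem = weights
    start, end = date_range
    time_score = 1.0 if start <= memory["date"] <= end else 0.0
    loc_score = 1.0 if memory["location"] == location else 0.0
    sem_score = jaccard(memory["text"], query_text)
    return w_time * time_score + w_loc * loc_score + w_sem * sem_score


# Two near-duplicate memories that differ only in date (hypothetical data).
recent = {"id": "recent", "date": date(2025, 2, 3), "location": "mall", "text": "bought shoes at the mall"}
older = {"id": "older", "date": date(2024, 11, 3), "location": "mall", "text": "bought shoes at the mall"}

window = (date(2025, 2, 1), date(2025, 2, 28))
ranked = sorted(
    [older, recent],
    key=lambda m: multi_signal_score(m, "bought shoes", window, "mall"),
    reverse=True,
)
top_score = multi_signal_score(recent, "bought shoes", window, "mall")
```

Semantic similarity alone would tie the two entries; the explicit time signal is what separates them.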
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MemoryQA | QA Accuracy | Up to +14% over SOTA MM-RAG | Memory-QA (2025) |
| PerQA | Accuracy on complex aggregation tasks | Significant improvement over Text-to-SQL baselines | Recursive Question Understanding for Complex... (2025) |
⚠️ Known Limitations (3)
- Reliance on metadata quality: Pensieve's augmentation pipeline depends on OCR and LLM-generated captions being accurate; errors in metadata propagate to retrieval and answer generation, particularly for low-quality images or ambiguous visual content. (affects: Pensieve)
Potential fix: Incorporating confidence scores for augmented metadata and falling back to visual reasoning for low-confidence entries.
- Scalability to very large personal archives: While ReQAP uses cascade pruning, recursive decomposition over tens of thousands of events may still face latency and cost challenges on resource-constrained devices. (affects: ReQAP)
Potential fix: Further pruning strategies, indexing, and caching intermediate decomposition results to reduce per-query computation.
- Limited evaluation scope: Each method is tested on its own benchmark (MemoryQA, PerQA) with no cross-evaluation, making it hard to compare their relative strengths on the same tasks or data distributions. (affects: Pensieve, ReQAP)
Potential fix: Developing unified benchmarks for dense personal memory QA that span multimodal recall and complex analytical queries.
💡 Effective recall from stored memory is necessary but not sufficient for autonomous agents—agents must not only retrieve relevant past experiences but actively use them to plan, coordinate with other agents, and continuously improve their behavior, which motivates the specialized memory architectures explored in Memory for Agentic Systems.
Memory for Agentic Systems
What: This topic covers memory systems designed for LLM-based agents that enable persistent state, experience accumulation, and adaptive behavior across interactions, going beyond the single context window.
Why: Without memory, LLM agents are stateless—they repeat mistakes, forget user preferences, and cannot learn from experience, severely limiting their utility for long-running, real-world tasks.
Baseline: The baseline approach treats the LLM's context window as the sole memory, stuffing all prior interactions and instructions into the prompt, leading to context overflow, information loss, and quadratic cost scaling.
- Context windows are finite and expensive, yet agents must reason over arbitrarily long interaction histories
- Memory must evolve over time without accumulating errors from hallucinations, drift, or adversarial poisoning
- Selecting what to remember, forget, or consolidate requires balancing relevance, recency, and cost
- Memory systems introduce new attack surfaces where adversaries can inject or corrupt stored knowledge
🧪 Running Example
Baseline: A stateless LLM has no memory of prior conversations. It asks the user to re-specify all preferences from scratch, ignores that the partner recently became vegan (mentioned 3 months ago), and cannot recall the budget discussed last week. The context window approach might try to stuff all past conversations into the prompt, but this exceeds token limits and becomes prohibitively expensive.
Challenge: The assistant must retrieve specific facts (vegan diet, Japanese cuisine preference, budget) from different past sessions, resolve conflicts (the partner was vegetarian before but switched to vegan recently), and ignore irrelevant memories (past discussions about lunch spots), all while keeping context costs manageable.
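The conflict-resolution and filtering steps in this challenge can be sketched with a simple "latest observation wins" rule. This is one assumed policy among many; the fact schema, the key names, the dates, and the budget value are hypothetical and chosen only to mirror the example.

```python
from datetime import date


def resolve_latest(facts):
    """Keep the most recently observed value for each key."""
    latest = {}
    for fact in sorted(facts, key=lambda f: f["seen"]):
        latest[fact["key"]] = fact["value"]  # later observations overwrite earlier ones
    return latest


def build_context(facts, wanted_keys):
    """Return only the resolved facts relevant to the current task."""
    latest = resolve_latest(facts)
    return {key: latest[key] for key in wanted_keys if key in latest}


# Hypothetical facts gathered across sessions.
facts = [
    {"key": "partner_diet", "value": "vegetarian", "seen": date(2024, 6, 1)},
    {"key": "partner_diet", "value": "vegan", "seen": date(2025, 1, 5)},
    {"key": "cuisine", "value": "Japanese", "seen": date(2024, 11, 20)},
    {"key": "dinner_budget", "value": "under $100", "seen": date(2025, 3, 28)},
    {"key": "lunch_spot", "value": "noodle bar near the office", "seen": date(2025, 2, 2)},
]

context = build_context(facts, ["partner_diet", "cuisine", "dinner_budget"])
```

Only three short facts reach the prompt, so context cost stays flat no matter how many past sessions exist. A production system would also need a policy for facts that should not simply be overwritten.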
📈 Overall Progress
Agent memory has evolved from simple context stuffing to OS-inspired hierarchical systems with formal governance, language-level safety guarantees, and learned eviction policies.
📂 Sub-topics
- Memory Architecture & Frameworks (7 papers): Core architectural patterns for agent memory, including memory hierarchies, hybrid storage systems, and language-level primitives for persistent state management.
- Context Engineering & Optimization (4 papers): Methods for efficiently managing, compressing, and adaptively curating the information environment in which agents operate, treating context as a scarce resource.
- Memory Security & Safety (4 papers): Threats, vulnerabilities, and governance frameworks for agent memory systems, including injection attacks, intent legitimation, and safety-governed memory evolution.
- Experience Accumulation & Workflow Learning (4 papers): Methods enabling agents to extract reusable knowledge from past interactions, including workflow induction, recursive processing, and simulation-based memory.
💡 Key Insights
💡 Context windows are cache, not memory—treating them as infinite storage wastes over 20% of tokens on structural overhead.
💡 Memory injection attacks succeed at 98% rates through normal queries alone, requiring no privileged access to the memory store.
💡 Reusable workflow extraction from past trajectories yields 50%+ success rate improvements over solving tasks from scratch.
💡 Personalization memory creates safety vulnerabilities: benign retrieved memories can increase attack success rates by up to 243%.
💡 Recursive self-invocation allows LLMs to process inputs two orders of magnitude beyond their native context window limits.
💡 Formal memory governance with ground-truth anchoring is essential to prevent compounding errors from hallucination drift.
📅 Timeline
Research progressed from benchmarking memory limitations (2024) through scaling and security analysis (2025) to production-grade systems treating context as a managed OS resource with formal safety guarantees (2026).
- (LoCoMo, 2024) established the first very long-term conversational memory benchmark (300+ turns), revealing that even frontier LLMs lag 56–73% behind humans on memory tasks
- (AWM, 2024) introduced workflow induction from agent trajectories, achieving +51.1% success rate improvement on WebArena through reusable parameterized action templates
- (AI Persona, 2024) redefined user profiles as dynamic learnable dictionaries continuously updated by an LLM-based optimizer rather than static RAG stores
- (RLMs, 2025) enabled processing inputs two orders of magnitude beyond context limits through recursive self-invocation, outperforming GPT-5 by 28.4%
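- (Generative Agents, 2025) scaled memory-equipped agents to 1,000 individuals, replicating participants' General Social Survey responses 85% as accurately as the participants replicated their own answers two weeks later (normalized accuracy of 0.85)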
- (MINJA, 2025) demonstrated query-only memory injection attacks with 98.2% success rate, exposing critical vulnerabilities in agent memory stores
- (ACE, 2025) introduced structured bullet-based context management with role decomposition, achieving +10.6% on agent benchmarks with 86.9% less latency
- (Memoria, 2025) combined SQL-based short-term logs with a recency-weighted knowledge graph for scalable personalized conversational memory
- (Pichay, 2026) applied OS virtual memory principles to LLM context, reducing context consumption by 93% in production with only 0.025% page fault rate
- (Turn, 2026) introduced a compiled language with memory isolation and typed inference as first-class primitives, enabling a multi-agent system in 89 lines of code
- (SSGM, 2026) formalized memory governance by decoupling memory evolution from verification, introducing ground-truth anchoring against semantic drift
- (Agent-Omit, 2026) trained agents via RL to adaptively omit redundant thoughts and observations, matching frontier model accuracy at 8B parameter scale
- (PS-Bench, 2026) revealed that personalization increases attack success rates by up to 243.7%, demonstrating that the safety of memory-enhanced agents requires dedicated benchmarks
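The workflow induction idea from the AWM entry in this timeline, reusable parameterized action templates, can be illustrated with a toy sketch. AWM induces workflows from agent trajectories; plain string substitution is swapped in here as a stand-in so the example stays self-contained. The step strings and the `induce_workflow` and `instantiate` helpers are hypothetical, and steps are assumed to contain no literal braces.

```python
def induce_workflow(steps, params):
    """Replace concrete values with named placeholders to obtain a reusable template."""
    template = []
    for step in steps:
        for name, value in params.items():
            step = step.replace(value, "{" + name + "}")
        template.append(step)
    return template


def instantiate(template, bindings):
    """Fill a template's placeholders for a new task."""
    return [step.format(**bindings) for step in template]


# A successful past trajectory (hypothetical).
trajectory = [
    "type 'red shoes' into the search box",
    "click the Search button",
    "open the first result for 'red shoes'",
]

workflow = induce_workflow(trajectory, {"query": "red shoes"})
new_plan = instantiate(workflow, {"query": "blue hat"})
```

Storing the template rather than the raw trajectory is what lets a later task reuse the strategy with different arguments.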
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Demand Paging for Context | Apply operating-system virtual memory principles—demand paging, eviction policies, and fault-driven pinning—to manage LLM context as a scarce cache resource. | Naive context stuffing, where all tool definitions, system prompts, and conversation history permanently occupy context regardless of usage | The Missing Memory Hierarchy: Demand... (2026) |
| Agentic Context Engineering | Decompose context management into granular bullets with role-separated adaptation and deterministic grow-and-refine merging to prevent information loss during updates. | Full-rewrite context adaptation methods like GEPA and Dynamic Cheatsheet that suffer from brevity bias and context collapse | Agentic Context Engineering (2025), Context Engineering (2026) |
| Agent Workflow Memory | Induce parameterized workflow templates from successful trajectories so agents can reuse proven strategies instead of solving every task from scratch. | Agents that solve each task independently without learning from prior experience, such as baseline ReAct-style approaches | Agent Workflow Memory (2024) |
| Recursive Language Models | Let the LLM programmatically decompose and recursively process long inputs via a code environment, treating itself as a callable function. | Vanilla long-context models that suffer from 'context rot' and compaction methods that lose critical details | Recursive Language Models (2025) |
| Cognitive Type Safety | Make memory isolation and context management compiler-enforced language invariants rather than fragile library conventions. | Framework-based approaches in Python/Rust where memory isolation, context bounds, and schema validation are application-level conventions prone to silent failures | Turn (2026) |
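The demand-paging row above can be made concrete with a minimal sketch of a context pager: items are loaded into a fixed token budget on first access, and the least recently used unpinned item is evicted when space runs out. This is not Pichay's implementation. The `ContextPager` class, its method names, and the token costs are assumptions, and pinning is explicit here rather than fault-driven.

```python
from collections import OrderedDict


class ContextPager:
    """Toy LRU pager over a token budget (illustrative only)."""

    def __init__(self, budget: int):
        self.budget = budget
        self.resident = OrderedDict()  # name -> token cost, least recently used first
        self.pinned = set()
        self.faults = 0

    def used(self) -> int:
        return sum(self.resident.values())

    def pin(self, name: str) -> None:
        """Mark an item as never evictable."""
        self.pinned.add(name)

    def access(self, name: str, cost: int) -> bool:
        """Touch an item; return True if it had to be paged in (a fault)."""
        if name in self.resident:
            self.resident.move_to_end(name)  # refresh recency on a hit
            return False
        self.faults += 1
        while self.used() + cost > self.budget:
            victim = next((n for n in self.resident if n not in self.pinned), None)
            if victim is None:
                raise MemoryError("item cannot fit within the context budget")
            del self.resident[victim]
        self.resident[name] = cost
        return True


pager = ContextPager(budget=100)
pager.pin("system_prompt")
pager.access("system_prompt", 40)
pager.access("tool_a", 30)
pager.access("tool_b", 30)
pager.access("tool_a", 30)  # hit: tool_a becomes most recently used
pager.access("tool_c", 30)  # fault: evicts tool_b, the LRU unpinned item
```

After these calls the context holds the pinned system prompt plus `tool_a` and `tool_c`, and unused definitions no longer occupy the window permanently.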
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| WebArena | Success Rate | +51.1% relative improvement | Agent Workflow Memory (2024) |
| LoCoMo (Very Long-Term Conversational Memory) | Accuracy (QA tasks) | 44% of human performance on memory QA | Evaluating Very Long-Term Conversational Memory... (2024) |
| OOLONG / OOLONG-Pairs (Long-Context Processing) | F1 Score | 58.0% F1 on OOLONG-Pairs | Recursive Language Models (2025) |
⚠️ Known Limitations (5)
- Memory systems lack standardized evaluation benchmarks, making it difficult to compare approaches or measure progress toward human-level memory capabilities (existing benchmarks show 56–73% gaps vs. humans). (affects: LoCoMo, Agent Workflow Memory, Generative Agent Architecture)
Potential fix: The MemoryArena and LoCoMo benchmarks represent early steps; the field needs unified benchmarks spanning episodic, semantic, and procedural memory.
- Persistent memory introduces novel attack surfaces where adversaries can inject malicious memories through normal interactions, and current defenses are insufficient against progressive injection strategies. (affects: MINJA, Memoria, AI Persona)
Potential fix: The SSGM framework proposes decoupling memory evolution from governance with verification protocols and ground-truth anchoring against an immutable observation ledger.
- Memory consolidation and summarization can cause 'semantic drift': repeated compression cycles gradually distort or lose critical details, leading to knowledge corruption over time. (affects: Agentic Context Engineering (ACE), SSGM Framework, AI Persona)
Potential fix: ACE uses deterministic grow-and-refine merging instead of full LLM rewrites; SSGM uses immutable observation ledgers for periodic reconciliation.
- Most memory architectures are evaluated on specific task domains (web navigation, conversation) and lack evidence of generalization across diverse agent applications and deployment environments. (affects: Agent Workflow Memory, Adaptive Omission (Agent-Omit), Demand Paging (Pichay))
Potential fix: Cross-domain evaluation like AWM's Mind2Web cross-domain tests and Agent-Omit's 5-benchmark evaluation represent initial efforts toward demonstrating generalization.
- Enterprise governance for memory-equipped agents remains immature: 75% of enterprises plan agent deployment within two years, but only 21% have mature governance models for managing persistent agent state. (affects: Context Engineering Pyramid, SSGM Framework)
Potential fix: The Pyramid of Agent Engineering and MAESTRO framework provide maturity models and layered threat analysis, but operational tooling remains sparse.
📚 View major papers in this topic (10)
- The Missing Memory Hierarchy: Demand Paging for LLM Context Windows (2026-03) 9
- Turn: A Language for Agentic Computation (2026-03) 9
- Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers (2026-03) 9
- Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (2025-10) 9
- Recursive Language Models (2025-01) 9
- Generative Agent Simulations of 1,000 People (2025-02) 9
- Context Engineering: From Prompts to Corporate Multi-Agent Architecture (2026-03) 9
- Agent Workflow Memory (2024-09) 8
- Memory Injection Attacks on LLM Agents via Query-Only Interaction (2025-03) 8
- Evaluating Very Long-Term Conversational Memory of LLM Agents (2024-02) 8
💡 To ground the discussion of agent memory in concrete design principles, we begin with Agentic Memory Architecture, which establishes the structural patterns for integrating episodic, semantic, and procedural memory modules into LLM-based agent systems.
Agentic Memory Architecture
What: Agentic Memory Architecture covers design patterns and frameworks for integrating memory modules—such as episodic recall, semantic knowledge stores, and experience banks—into LLM-based agent systems, enabling them to retain, retrieve, and reason over past interactions.
Why: Without structured memory, LLM agents treat every task in isolation, repeating past errors and failing to personalize or adapt over time. Memory architectures are essential for building agents that learn continuously and behave reliably across long-running sessions.
Baseline: The conventional approach is to rely on fixed-length context windows or simple retrieval-augmented generation (RAG), where raw interaction logs are stored in a vector database and retrieved verbatim at inference time, without any structured distillation or cognitive organization.
- Memory starvation and context flooding: agents either lose critical information as context windows overflow, or are overwhelmed by irrelevant retrieved content
- Shallow personalization: most systems mimic surface-level style rather than capturing latent user beliefs, preferences, and reasoning patterns
- Silent cognitive degradation: internal failures such as planner recursion and memory drift accumulate over time without triggering any explicit error signals
- Experience distillation: extracting generalizable reasoning strategies from raw interaction trajectories rather than simply storing logs
🧪 Running Example
Baseline: A standard RAG-based agent retrieves the most recent conversation snippets by embedding similarity. It returns generic counter-arguments to UBI without reflecting the user's established ideological lens or preferred argumentation depth, producing a response that feels impersonal and shallow.
Challenge: The agent must distinguish between episodic details (specific past debates the user had) and semantic traits (the user's core beliefs and reasoning style), while avoiding memory drift where hallucinated content from earlier sessions contaminates future responses.
📈 Overall Progress
Agent memory has evolved from flat retrieval buffers to cognitively-inspired architectures that separate memory types, distill reasoning strategies, and monitor for degradation.
💡 Key Insights
💡 Semantic memory (abstracted beliefs) outperforms episodic memory (raw recall) for robust user personalization.
💡 Agents suffer silent cognitive degradation from internal failures, not just external adversarial attacks.
💡 Distilling structured reasoning from past trajectories yields compounding performance gains over time.
💡 Memory-aware test-time scaling enables diverse exploration that improves both accuracy and efficiency.
💡 Cross-session memory poisoning is a real threat where hallucinated content persists across agent interactions.
💡 Cognitively-inspired memory separation mirrors human dual-memory systems and improves agent adaptability.
📅 Timeline
Research in mid-2025 converged on three complementary fronts: cognitive memory organization for personalization, resilience frameworks against internal memory failures, and experience distillation for continuous agent learning—collectively moving the field beyond simple RAG toward structured, self-improving memory systems.
- (PRIME, 2025) introduced a cognitive dual-memory framework that separates episodic and semantic memory for LLM personalization, demonstrating that semantic memory instantiations are more robust than episodic approaches for capturing user traits
- (QSAF, 2025) formalized Cognitive Degradation as a vulnerability class in agentic AI, identifying critical failures like planner entrapment and cross-session memory poisoning across LLaMA3, Mixtral, Claude, and ChatGPT
- (ReasoningBank, 2025) proposed memory-driven experience scaling with MaTTS, achieving +8.3% success rate on WebArena and +34.2% relative improvement on WebArena-Shopping through structured reasoning distillation
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Cognitive Dual-Memory Framework | Split personalization memory into episodic recall and semantic belief modeling, then use self-distilled reasoning traces to align outputs with internalized user traits. | Fragmented personalization approaches (retrieval-only or fine-tuning-only) that capture surface-level style rather than latent user beliefs | PRIME (2025) |
| Memory-Driven Experience Scaling | Extract structured reasoning strategies from both successes and failures, then use them at test time to guide diverse solution exploration with contrastive feedback. | Memory-free agents that treat every task in isolation and standard experience replay that stores raw logs without distillation | ReasoningBank (2025) |
| Cognitive Degradation Resilience | Model internal agent failures as a formal cognitive degradation lifecycle and deploy runtime behavioral controls that detect and mitigate silent drift before it causes system collapse. | Traditional external threat defenses (prompt injection filters) that ignore internally originating agent failures | QSAF (2025) |
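To make the episodic/semantic split in the table above concrete, here is a minimal sketch. It is not PRIME's implementation: the `DualMemory` class, its method names, and the word-overlap scoring (a stand-in for embedding similarity) are all illustrative assumptions, and the trait labels are supplied by hand where a real system would have to infer them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Toy dual store (hypothetical): raw episodes plus an aggregated trait summary."""
    episodes: list = field(default_factory=list)            # episodic: raw interactions
    trait_counts: Counter = field(default_factory=Counter)  # semantic: abstracted traits

    def record(self, text: str, traits: list) -> None:
        # Keep the raw interaction and fold its trait labels into the summary.
        self.episodes.append(text)
        self.trait_counts.update(traits)

    def recall_episodes(self, query: str, k: int = 1) -> list:
        # Rank episodes by word overlap with the query (stand-in for embeddings).
        words = set(query.lower().split())
        ranked = sorted(
            self.episodes,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return ranked[:k]

    def top_traits(self, k: int = 2) -> list:
        # Traits observed most often across sessions.
        return [trait for trait, _ in self.trait_counts.most_common(k)]

    def build_context(self, query: str) -> str:
        # Semantic traits come first so they frame the retrieved episode.
        traits = ", ".join(self.top_traits())
        episodes = " | ".join(self.recall_episodes(query))
        return f"User traits: {traits}\nRelevant history: {episodes}"
```

The design point is that the trait summary stays stable even when a single retrieved episode is off-topic, which is one plausible reading of why abstracted beliefs are reported as more robust than raw recall.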
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| WebArena | Success Rate | +8.3% over memory-free baseline | ReasoningBank (2025) |
| WebArena-Shopping (with MaTTS parallel scaling k=5) | Success Rate | +34.2% relative improvement | ReasoningBank (2025) |
| Change My View (CMV) | Personalization Accuracy | Best among all Semantic Memory instantiations | PRIME (2025) |
⚠️ Known Limitations (4)
- Scalability of memory distillation: extracting and indexing structured reasoning strategies from large trajectory histories becomes computationally expensive as the number of interactions grows, potentially limiting deployment in high-throughput settings. (affects: ReasoningBank + MaTTS)
Potential fix: Hierarchical memory compression and periodic consolidation of older strategies into more abstract summaries, similar to how human memory consolidates during sleep.
- Memory poisoning and drift: hallucinated or incorrect content stored in vector memory can propagate across sessions, corrupting the agent's knowledge base without any explicit error signal, making it extremely difficult to detect and correct. (affects: Cognitive Degradation Resilience (QSAF), Cognitive Dual-Memory Framework (PRIME))
Potential fix: QSAF proposes runtime behavioral controls that monitor entropy drift and trigger fallback logic; provenance tracking and periodic memory auditing could further mitigate this.
- Benchmark coverage for personalization: current benchmarks like CMV focus on debate-style persuasion, but real-world personalization spans diverse domains (e.g., shopping, coding assistance, healthcare), and it remains unclear how well dual-memory approaches generalize. (affects: Cognitive Dual-Memory Framework (PRIME))
Potential fix: Developing multi-domain personalization benchmarks that test latent belief modeling across diverse task types rather than a single genre.
- Lack of standardized evaluation for resilience: there is no widely accepted benchmark for measuring cognitive degradation or silent drift in agents, making it difficult to compare resilience frameworks objectively. (affects: Cognitive Degradation Resilience (QSAF))
Potential fix: Community-developed stress-test suites that systematically induce memory starvation, context flooding, and planner recursion under controlled conditions.
💡 Once an agent's memory architecture is established, the next question is how agents can actively learn from what they store, which is the focus of Experience Replay and Reflection—mechanisms for revisiting past execution traces, distilling lessons from successes and failures, and continuously improving future performance.
Experience Replay and Reflection
What: Experience replay and reflection encompasses mechanisms that enable AI agents to learn from past interactions by storing, retrieving, and reasoning over prior execution traces, successes, and failures.
Why: Without the ability to learn from accumulated experience, agents repeatedly make the same mistakes, waste computation rediscovering known solutions, and fail to improve over time—mirroring a worker who never keeps notes.
Baseline: Conventional agents use static pipelines with no persistent memory: each new task is approached from scratch, relying solely on the model's pretrained knowledge and in-context examples without any history of past trials.
- Catastrophic forgetting: replaying old experiences can interfere with learning new tasks, requiring careful scheduling of what and when to replay
- Memory scalability: storing full execution traces grows prohibitively expensive; agents must selectively curate which experiences to retain
- Cross-system knowledge transfer: experiences captured in one agent framework are typically incompatible with another, preventing collective learning
- Reflection quality: self-reflection can be shallow or hallucinatory unless grounded in verifiable evidence from actual execution outcomes
🧪 Running Example
Baseline: A baseline agent without experience replay starts from scratch on each competition, re-trying hyperparameter combinations and data-processing strategies that already failed in earlier tasks. It cannot recall that a similar feature-engineering approach worked well in competition #2.
Challenge: The agent must balance retaining useful insights from all five prior competitions (stability) while adapting to the new competition's unique requirements (plasticity), without its memory growing unboundedly.
📈 Overall Progress
Experience replay has evolved from isolated per-task memory to persistent, cross-framework knowledge systems that enable agents to continuously self-improve across runs and architectures.
💡 Key Insights
💡 Agents that persist and retrieve past experiences dramatically outperform stateless systems that start each task from scratch.
💡 Cognitive science principles like spaced repetition and forgetting curves transfer effectively to LLM continual learning.
💡 Cross-framework memory sharing unlocks collective intelligence that no single agent architecture can achieve alone.
💡 Structured reflection integrated into the reasoning loop is far more effective than post-hoc self-correction patches.
💡 Separating memory into distinct stores (ideation vs. experimentation, short-term vs. long-term) improves both plasticity and stability.
💡 Selective memory retrieval—feeding only relevant experiences—prevents context overflow and reasoning disruption.
📅 Timeline
Early work (mid-2025) focused on structuring how agents store and retrieve past experiences—via tree-structured exploration, cross-framework schemas, and reflection loops. By early 2026, the field shifted toward principled replay scheduling inspired by cognitive science and multi-agent evolution with dual memory systems.
- (ML-Master, 2025) reformulated AI development as Monte Carlo Tree Search with adaptive memory, achieving 29.3% medal rate on MLE-Bench—surpassing the prior best of 22.4%
- (AGENTKB, 2025) introduced a universal cross-framework memory layer enabling +18.7pp improvement on GAIA and +17.0pp on SWE-bench Lite through collective experience sharing
- (Reflection-Driven, 2025) elevated self-reflection to a first-class internal control circuit with Plan–Reflect–Verify, grounding corrections in verified past repairs
- (EvoScientist, 2026) introduced dual persistent memories with an Evolution Manager for scientific discovery, achieving 100% paper acceptance at ICAIS 2025, including a Best Paper Award
- (MSSR, 2026) modeled per-sample memory strength via Ebbinghaus forgetting curves for continual LLM fine-tuning, outperforming baselines across 3 backbone models on 11-task sequences
- (ARROW, 2026) combined short-term and long-term replay buffers with reservoir sampling for continual RL, achieving 4x less forgetting on Atari benchmarks
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Persistent Memory Evolution | Agents maintain evolving long-term memories that are continuously updated with distilled insights from every interaction, so future runs start from an increasingly informed baseline. | Static agent pipelines that treat each run independently with no cross-run learning | EvoScientist (2026), ML-Master (2025) |
| Adaptive Replay Scheduling | Model each sample's forgetting risk and prioritize replay for rapidly fading knowledge, mirroring how human memory benefits from spaced repetition. | Fixed-interval and random replay strategies that waste compute on already-consolidated knowledge or react too late to forgetting | MSSR (2026), ARROW (2026) |
| Cross-Framework Experience Transfer | Unify execution traces from incompatible agent frameworks into a shared memory layer so agents can learn from collective experience across systems. | Framework-specific memory systems where knowledge is trapped within individual agent architectures | AGENTKB (2025) |
| Reflection-Driven Control | Make reflection an explicit, structured step in the agent's generation loop rather than an afterthought, using verified past repairs to ground self-correction. | Post-hoc safety patches and unstructured self-reflection that lack integration into the agent's internal reasoning process | Reflection-Driven (2025) |
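The Adaptive Replay Scheduling row can be illustrated with the textbook exponential forgetting curve, R = exp(-t / S), where t is the time since the last replay and S is a per-sample memory strength. This is a sketch under assumptions: the text does not give MSSR's exact formulation, and the `Sample` fields, the `boost` factor, and the update rule in `replay` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    strength: float    # hypothetical memory strength S (larger = slower forgetting)
    last_replay: int   # training step at which the sample was last replayed


def retention(sample: Sample, now: int) -> float:
    # Exponential forgetting curve: R = exp(-elapsed / S).
    return math.exp(-(now - sample.last_replay) / sample.strength)


def pick_for_replay(samples: list, now: int, budget: int) -> list:
    # Spend the replay budget on the samples with the lowest estimated retention.
    return sorted(samples, key=lambda s: retention(s, now))[:budget]


def replay(sample: Sample, now: int, boost: float = 1.5) -> None:
    # Assumed update rule: a replay resets the clock and strengthens the memory,
    # so the next replay of this sample can be spaced further out.
    sample.last_replay = now
    sample.strength *= boost
```

Compared with fixed-interval replay, this ordering skips samples whose retention is still high and reaches fast-fading samples earlier.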
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MLE-Bench | Average Medal Rate | 29.3% | ML-Master (2025) |
| GAIA | Pass@3 | 73.9% | AGENTKB (2025) |
| SWE-bench Lite | Pass@100 | 45.7% | AGENTKB (2025) |
⚠️ Known Limitations (4)
- Memory curation overhead: deciding what to store, when to forget, and how to index experiences adds significant engineering complexity and computational cost, especially as interaction histories grow. (affects: Persistent Memory Evolution, Adaptive Replay Scheduling, Cross-Framework Experience Transfer)
Potential fix: Reservoir sampling and memory-strength modeling can automate curation, but optimal forgetting policies remain an open problem.
- Reflection hallucination: when self-reflection is not grounded in verifiable execution outcomes, agents may generate plausible but incorrect diagnoses of their failures, compounding errors. (affects: Reflection-Driven Control, Persistent Memory Evolution)
Potential fix: Grounding reflection in dual-layer memory (dynamic past repairs + static standards) and routing only risky outputs through reflection, as proposed by Reflection-Driven Control.
- Cross-framework schema fragility: abstracting execution traces into a universal schema risks losing framework-specific details that are critical for reproducing successful workflows in their original context. (affects: Cross-Framework Experience Transfer)
Potential fix: The disagreement gate in AGENTKB partially addresses this by filtering out retrieved knowledge that conflicts with the agent's current reasoning, but richer schema representations may be needed.
- Evaluation generalizability: most methods are evaluated on specific benchmark suites (Atari, MLE-Bench, GAIA), and their effectiveness in open-ended, real-world deployment scenarios remains largely unvalidated. (affects: Adaptive Replay Scheduling, Persistent Memory Evolution, Cross-Framework Experience Transfer)
Potential fix: Broader evaluation across diverse task distributions and longer time horizons would strengthen confidence in these methods' practical utility.
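Reservoir sampling, mentioned above as one way to automate memory curation, keeps a fixed-size uniform sample of an unbounded experience stream. The sketch below is the classic Algorithm R, not ARROW's actual buffer; the `ReservoirBuffer` name and its fields are assumptions made for illustration.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform random sample of everything seen."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # total number of experiences offered
        self.rng = random.Random(seed)     # seeded for reproducibility

    def add(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the experience.
            self.items.append(item)
            return
        # Buffer full: keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen slot.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item
```

Memory stays bounded regardless of how long the agent runs, at the cost of treating all experiences as equally worth keeping; a strength-aware policy would bias that choice.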
📚 View major papers in this topic (6)
- EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery (2026-03) 9
- AGENTKB: Leveraging Cross-Domain Experience for Agentic Problem Solving (2025-07) 9
- ML-Master: Towards AI-for-AI via Integration of Exploration and Reasoning (2025-06) 8
- MSSR: Memory-Aware Adaptive Replay for Continual LLM Fine-Tuning (2026-03) 7
- ARROW: Augmented Replay for RObust World models (2026-03) 7
- Reflection-Driven Control for Trustworthy Code Agents (2025-12) 7
💡 The refined strategies and procedural knowledge distilled through experience replay naturally feed into Memory-Augmented Planning, where agents retrieve and leverage their accumulated experiences to make better decisions and execute multi-step plans more efficiently.
Memory-Augmented Planning
What: Memory-augmented planning studies how AI agents can store, retrieve, and leverage past experiences—such as solved tasks, interaction histories, and procedural skills—to improve future planning and decision-making, rather than reasoning from scratch each time.
Why: Without persistent memory, agents waste computation rediscovering solutions to previously encountered problems and cannot transfer hard-won knowledge across tasks, domains, or frameworks. Memory-augmented planning is essential for building agents that improve over time and operate efficiently in real-world settings.
Baseline: Baseline agents treat each task independently, relying solely on the current prompt context and tool documentation. They have no mechanism to recall prior successes or failures, leading to repeated mistakes and inefficient exploration.
- Deciding what to remember: filtering useful experience from noise without exceeding storage or retrieval budgets
- Transferring knowledge across heterogeneous agent frameworks and domains without introducing conflicting or stale information
- Balancing short-term task-specific context with long-term generalizable knowledge to avoid overfitting to past solutions
- Retrieving relevant memories at the right time during multi-step planning under latency and compute constraints
🧪 Running Example
Baseline: A standard agent reads the error log and attempts to resolve the conflict from scratch. It may try several incorrect approaches (e.g., pinning the wrong version, removing a needed dependency) because it has no memory of how similar conflicts were resolved previously, wasting multiple iterations.
Challenge: The solution exists in execution traces from a different agent framework (e.g., OpenHands solved a similar conflict last month), but that knowledge is siloed. Even within the same framework, the agent's previous successful fix was lost when the context window cleared.
📈 Overall Progress
Memory-augmented planning has evolved from biologically inspired dual-memory learning to universal cross-framework experience sharing and deployment-ready predictive memory systems.
📂 Sub-topics
Cross-Framework Experience Transfer
3 papers
Methods for abstracting, storing, and reusing agent execution traces and procedural knowledge across different agent architectures and task domains.
Dual-Memory Learning Architectures
2 papers
Approaches that separate agent memory into short-term (within-episode) and long-term (cross-episode) stores, inspired by biological memory systems, to balance exploration depth with experience breadth.
Predictive and Domain-Structured Memory
4 papers
Systems that organize memory around domain-specific structures (clinical records, social interaction histories, hardware profiles) or use predictive pre-fetching to reduce retrieval latency during planning.
💡 Key Insights
💡 Cross-framework memory sharing yields double-digit accuracy gains by breaking knowledge silos between agent architectures.
💡 Biologically inspired dual memory (short-term + long-term) enables a 7B-parameter model to outperform GPT-4 on tool-use accuracy (76.8% vs. 60.8%).
💡 Self-generated skills often degrade performance; curated skill libraries are significantly more reliable.
💡 Predictive memory pre-fetching can speed up retrieval by 316x on cache hits for real-time voice applications.
💡 Domain-structured memory (e.g., clinical document trees) outperforms generic vector stores in safety-critical settings.
💡 Standardized evaluation protocols are essential—unstandardized agent comparisons produce high-variance, unreproducible results.
📅 Timeline
Early work (2024) focused on giving individual agents short-term and long-term memory inspired by human cognition. By mid-2025, the field shifted toward breaking down memory silos across agent frameworks with universal experience layers. The latest work (2026) emphasizes formalization of reusable procedural skills, domain-structured memory for safety-critical deployments, and latency-optimized memory access for real-time applications.
- (STE, 2024) introduced biologically inspired simulated trial-and-error with short-term and long-term memory, enabling a 7B parameter model to surpass GPT-4 on tool-use accuracy (76.8% vs 60.8%)
- (MAGS, 2025) extended dual-memory to multi-agent feature engineering, using a Router agent with short-term trajectory refinement and long-term demonstration retrieval
- (OAgents, 2025) demonstrated that modular adaptive memory with periodic plan revision achieves state-of-the-art on GAIA among open-source agent frameworks
- (AGENTKB, 2025) introduced a universal cross-framework memory layer, improving GAIA pass@3 by 18.7 percentage points and SWE-bench Lite pass@100 by 17.0 percentage points
- (Social-RAG, 2025) treated group interaction history as a social knowledge base, successfully deploying in 18 Slack channels with 500+ researchers
- (Agentic Skills, 2026) formalized skills as 4-tuple persistent memory modules with a 7-stage lifecycle, showing curated skills improve pass rates by 16.2 percentage points while self-generated skills can degrade performance
- (VoiceAgentRAG, 2026) introduced predictive memory pre-fetching with a dual-agent architecture, achieving 316x retrieval speedup on cache hits for voice AI
- (AOSH, 2026) replaced vector embeddings with Page-Indexed Memory for secure clinical agent deployment with least-privilege execution
- (HeRo, 2026) optimized agentic RAG memory access patterns on mobile SoCs, reducing end-to-end latency by up to 10.94x
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Cross-Framework Experience Replay | Unify agent experiences from heterogeneous frameworks into a single, framework-agnostic memory that any agent can query during planning. | Framework-specific memory systems that trap knowledge within individual agent architectures | AGENTKB (2025) |
| Simulated Trial and Error | Let agents learn tool use through simulated practice with biologically inspired short-term and long-term memory, rather than from static documentation alone. | Documentation-based tool learning and standard supervised fine-tuning on tool-use examples | LLMs in the Imaginarium (2024), Agentic Feature Augmentation (2025) |
| Reusable Agentic Skill Formalization | Formalize agent procedures as reusable, self-contained skill modules with explicit conditions for when and how to apply them. | Ad-hoc planning where agents re-derive execution strategies from scratch for every recurring task | SoK: Agentic Skills (2026), OAgents (2025) |
| Predictive Memory Pre-fetching | Use idle time during the current conversational turn to speculatively retrieve and cache documents the agent will likely need for future turns. | Standard synchronous RAG retrieval that blocks response generation with 50–300ms lookup latency | VoiceAgentRAG (2026) |
| Domain-Structured Memory Systems | Structure agent memory to mirror domain-specific information organization rather than relying on generic embedding-based retrieval. | Generic vector-embedding memory that lacks domain-aware organization and auditability | When OpenClaw Meets Hospital: Toward... (2026), Social-RAG (2025), HeRo (2026) |
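The Predictive Memory Pre-fetching row describes filling a cache during idle time so that later lookups avoid the slow retrieval path. A minimal sketch follows, assuming exact-match topic keys and a caller-supplied `slow_retrieve` function; both are simplifications, the `PrefetchingRetriever` class is hypothetical, and none of this is taken from VoiceAgentRAG's code.

```python
from typing import Callable


class PrefetchingRetriever:
    """Cache in front of a slow retriever, filled speculatively (illustrative only)."""

    def __init__(self, slow_retrieve: Callable[[str], list]):
        self.slow_retrieve = slow_retrieve   # stands in for a vector-store lookup
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, predicted_topics: list) -> None:
        # Intended to run while the agent is idle: fetch documents for
        # topics the conversation is predicted to reach next.
        for topic in predicted_topics:
            if topic not in self.cache:
                self.cache[topic] = self.slow_retrieve(topic)

    def get(self, topic: str) -> list:
        if topic in self.cache:
            self.hits += 1                   # fast path: no retrieval latency
            return self.cache[topic]
        self.misses += 1                     # wrong prediction: full retrieval cost
        docs = self.slow_retrieve(topic)
        self.cache[topic] = docs
        return docs
```

The hit and miss counters make the trade-off measurable: every miss pays the full lookup latency, and every unused prefetch is wasted work.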
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| GAIA | Pass@3 accuracy | 73.9% | AGENTKB (2025) |
| Tool-Use Correctness (ToolBench) | Correctness percentage | 76.8% | LLMs in the Imaginarium (2024) |
| SkillsBench | Pass rate improvement (percentage points) | +16.2pp | SoK: Agentic Skills (2026) |
⚠️ Known Limitations (5)
- Self-generated skills and memories can encode incorrect heuristics, degrading performance rather than improving it. This matters because autonomous memory accumulation without quality control can compound errors over time. (affects: Reusable Agentic Skill Formalization, Cross-Framework Experience Replay)
Potential fix: Human curation of skill libraries, automated validation of stored experiences against ground truth, and disagreement gates that filter conflicting retrieved knowledge.
- Memory retrieval can inject stale or conflicting information into an agent's planning loop, especially when experiences come from different domains or framework versions. This can cause the agent to pursue outdated strategies. (affects: Cross-Framework Experience Replay, Simulated Trial and Error (STE))
Potential fix: AGENTKB's disagreement gate filters conflicting knowledge, but general solutions for memory staleness detection and expiration remain underexplored.
- Security and auditability concerns arise when agents have broad memory access in sensitive domains like healthcare. Unrestricted memory retrieval could leak private information or lead to unauthorized actions. (affects: Domain-Structured Memory Systems, Cross-Framework Experience Replay)
Potential fix: AOSH enforces least-privilege execution with restricted Linux namespaces and audit trails via document-mutation coordination, but this approach is domain-specific and not yet generalized.
- Predictive pre-fetching relies on accurate topic prediction; cache misses fall back to full retrieval latency, and prediction errors waste compute on irrelevant documents. (affects: Predictive Memory Pre-fetching)
Potential fix: Improving prediction models with richer conversational context and maintaining hybrid retrieval strategies that gracefully degrade on cache misses.
- Most memory-augmented planning systems are evaluated on specific benchmarks (GAIA, ToolBench) and lack evidence of generalization to truly open-ended, long-horizon real-world tasks. (affects: Cross-Framework Experience Replay, Simulated Trial and Error (STE), Reusable Agentic Skill Formalization)
Potential fix: Developing more diverse, long-horizon evaluation benchmarks and testing memory systems in production deployments over extended time periods.
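For the staleness limitation above, one simple baseline is to drop retrieved memories that are too old or were recorded against a different framework version before they reach the planner. The sketch below is an assumption-laden illustration, not AGENTKB's disagreement gate: `MemoryEntry`, its fields, and `filter_fresh` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    text: str
    created_at: float      # timestamp in seconds
    source_version: str    # framework version that produced the experience


def filter_fresh(entries: list, now: float, max_age: float, current_version: str) -> list:
    # Keep only entries that are recent enough AND match the running version.
    return [
        e for e in entries
        if now - e.created_at <= max_age and e.source_version == current_version
    ]
```

A hard cutoff like this is crude: it cannot tell an old but still valid fix from an outdated one, which is part of why staleness detection is listed as underexplored.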
📚 View major papers in this topic (7)
- AGENTKB: Leveraging Cross-Domain Experience for Agentic Problem Solving (2025-07) 9
- SoK: Agentic Skills — Beyond Tool Use in LLM Agents (2026-02) 9
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error (2024-03) 8
- OAgents: An Empirical Study of Building Effective Agents (2025-06) 8
- VoiceAgentRAG: Agentic RAG for Low-Latency Voice AI (2026-03) 8
- When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows (2026-03) 7
- Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories (2025-05) 7
💡 As individual agents become more capable planners through memory, the frontier extends to Multi-Agent Shared Memory—architectures and protocols that enable teams of agents to pool their knowledge, coordinate through a common memory layer, and collectively outperform what any single agent can achieve.
Multi-Agent Shared Memory
What: Multi-agent shared memory encompasses the architectures, protocols, and systems that allow multiple LLM-based agents to store, retrieve, and coordinate through a common knowledge layer, enabling collective intelligence beyond what any single agent can achieve.
Why: As LLM agents move from solo tools to collaborative teams, they need structured ways to share context, avoid conflicting actions, and accumulate experience—without such memory systems, multi-agent collaboration degrades into redundant, inconsistent work.
Baseline: The conventional approach gives each agent its own isolated context window or retrieval store, requiring explicit message-passing for every piece of shared information and offering no persistent cross-agent memory.
- Memory consistency: ensuring all agents see up-to-date, non-contradictory information when reading and writing concurrently
- Cross-framework interoperability: transferring experience and knowledge between agents built on different architectures and frameworks
- Access control and security: enforcing fine-grained permissions so agents only read or modify data they are authorized to access
- Scalable coordination: maintaining low latency and high throughput as the number of collaborating agents grows
🧪 Running Example
Baseline: With isolated memory, the treatment-planning agent cannot see the triage agent's latest notes without an explicit handoff message; if the triage agent updates an allergy list, the treatment agent may propose a contraindicated drug because its context is stale.
Challenge: The patient's record is updated by all three agents concurrently: the triage agent logs new vitals, the treatment agent adds medication orders, and the discharge agent drafts summaries—all must remain consistent, auditable, and access-controlled.
📈 Overall Progress
Multi-agent memory has evolved from ad-hoc isolated stores to formally structured, protocol-driven shared memory systems with consistency guarantees.
📂 Sub-topics
Memory Architecture and Consistency
2 papers
Frameworks that define how multi-agent memory is structured, layered, and kept consistent, drawing on principles from computer architecture such as cache hierarchies and coherence protocols.
Cross-Agent Context Protocols and Knowledge Transfer
2 papers
Standardized protocols and memory layers that enable diverse agents—potentially built on different frameworks—to share context, transfer experience, and build collective intelligence.
💡 Key Insights
💡 Memory consistency across agents is the most critical unsolved challenge, analogous to cache coherence in multiprocessor hardware.
💡 Cross-framework knowledge transfer yields large gains (+18.7pp on GAIA), indicating that collective memory can outperform isolated experience.
💡 Standardized context protocols eliminate brittle point-to-point integrations, enabling plug-and-play multi-agent collaboration.
💡 Safety-critical domains demand page-indexed, audit-trailed memory with least-privilege access rather than flat vector stores.
💡 Hardware memory hierarchy concepts (I/O, cache, persistent store) transfer effectively to organizing LLM agent memory.
📅 Timeline
Early work (2025) focused on standardizing how agents share context through universal protocols and cross-framework knowledge bases. By 2026, the field shifted toward formalizing memory architectures with hardware-inspired hierarchies and deploying shared memory in safety-critical domains with strict access control.
- (MCP, 2025) introduced a standardized context-sharing protocol acting as 'USB-C for context,' decoupling memory management from individual agent logic
- (AgentKB, 2025) created a universal cross-framework memory layer, achieving +18.7pp improvement on GAIA (55.2% → 73.9%) over framework-isolated baselines
- (CA-Memory, 2026) formalized multi-agent memory as a three-layer hierarchy (I/O, Cache, Memory) and identified consistency as the most pressing open challenge
- (AOSH, 2026) deployed page-indexed memory with document-mutation coordination in a hospital agentic OS, enforcing least-privilege execution for clinical safety
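Results in this area are reported both in percentage points (pp) and as relative improvements, and the two are easy to confuse. The short sketch below applies both definitions to the AgentKB GAIA figures quoted in the timeline (55.2% to 73.9%); the helper name `gain` is an assumption.

```python
def gain(baseline: float, improved: float) -> tuple:
    # Absolute gain in percentage points, and gain relative to the baseline in percent.
    points = improved - baseline
    relative = points / baseline * 100
    return round(points, 1), round(relative, 1)


print(gain(55.2, 73.9))  # (18.7, 33.9)
```

The same change is therefore +18.7pp in absolute terms and roughly a 33.9% relative improvement, so a reported gain should always state which of the two it is.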
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Three-Layer Memory Hierarchy | Apply hardware-inspired cache hierarchies and coherence protocols to give multi-agent systems structured, consistent shared memory. | Ad-hoc shared context stores that lack formal consistency guarantees and treat all memory accesses uniformly regardless of latency requirements | CA-Memory (2026) |
| Document-Mutation Coordination | Agents share state by mutating structured documents rather than passing messages, combining navigable page-indexed memory with strict access-control isolation. | General-purpose agent frameworks that use flat vector stores and lack the security, auditability, and longitudinal memory required for safety-critical domains like healthcare | When OpenClaw Meets Hospital: Toward... (2026) |
| Model Context Protocol | A universal plug-and-play protocol that standardizes how agents access shared context, eliminating custom integrations between heterogeneous agent architectures. | Bespoke point-to-point integrations where each agent-to-agent or agent-to-tool connection requires custom code, making systems fragile and hard to scale | Advancing Multi-Agent Systems Through Model... (2025) |
| Universal Cross-Framework Memory Layer | Unify experience from multiple incompatible agent frameworks into a shared, framework-agnostic memory so agents never rediscover known solutions or repeat known mistakes. | Framework-specific memory systems that trap knowledge within individual agent architectures, preventing cross-system learning and collective intelligence | AGENTKB (2025) |
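Two requirements recur across the methods above: writes must not silently overwrite newer data, and agents should only modify what they are authorized to. The sketch below combines a per-agent write list with version-checked writes (optimistic concurrency) and an audit log. It is a simplified illustration, not AOSH's document-mutation mechanism or a hardware coherence protocol; the `SharedMemory` class and its method signatures are hypothetical.

```python
class SharedMemory:
    """Toy shared store with write permissions, versioning, and an audit trail."""

    def __init__(self, write_acl: dict):
        self.write_acl = write_acl   # agent name -> set of keys it may write
        self.store = {}              # key -> (version, value)
        self.audit_log = []          # (agent, key, new_version) per accepted write

    def read(self, key: str) -> tuple:
        # Unwritten keys read as version 0 with no value.
        return self.store.get(key, (0, None))

    def write(self, agent: str, key: str, value, expected_version: int) -> bool:
        if key not in self.write_acl.get(agent, set()):
            raise PermissionError(f"{agent} may not write {key}")
        version, _ = self.read(key)
        if version != expected_version:
            # Someone else wrote since this agent last read: reject the stale write.
            return False
        self.store[key] = (version + 1, value)
        self.audit_log.append((agent, key, version + 1))
        return True
```

A rejected write forces the agent to re-read before retrying, which is how a stale allergy list would be caught in the hospital scenario rather than overwritten.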
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| GAIA | pass@3 | 73.9% | AGENTKB (2025) |
| SWE-bench Lite | pass@100 | 45.7% | AGENTKB (2025) |
| Humanity's Last Exam (Bio/Chem) | pass@3 | 14.1% | AGENTKB (2025) |
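The pass@k metrics in the table above can be made concrete with a small sketch. The code below uses the widely used unbiased pass@k estimator (the probability that at least one of k samples drawn from n attempts is correct). Whether AgentKB computes its pass@3 and pass@100 figures this way is an assumption, since the summary only names the metric; the function names are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which were correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each result is (attempts, correct attempts)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

When n equals k, the estimate reduces to "did any of the k attempts succeed", which is the simplest reading of a pass@3 score.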
⚠️ Known Limitations (4)
- Memory consistency remains unsolved: no existing system guarantees that concurrent agent reads and writes produce conflict-free, up-to-date results, which can cause agents to act on stale or contradictory information. (affects: Three-Layer Memory Hierarchy, Document-Mutation Coordination (AOSH))
Potential fix: Adapting formal cache coherence protocols (e.g., MESI) from hardware to agent memory, with conflict detection and resolution mechanisms.
- Evaluation is largely qualitative or domain-specific: most papers either provide no quantitative evaluation or test only in narrow domains, making it difficult to compare approaches or assess generalization. (affects: Three-Layer Memory Hierarchy, Model Context Protocol, Document-Mutation Coordination (AOSH))
Potential fix: Developing standardized multi-agent memory benchmarks that test consistency, latency, and coordination quality across diverse scenarios.
- Scalability to large agent populations is untested: current systems demonstrate coordination among small numbers of agents, leaving open questions about performance degradation as agent counts grow to dozens or hundreds. (affects: Model Context Protocol, Universal Cross-Framework Memory Layer (AgentKB), Document-Mutation Coordination (AOSH))
Potential fix: Borrowing distributed-systems techniques such as sharding, replication, and eventual consistency to handle larger agent populations.
- Security and access control add latency and complexity: enforcing least-privilege execution and audit trails in shared memory systems introduces overhead that may conflict with the low-latency requirements of real-time agent collaboration. (affects: Document-Mutation Coordination (AOSH), Three-Layer Memory Hierarchy)
Potential fix: Lightweight capability-based access control models that enforce permissions with minimal runtime overhead.
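To illustrate the consistency limitation above, here is a minimal sketch of conflict detection on a shared agent memory. It uses optimistic concurrency control with per-key version numbers, which is a much simpler technique than a MESI-style coherence protocol and is not taken from any of the papers listed; `SharedMemory` and `StaleWriteError` are hypothetical names.

```python
from typing import Dict, Optional, Tuple


class StaleWriteError(Exception):
    """Raised when a writer's view of a key is older than the stored version."""


class SharedMemory:
    """Toy shared store: every key carries a version that increases on write."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, int]] = {}

    def read(self, key: str) -> Tuple[Optional[str], int]:
        # Version 0 means "key has never been written".
        return self._entries.get(key, (None, 0))

    def write(self, key: str, value: str, expected_version: int) -> int:
        _, current = self.read(key)
        if expected_version != current:
            # Another agent wrote after this agent's read: reject, do not overwrite.
            raise StaleWriteError(f"{key}: expected v{expected_version}, found v{current}")
        self._entries[key] = (value, current + 1)
        return current + 1
```

If two agents read the same key at version 0 and both try to write, the first write succeeds and the second raises `StaleWriteError`, so the second agent must re-read before acting. The sketch is single-threaded; a real deployment would also need atomic compare-and-set in the backing store.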
📚 View major papers in this topic (4)
- AGENTKB: Leveraging Cross-Domain Experience for Agentic Problem Solving (2025-07) 9
- Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead (2026-03) 8
- When OpenClaw Meets Hospital: Toward an Agentic Operating System for Dynamic Clinical Workflows (2026-03) 7
- Advancing Multi-Agent Systems Through Model Context Protocol: Architecture, Implementation, and Applications (2025-04) 7
💡 With memory systems spanning from individual agent architectures to multi-agent shared knowledge, Agent Memory Evaluation develops the benchmarks and metrics needed to assess whether agents can truly leverage accumulated memory to guide decisions across multi-session, interdependent tasks.
Agent Memory Evaluation
What: Agent Memory Evaluation covers benchmarks, evaluation frameworks, and metrics designed to assess how effectively AI agents acquire, retain, and use memory to guide future decisions across multi-session interactions.
Why: Existing benchmarks test memorization and action in isolation, failing to capture whether agents can actively leverage accumulated experience to solve progressively complex tasks—a critical capability for real-world deployment.
Baseline: Conventional evaluation either measures static recall accuracy (e.g., QA over past conversations) without requiring action, or tests single-session agent performance where long-term memory is unnecessary.
- Coupling memorization with action: evaluating whether recalled information actually improves downstream task completion, not just retrieval accuracy
- Designing interdependent multi-session tasks where later sessions are underspecified without memory from earlier sessions
- Scaling evaluation to long horizons (50+ action steps, 40k+ token traces) that stress both memory capacity and reasoning
🧪 Running Example
Baseline: A baseline agent without structured memory evaluation treats each session independently, failing to recall the user's loyalty program or room preferences, and either asks redundant questions or books a suboptimal hotel.
Challenge: The booking task is deliberately underspecified—critical constraints (loyalty chain, bed type, floor preference) were established in earlier sessions and must be distilled from accumulated experience rather than explicitly restated.
📈 Overall Progress
Evaluation of agent memory is shifting from passive recall accuracy to action-coupled task completion across interdependent multi-session settings.
💡 Key Insights
💡 Agents with near-perfect static memory recall perform poorly when memory must drive multi-session action.
💡 Evaluation must couple memorization with downstream task completion to reveal true memory capability gaps.
💡 Long-horizon tasks (57+ steps, 40k+ tokens) expose failures in maintaining latent task states across sessions.
💡 Four diverse domains (shopping, travel, search, reasoning) are needed to stress-test different memory usage patterns.
📅 Timeline
Early work in this area reveals a critical gap: agents that excel at static memory benchmarks struggle when memory must actively guide decisions in progressive, multi-session tasks, motivating a new generation of evaluation frameworks.
- (MemoryArena, 2026) introduced a benchmark evaluating agent memory through interdependent multi-session tasks across four domains, revealing that agents with near-saturated static memory scores perform poorly when memory must guide action
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Memory-Agent-Environment Loop Evaluation | Evaluate memory through progressive, interdependent tasks where correct recall is a prerequisite for successful action, not a standalone metric. | Static memory benchmarks that test recall in isolation (e.g., QA over past conversations) and single-session agent benchmarks that do not require long-term memory | Benchmarking Agent Memory in Interdependent... (2026) |
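The distinction between recall and action-coupled evaluation can be sketched with a toy harness. This is an illustration under stated assumptions, not the MemoryArena protocol: `Episode`, `ToyAgent`, and the FIFO `capacity` limit are hypothetical, and the agent's "action" is simply the set of constraints it still remembers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Episode:
    reveals: List[Dict[str, str]]  # facts stated in earlier sessions
    required: Dict[str, str]       # constraints the final task silently depends on


class ToyAgent:
    """Remembers at most `capacity` facts, evicting the oldest first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.memory: Dict[str, str] = {}

    def observe(self, facts: Dict[str, str]) -> None:
        for key, value in facts.items():
            self.memory.pop(key, None)  # re-insert so updated facts count as newest
            self.memory[key] = value
            while len(self.memory) > self.capacity:
                del self.memory[next(iter(self.memory))]

    def act(self) -> Dict[str, str]:
        return dict(self.memory)


def evaluate(episodes: List[Episode], capacity: int) -> Tuple[float, float]:
    """Return (mean recall accuracy, task completion rate)."""
    recall_sum, completed = 0.0, 0
    for ep in episodes:
        agent = ToyAgent(capacity)
        for facts in ep.reveals:
            agent.observe(facts)
        action = agent.act()
        hits = sum(action.get(k) == v for k, v in ep.required.items())
        recall_sum += hits / len(ep.required)
        completed += hits == len(ep.required)  # success needs every constraint
    return recall_sum / len(episodes), completed / len(episodes)
```

With three required constraints and a capacity of two, this agent scores about 0.67 on recall but 0.0 on task completion, which is the kind of gap that recall-only metrics hide.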
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MemoryArena | Task Completion Rate | Low task completion rates | Benchmarking Agent Memory in Interdependent... (2026) |
⚠️ Known Limitations (3)
- Limited coverage of memory evaluation approaches: with only one benchmark paper, the landscape of evaluation methodologies remains underexplored, making it difficult to establish consensus on best practices for memory assessment. (affects: Memory-Agent-Environment Loop Evaluation)
Potential fix: Development of complementary benchmarks targeting different memory types (episodic, semantic, procedural) and interaction patterns.
- Scalability of multi-session evaluation: tasks averaging 57 action steps and 40k+ token traces are computationally expensive to run, potentially limiting widespread adoption and iteration speed. (affects: Memory-Agent-Environment Loop Evaluation)
Potential fix: Developing tiered evaluation suites with lightweight proxy tasks for rapid iteration alongside full-scale benchmarks for comprehensive assessment.
- Domain coverage: current evaluation spans four domains (shopping, travel, search, reasoning), but real-world agents operate in many more settings including coding, scientific research, and social interaction. (affects: Memory-Agent-Environment Loop Evaluation)
Potential fix: Extending the benchmark framework to additional domains and allowing community contributions of new task environments.
📚 View major papers in this topic (1)
💡 Beyond the core categories of organization, recall, and agentic memory, a rich collection of Other Topics addresses cross-cutting concerns—from LLM personalization and memory security to hardware in-memory computing and theoretical foundations—that collectively shape how memory is implemented and optimized across the full AI stack.
Other Topics
What: This topic encompasses papers that do not fit into the main memory categories but contribute to the broader memory landscape, spanning LLM personalization, efficient inference and training, continual learning, spatial memory for vision, hardware in-memory computing, memory security, and theoretical foundations of memory in neural systems.
Why: These diverse contributions collectively shape how memory is understood, implemented, and optimized across the AI stack—from silicon hardware to high-level agent cognition—filling critical gaps that no single core memory category addresses.
Baseline: Conventional approaches typically treat memory as monolithic: LLMs process full contexts indiscriminately, training uses standard backpropagation with full optimizer states, and hardware relies on the von Neumann architecture with separate memory and compute units.
- Scaling memory efficiently: KV caches grow linearly with sequence length, optimizer states consume 2-3x model size, and hardware bandwidth lags behind compute scaling
- Balancing personalization with factual reliability: incorporating user history risks entangling preferences with facts, causing hallucinations aligned with user biases rather than truth
- Preventing catastrophic forgetting in sequential learning while maintaining plasticity for new tasks within bounded compute and memory resources
- Securing persistent memory against adversarial manipulation, where poisoned memories can trigger unauthorized actions in future sessions
🧪 Running Example
Baseline: A standard LLM either ignores the history entirely (generating a generic restaurant list) or naively retrieves all past mentions of food, flooding the context with irrelevant details and potentially exceeding the context window. If the user once mentioned disliking sushi in a joke, the system may incorrectly exclude all Japanese restaurants.
Challenge: The assistant must selectively retrieve relevant preferences (partner's dietary restrictions, budget, location), distinguish genuine preferences from casual mentions, handle evolved preferences (the partner recently became vegetarian), and do all this within a bounded KV cache without degrading response quality.
📈 Overall Progress
The field has evolved from treating memory as a passive storage layer to actively engineering it as a first-class system component—with theoretical guarantees, hardware co-design, and security considerations.
📂 Sub-topics
LLM Personalization
20 papers
Methods for tailoring LLM outputs to individual users via retrieval, embedding injection, reinforcement learning from interaction, and causal preference modeling.
Efficient LLM Inference
22 papers
Techniques for reducing inference cost including KV cache compression/eviction, structured pruning, speculative decoding, and sparse attention optimization.
Memory-Efficient Training
5 papers
Methods to reduce GPU memory consumption during LLM pre-training and fine-tuning, including gradient projection, layerwise sampling, and zeroth-order optimization.
Continual Learning & Forgetting Prevention
8 papers
Approaches to enable models to learn from sequential data streams without catastrophically forgetting previous knowledge, including information-theoretic frameworks, gated adaptation, and causal feature expansion.
Spatial Memory for Vision & 3D
7 papers
External memory architectures for maintaining 3D spatial consistency in video generation, world simulation, and robotic manipulation, inspired by biological working and episodic memory.
Hardware Memory & In-Memory Computing
10 papers
Physical memory technologies (memristors, phase-change memory, Processing-in-Memory) and analysis of the memory wall bottleneck for AI workloads.
Memory Evaluation & Benchmarks
7 papers
Benchmarks and evaluation frameworks for assessing memory capabilities of LLM agents, including factual recall, cognitive memory, preference tracking, and multi-turn consistency.
Memory Security & Adversarial Attacks
3 papers
Vulnerabilities in memory-augmented systems including poisoning attacks on RAG knowledge bases, hidden state corruption in SSMs, and context manipulation in agent memory.
Theoretical Foundations of Memory
8 papers
Fundamental theories connecting memory to attention mechanisms, position encoding, biological neural circuits, and information-theoretic principles.
💡 Key Insights
💡 Memory bandwidth, not compute, is the primary bottleneck for modern AI—scaling at 1.6x/2yrs vs 3.0x/2yrs for FLOPS.
💡 Frontier models achieve only ~50% on personalization tasks requiring evolving user tracking, barely above chance.
💡 Catastrophic forgetting has a provable information-theoretic bound: context channel capacity must exceed task entropy.
💡 Persistent agent memory is a critical attack surface—poisoned memories achieve >80% attack success on frontier models.
💡 Gradient low-rank projection enables 7B model pre-training on a single 24GB consumer GPU without sacrificing quality.
💡 The 'Lost in the Middle' attention bias is a geometric property at initialization, not a learned artifact of training.
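The memory-wall insight above implies a compounding gap, which a few lines of arithmetic make explicit. The sketch assumes both rates stay constant and compound every two years; the source gives only the two rates (3.0x and 1.6x per two years), and `scaling_gap` is an illustrative name.

```python
def scaling_gap(years: float,
                compute_per_2yr: float = 3.0,
                bandwidth_per_2yr: float = 1.6) -> float:
    """Factor by which compute growth outpaces memory-bandwidth growth."""
    periods = years / 2.0
    # Ratio of the two growth curves after the given number of 2-year periods.
    return (compute_per_2yr / bandwidth_per_2yr) ** periods
```

Under these assumptions the gap widens by 1.875x every two years, which compounds to roughly 23x over a decade.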
📅 Timeline
Research has progressed from foundational theory (FFN-as-memory, memory wall analysis) through practical memory-efficient training breakthroughs (GaLore, LISA) to sophisticated memory-aware systems for personalization, spatial reasoning, and agent security, with 2026 bringing formal information-theoretic frameworks that unify previously disparate empirical findings.
- (FFN-as-KV-Memory, 2021) reinterpreted Transformer feed-forward layers as key-value memories, revealing that lower layers detect shallow patterns while upper layers encode semantic concepts
- (RowHammer, 2023) documented a decade of DRAM vulnerability research showing >80% of commodity DRAM modules are susceptible to read-disturbance bitflips
- (BB-LDPC, 2023) achieved quantum error correction protecting 12 logical qubits with only 288 physical qubits, a >10x overhead reduction over surface codes
- (DBE, 2023) proposed decoupling global and client-specific representations in federated learning, improving accuracy by up to 32.3%
- (GaLore, 2024) enabled pre-training LLaMA 7B on a single 24GB consumer GPU by projecting gradients into low-rank subspaces, reducing optimizer memory by 65.5%
- (LISA, 2024) outperformed LoRA by 11-38% on MT-Bench by randomly unfreezing layer subsets, achieving full-parameter quality at LoRA-level memory cost
- (AgentPoison, 2024) demonstrated that RAG-based agent memory can be poisoned via embedding space manipulation, achieving >80% attack success with <0.1% poison rate
- (MemoryWall, 2024) quantified the widening gap between compute scaling (3.0x/2yrs) and memory bandwidth scaling (1.6x/2yrs), establishing memory as the primary AI bottleneck
- (RLPA, 2025) formulated personalization as a multi-turn MDP with simulated users, outperforming SFT by 29 points and surpassing GPT-4o on generalization benchmarks
- Point3R (Point3R, 2025) introduced explicit spatial pointer memory with 3D-extended RoPE for streaming reconstruction, generalizing across 14 diverse datasets
- (MemoryVLA, 2025) added perceptual-cognitive memory to robotic VLA models with biological consolidation, achieving 26% improvement over CogACT on long-horizon tasks
- (MemSurvey, 2025) proposed a unified four-type memory taxonomy and layered evaluation framework, identifying systemic biases in automated memory evaluation
- (FakeMemories, 2025) demonstrated that memory injection attacks achieve >80% success rates on GPT-4o and Claude in Web3 agent scenarios
- (CCC, 2026) proved an Impossibility Triangle for continual learning and showed HyperNetworks achieve near-zero forgetting via high context capacity
- (LostMiddle, 2026) proved the U-shaped attention bias exists at initialization (before training), caused by iterated Cesàro matrix geometry rather than positional encodings
- (LongFlow, 2026) fused KV cache eviction directly into FlashAttention kernels, achieving 11.8x throughput with 80% cache reduction for reasoning models
- (RP-Reasoner, 2026) used Bayesian pragmatic reasoning to filter irrelevant memories, improving accuracy by 35% and resolving 80% of bad cases in production
- El Agente Gráfico (ElAgente, 2026) embedded LLM decision-making in type-safe execution graphs with Knowledge Graph persistence, reducing scientific agent costs by 96%
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| KV Cache Compression & Eviction | Predict which cached tokens the model will actually need for future generation, and evict the rest before or during decoding. | Full KV cache retention, which grows linearly with sequence length and dominates GPU memory during inference | LookaheadKV (2026), LongFlow (2026), InfLLM (2024) |
| Memory-Efficient Training via Gradient Projection | Gradients naturally become low-rank during training; projecting them into this subspace before the optimizer step reduces state memory without restricting the model's learning capacity. | Standard AdamW optimizer states (which consume 2-3x model size) and LoRA (which restricts parameters to a low-rank subspace) | GaLore (2024), LISA (2024) |
| Personalization via Retrieval-Augmented User Modeling | Treat user history as a queryable knowledge base, but compress and filter it intelligently so only preference-relevant context reaches the model. | Generic LLM outputs that ignore user preferences, and naive full-history concatenation that exceeds context limits | Persona-DB (2024), How Does Personalized Memory Shape... (2026), Integrating Summarization and Retrieval for... (2023) |
| Speculative Decoding Optimization | Replace sequential token-by-token generation with parallel draft-then-verify cycles, using the model's own structure as the drafter to avoid auxiliary model overhead. | Standard autoregressive decoding (one token per forward pass) and traditional speculative decoding (requires a separate trained draft model) | Speculative Streaming (2025), DynaSpec (2025), PLD+: Accelerating LLM inference by... (2024) |
| Spatial Memory for Consistent World Generation | Store an explicit 3D point cloud or geometry-indexed memory bank that serves as a persistent spatial reference, retrievable by current camera pose rather than appearance similarity. | Autoregressive video models with limited temporal context windows that forget previously generated scenes upon revisiting | Point3R (2025), WorldMem (2025), Memory Forcing (2025) |
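For the gradient-projection row in the table above, a back-of-the-envelope count shows where the optimizer-state savings come from. The sketch assumes an Adam-style optimizer (two moment buffers per parameter), a weight matrix of shape m x n with m <= n, and a stored m x r projection matrix; it counts elements for a single matrix only and does not reproduce GaLore's reported whole-model figure.

```python
def adam_state_elems(m: int, n: int) -> int:
    """Two moment buffers, each the same shape as the m x n weight matrix."""
    return 2 * m * n


def projected_adam_state_elems(m: int, n: int, r: int) -> int:
    """Moments kept on the r x n projected gradient, plus the m x r projector."""
    if not 0 < r <= min(m, n):
        raise ValueError("rank must satisfy 0 < r <= min(m, n)")
    return m * r + 2 * r * n


def state_reduction(m: int, n: int, r: int) -> float:
    """Fraction of optimizer-state elements saved for this one matrix."""
    return 1.0 - projected_adam_state_elems(m, n, r) / adam_state_elems(m, n)
```

For a hypothetical 4096 x 4096 layer at rank 128, the projected state is under 5% of the full Adam state. The savings across a whole model are smaller, because embeddings, norms, and other parameters are typically not projected and the chosen rank varies.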
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| PersonaMem (Dynamic User Profiling) | Multiple-Choice Accuracy | ~50% | Know Me, Respond to Me (2025) |
| MT-Bench (LLM Quality after Fine-tuning) | MT-Bench Score | 11-38% improvement over LoRA | LISA (2024) |
| ∞-Bench (Long-Context Capability) | Average Score | 22.82% | InfLLM (2024) |
⚠️ Known Limitations (5)
- Personalization-factuality tension: incorporating user history often causes models to validate user misconceptions rather than stating objective truths, degrading factual reliability (affects: Retrieval-Augmented Personalization, Embedding-Based Persona Injection)
Potential fix: Factuality-Preserving Personalized Steering (FPPS) uses lightweight probes to detect entanglement and applies adaptive hidden-state steering to restore factuality when needed
- Memory security vulnerability: persistent memory modules in agents are unprotected attack surfaces where adversaries can plant dormant instructions that trigger unauthorized actions in future sessions (affects: Retrieval-Augmented Personalization, Memory-Augmented Agents)
Potential fix: Fine-tuning-based defenses significantly reduce attack success (from ~85% to <10%) while preserving utility; activation fingerprinting (Clasp) can detect poisoned tokens via internal activation patterns
- Evaluation fragmentation: memory benchmarks conflate retrieval quality with generation faithfulness, and automated judges suffer from position/order/self-preference biases that produce spurious significance (affects: Memory Evaluation Frameworks)
Potential fix: The unified memory quadruple taxonomy and three-setting parallel evaluation protocol help decouple internal capability from external information availability; constraint-consistency metrics avoid length bias
- Hardware memory wall: the widening gap between compute scaling and memory bandwidth scaling means inference optimizations hit fundamental physical limits, particularly for decoder-only architectures (affects: KV Cache Compression & Eviction, Speculative Decoding Optimization)
Potential fix: Processing-in-Memory architectures (UPMEM, PIM) and in-memory computing with memristive devices offer potential by performing computation where data resides, eliminating the transfer bottleneck
- Cognitive memory collapse: while models can recall explicit facts reasonably well (60-70%), they fail dramatically (30-50%) when required to apply implicit constraints or reason about evolved user states (affects: Memory Evaluation Frameworks, Retrieval-Augmented Personalization)
Potential fix: Multi-turn RL-based alignment (RLPA) and causal preference modeling (NextQuill) show promise by training models to explicitly track and update user state representations
📚 View major papers in this topic (10)
- Retrospective: Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (2023-06) 10
- Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting (2026-03) 9
- Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias (2026-03) 9
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2024-03) 9
- LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning (2024-03) 9
- Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents (2025-03) 9
- Memory in Large Language Models: Mechanisms, Evaluation and Evolution (2025-09) 9
- Agentic Neurosymbolic Collaboration for Mathematical Discovery (2026-03) 9
- High-threshold and low-overhead fault-tolerant quantum memory (2023-08) 9
- Transformer Feed-Forward Layers Are Key-Value Memories (2021-11) 8
💡 Shifting from category-based analysis to cross-cutting themes, we begin with Long-context Memory Management, which tackles the infrastructure-level challenge of KV cache compression, context window extension, and efficient attention mechanisms that underpin all higher-level memory architectures.
Long-context Memory Management
What: This topic covers techniques for managing memory in large language models when processing long contexts, including KV cache compression and eviction, context window extension, efficient attention mechanisms, position encoding strategies, and agent memory systems.
Why: As LLMs are deployed in agentic systems, multi-turn conversations, and document-intensive tasks, the quadratic cost of attention and linear growth of the KV cache create severe computational and memory bottlenecks that limit throughput, context length, and deployment on resource-constrained devices.
Baseline: The standard approach stores all key-value pairs from every token in GPU memory and performs full self-attention over the entire history at each generation step, with contiguous memory allocation and fixed positional encodings (e.g., RoPE).
- KV cache memory grows linearly with sequence length and batch size, quickly exhausting GPU memory for long contexts
- Identifying which tokens are important for future generation is fundamentally difficult since the model cannot foresee upcoming queries
- Compressing or evicting context risks losing critical information needed for downstream reasoning, especially in multi-hop tasks
- Position encodings trained on short sequences fail to generalize to longer contexts, causing out-of-distribution attention patterns
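The first challenge above, linear growth of the KV cache with sequence length and batch size, follows directly from how the cache is laid out. The sketch below counts one key and one value vector per layer, KV head, and token; the example configuration in the usage note is hypothetical and not taken from any paper in this section.

```python
def kv_cache_bytes(num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   seq_len: int,
                   batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every token in the batch."""
    # Factor 2 covers the key tensor and the value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size
```

For an assumed model with 32 layers, 32 KV heads, head dimension 128, and 16-bit values, each token costs 512 KiB, so a single 4,096-token sequence already occupies 2 GiB, and doubling either the sequence length or the batch size doubles that figure.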
🧪 Running Example
Baseline: A standard LLM would either exceed its context window and lose the early discussion entirely, or store all 681 turns in the KV cache, consuming massive GPU memory. In practice, 21.8% of stored tokens (tool schemas, stale outputs) are structural waste that degrades attention quality and increases latency quadratically.
Challenge: The model must selectively remember specific technical details from turn 481 and turn 631, while forgetting thousands of intermediate tool calls, error messages, and irrelevant code snippets—mimicking human working memory rather than a tape recorder.
📈 Overall Progress
The field shifted from passive full-cache retention to intelligent, learned memory management where models actively decide what to remember, compress, or forget.
📂 Sub-topics
KV Cache Compression & Eviction
8 papers
Methods that reduce the size of the key-value cache by scoring token importance and selectively evicting or merging less important entries, enabling long-context inference within fixed memory budgets.
Serving & Memory Management Infrastructure
6 papers
Systems-level approaches that apply operating system concepts (virtual memory, paging, demand loading) to manage KV cache allocation across distributed GPU and CPU resources.
Agent Memory & Context Management
8 papers
Approaches that equip LLM agents with active memory management capabilities, using reinforcement learning or learned policies to decide what to store, compress, or discard during long-horizon tasks.
Memory-Augmented Architectures
10 papers
Novel neural architectures that extend Transformers with explicit memory modules, including latent-space memory banks, hierarchical attention, associative memories, and external memory retrieval mechanisms.
Position Encoding & Attention Optimization
5 papers
Techniques that improve how LLMs encode token positions and allocate attention, enabling better generalization to longer contexts and reducing distraction from irrelevant information.
Personalization & Long-term User Memory
5 papers
Methods and benchmarks for tracking and leveraging evolving user preferences, traits, and personas across long conversation histories to deliver personalized responses.
Context Compression & Summarization
7 papers
Approaches that compress long contexts through summarization, soft token compression, visual rendering, or task-aware KV cache distillation to fit more information within limited context windows.
Memory Frameworks, Taxonomies & Evaluation
5 papers
Surveys, taxonomies, and evaluation frameworks that formalize LLM memory types, define atomic operations, and provide benchmarks for measuring memory utilization capabilities.
💡 Key Insights
💡 KV cache eviction is most effective when aligned with future decoding patterns, not just past attention scores.
💡 Models trained with RL to manage their own memory can extrapolate to context lengths 10-400x beyond training.
💡 OS concepts (paging, virtual memory, demand loading) translate remarkably well to LLM memory management.
💡 Position encoding matters more than semantic content for KV cache importance scoring during prefill.
💡 Frontier models achieve only ~50% accuracy on tracking evolving user preferences, revealing a major capability gap.
💡 Compressing context often improves performance by removing distracting information, not just saving memory.
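The OS analogy in the insights above can be sketched as a block table that maps each sequence to fixed-size physical blocks allocated on demand. This is a simplified illustration of the paging idea, not the vLLM or vAttention implementation; `BlockAllocator` and its methods are hypothetical, and features such as sharing blocks across requests or swapping cold blocks out are omitted.

```python
from typing import Dict, List, Tuple


class BlockAllocator:
    """Toy paged KV cache: sequences grow one fixed-size block at a time."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # sequence id -> physical block ids
        self.lengths: Dict[str, int] = {}       # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> Tuple[int, int]:
        """Reserve a slot for one token; return (physical block, offset)."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # All current blocks are full (or none exist): allocate on demand.
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return self.tables[seq_id][length // self.block_size], length % self.block_size

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Reserved but unused slots; at most block_size - 1 per sequence."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.tables.items())
```

Because a sequence never reserves more than one partially filled block, internal waste is bounded by the block size rather than by the maximum sequence length, which is the intuition behind the fragmentation savings reported for paged KV caches.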
📅 Timeline
Research evolved from systems-level innovations (PagedAttention, vAttention) through architectural memory augmentation (LM2, InfLLM) to the current frontier where RL-trained agents actively curate their own working memory, converging systems engineering with learned intelligence.
- PagedAttention (vLLM, 2023) introduced virtual memory paging for KV cache, reducing waste from 60-80% to under 4% and improving serving throughput 2-4x
- (LongMem, 2023) pioneered decoupled memory architecture with a frozen backbone and trainable SideNet, achieving state-of-the-art on the ChapterBreak benchmark
- (CMANPs, 2023) proposed constant-memory attention blocks using reformulated cross-attention as a rolling average operation
- (DMC, 2024) taught models to dynamically compress their own KV cache via retrofitting, achieving 350-700% throughput gains on H100 GPUs
- (InfLLM, 2024) demonstrated training-free extrapolation to 1M tokens by offloading KV blocks to CPU with representative token retrieval
- (ReadAgent, 2024) introduced human-inspired gist memory that extended effective context by 3.5x-20x while surpassing full-context baselines
- vAttention (vAttention, 2024) replaced PagedAttention with CUDA VMM-based demand allocation, improving throughput by up to 1.99x without kernel rewrites
- (NAMMs, 2024) evolved neural memory managers that outperformed full-context Llama-3-8B by 11% on LongBench while reducing cache size
- (TAPE, 2025) introduced contextualized equivariant positional encoding that updates positions layer-by-layer, achieving state-of-the-art perplexity of 7.063 on PG-19 at 8K length
- LM2 (LM2, 2025) added dual-stream gated memory to Transformers, outperforming RMT by 37.1% and improving MMLU by 5.0% over vanilla Llama-3.2
- (RLMs, 2025) enabled symbolic recursion over prompts via a REPL environment, outperforming GPT-5 by 28.4% on long-context tasks
- (RePo, 2025) learned content-aware token positions, improving RULER scores by +11.04 points by reducing extraneous cognitive load
- (PersonaMem, 2025) revealed frontier models achieve only ~50% accuracy on evolving persona tracking across 1M-token histories
- (MemAgent, 2025) used RL to train memory overwrite, achieving >95% accuracy at 512K tokens and extrapolating to 3.5M tokens with linear complexity
- (CAT, 2025) matched dense transformer quality while being 1.4-3x faster and 2-9x more memory efficient via parallel chunk compression
- Memory Mosaics v2 (Memory Mosaics, 2025) scaled associative memory networks to 10B parameters, outperforming Transformers by 12-15% on multi-document QA
- PersonaMem-v2 (PersonaMem-v2, 2025) demonstrated RL-trained agentic memory outperforming GPT-5 on implicit personalization while using 16x fewer tokens
- (Rethinking Memory, 2025) formalized six atomic memory operations and the Relative Citation Index for trend analysis
- (StateLM, 2026) introduced the Pensieve paradigm where models actively delete their own context, achieving 52% accuracy on BrowseComp-Plus vs 5% for standard LLMs
- (Pichay, 2026) built a complete demand-paging system for LLM context, reducing consumption by 93% with a 0.025% page fault rate across 1.4M simulated evictions
- (LycheeCluster, 2026) combined structure-aware chunking with hierarchical KV indexing for 3.6x end-to-end inference speedup over full attention
- (LookaheadKV, 2026) reduced eviction cost by 14.5x using learnable tokens that predict future attention patterns with negligible overhead
- (LongFlow, 2026) achieved 11.8x throughput for reasoning models by fusing KV eviction directly into FlashAttention kernels
- (MemPO, 2026) optimized memory as an intrinsic RL action with dual rewards, gaining +25.98% F1 while cutting tokens by 67%
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| KV Cache Eviction with Importance Scoring | Predict which cached tokens the model will actually need during generation, and discard the rest to fit within a fixed memory budget. | Full KV cache retention (which grows linearly with sequence length) and simple heuristics like keeping only recent tokens (sliding window) | LycheeCluster (2026), LookaheadKV (2026), Where Matters More Than What:... (2026), Dynamic Memory Compression (2024) |
| Virtual Memory and Paging for KV Cache | Treat the KV cache like OS virtual memory: allocate on demand, page to disk when cold, and share across requests via reference counting. | Static contiguous memory allocation that wastes 60-80% of GPU memory due to fragmentation and over-provisioning | Efficient Memory Management for Large... (2023), The Missing Memory Hierarchy: Demand... (2026), vAttention: Dynamic Memory Management for... (2024), MemServe (2024) |
| RL-based Active Memory Management | Let the model learn through trial-and-error rewards what information to keep, compress, or discard from its working memory. | Fixed external memory modules and rule-based context truncation that cannot adapt to task-specific information needs | MemAgent (2025), StateLM (2026), MemPO (2026), Mem-α: Training LLMs to Manage... (2025) |
| Hierarchical and Compressed Attention | Compress past context into compact hierarchical representations so that each new token attends to summaries rather than the full history. | Full self-attention (quadratic cost) and simple sliding window attention (which loses distant context entirely) | Compress & Attend Transformer (2025), PHOTON (2025), Memory Mosaics at scale (2025), Slow-Fast Inference (2026) |
| Latent-Space Memory Augmentation | Maintain a persistent memory bank of compressed past states that the model can query via learned retrieval, decoupling memory capacity from context window size. | Context window limits that force the model to either truncate history or process prohibitively long sequences | LM2 (2025), M+: Extending MemoryLLM with Scalable... (2025), InfLLM (2024), Language Models Augmented with Long-Term... (2023) |
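
The eviction idea in the first table row can be illustrated with a minimal sketch. It assumes each cached token already has an importance score (for example, accumulated attention mass); how that score is predicted is exactly what the cited papers differ on and is not modeled here. The function name `select_kv_to_keep` and the `recent` window parameter are hypothetical and do not come from any of the listed systems.

```python
from typing import List, Sequence


def select_kv_to_keep(scores: Sequence[float], budget: int, recent: int) -> List[int]:
    """Return the cache positions to keep, in their original order.

    scores[i] is an importance estimate for cached token i (an assumption:
    the scoring model itself is outside this sketch). The last `recent`
    tokens are always kept; the rest of the budget goes to the
    highest-scoring older tokens.
    """
    n = len(scores)
    if budget >= n:
        return list(range(n))  # everything fits, nothing to evict
    recent = min(recent, budget)
    recent_start = n - recent
    keep = set(range(recent_start, n))
    # Rank older tokens by importance; ties go to the later position.
    older = sorted(range(recent_start), key=lambda i: (scores[i], i), reverse=True)
    keep.update(older[: budget - recent])
    return sorted(keep)


# Toy scores for six cached tokens, with room for only four.
kept = select_kv_to_keep([0.9, 0.1, 0.05, 0.7, 0.2, 0.3], budget=4, recent=2)
print(kept)
```

Setting `recent=0` gives pure importance-based eviction, while `recent=budget` degenerates to the sliding-window heuristic that the table lists as a baseline.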
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| RULER (Long-Context Recall) | Accuracy | >95% | MemAgent (2025) |
| Needle-in-a-Haystack (NIAH) | Accuracy | 100% | InfLLM (2024) |
| BABILong (Multi-step Reasoning over Long Context) | Average Accuracy | +37.1% over RMT | LM2 (2025) |
⚠️ Known Limitations (5)
- Most KV cache eviction methods rely on prompt-phase attention patterns that poorly predict actual decoding-time importance, causing loss of critical information for complex reasoning tasks. (affects: KV Cache Eviction with Importance Scoring, Hierarchical and Compressed Attention)
Potential fix: DapQ and LookaheadKV address this by simulating future query positions or using learnable lookahead tokens to better predict decoding-time importance.
- RL-based memory methods require extensive training and careful reward design; the dense memory-quality reward often needs ground-truth answers, limiting applicability to tasks without clear correctness signals. (affects: RL-based Active Memory Management)
Potential fix: SUPO's joint optimization of summarization and task performance within the MDP framework and Agent-Omit's dual-sampling strategy offer paths toward more generalizable RL-based approaches.
- Memory-augmented architectures require modifications to the base model or additional training, making them harder to apply to existing deployed models compared to training-free methods. (affects: Latent-Space Memory Augmentation, Hierarchical and Compressed Attention)
Potential fix: Training-free approaches like InfLLM and Slow-Fast Inference demonstrate effective long-context handling without architectural changes, though they may not match the performance ceiling of trained approaches.
- Personalization benchmarks reveal that models struggle with implicit preference tracking and dynamic persona updates, with accuracy dropping to 30-50% on tasks requiring integration of new information with historical context. (affects: Personalization via Long-Context Memory)
Potential fix: PersonaMem-v2's RL-trained agentic memory approach shows promise, achieving 55% accuracy while using 16x fewer tokens by maintaining compact, dynamically updated user profiles.
- Evaluation frameworks remain fragmented: most benchmarks test retrieval (needle-in-a-haystack) but not complex operations like state tracking, editing, or forgetting, making it difficult to compare methods holistically. (affects: KV Cache Eviction with Importance Scoring, Latent-Space Memory Augmentation, RL-based Active Memory Management)
Potential fix: The programmable test framework from paper 1232 and the layered evaluation protocol from paper 8490 offer more comprehensive approaches that decompose memory into atomic capabilities.
📚 View major papers in this topic (9)
- Efficient Memory Management for Large Language Model Serving with PagedAttention (2023-09) 9
- The Missing Memory Hierarchy: Demand Paging for LLM Context Windows (2026-03) 9
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent (2025-07) 9
- StateLM: To the Rescue of Long-Horizon Reasoning with Recursive Memory (2026-02) 9
- Compress & Attend Transformer (2025-12) 9
- Recursive Language Models (2025-01) 9
- PersonaMem-v2: Implicit Personas (2025-12) 9
- Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics (2025-05) 9
- Memory in Large Language Models: Mechanisms, Evaluation and Evolution (2025-09) 9
💡 The infrastructure for handling extended sequences directly enables Conversational and Dialogue Memory, where the challenge shifts from raw token management to maintaining persistent user preferences, persona traits, and contextual facts across multi-session dialogues spanning weeks or months.
Conversational and Dialogue Memory
What: Research on enabling AI systems to maintain persistent, coherent memory across extended multi-turn and multi-session conversations, including retaining user preferences, persona traits, and contextual facts over time.
Why: As LLM-based assistants become daily-use tools, users expect them to remember past interactions, adapt to personal preferences, and maintain consistency—capabilities that fixed context windows fundamentally cannot support.
Baseline: The conventional approach feeds recent conversation history directly into the LLM's context window or uses simple top-k embedding similarity retrieval over stored dialogue, which fails as conversations grow beyond the context limit and retrieval becomes imprecise.
- Context window limitations prevent LLMs from accessing full conversation histories spanning weeks or months of interaction
- Retrieving the right memory at the right time requires understanding query intent, not just surface-level keyword or embedding similarity
- User preferences are often expressed implicitly through behavior rather than explicit statements, making them difficult to detect and store
- Memory must be dynamically updated—adding, merging, and deleting information—as user facts and preferences evolve over time
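
The top-k embedding-similarity baseline described above can be sketched in a few lines, which also makes its failure mode concrete. Everything in the example is hypothetical: the three-dimensional vectors are hand-made stand-ins for real sentence embeddings, and the stored memories are invented for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          memory: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors (axes: food, health, travel); a real system would use an encoder.
memory = [
    ("Asked for pizza places downtown", [0.9, 0.0, 0.1]),
    ("Doctor advised cutting out meat", [0.1, 0.9, 0.0]),
    ("Booked a flight to Lisbon", [0.0, 0.0, 1.0]),
]
query = [1.0, 0.0, 0.2]  # stands in for "suggest a restaurant"

print(top_k(query, memory, k=1))
```

In this toy setup the health-related memory ranks last even though it is the one that should constrain the recommendation, which is the intent-versus-similarity gap the second challenge describes.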
🧪 Running Example
Baseline: A standard LLM with a fixed context window has no access to the three-week-old conversation. Even a basic RAG system may fail because the query 'suggest a restaurant' has low semantic similarity to the earlier discussion about dietary changes, retrieving irrelevant past exchanges instead.
Challenge: The user's vegetarian preference was mentioned implicitly during a health discussion, not as a direct 'I am vegetarian' statement. Retrieving this requires understanding that dietary preferences are relevant to restaurant recommendations—a semantic leap that surface-level retrieval misses.
📈 Overall Progress
The field evolved from fixed context windows to autonomous, self-improving memory systems that organize, retrieve, and evolve conversational knowledge using OS-inspired and graph-based paradigms.
📂 Sub-topics
Memory Architecture and Management
10 papers
Systems that structure and manage conversational memory using hierarchical, graph-based, or OS-inspired architectures to enable persistent, organized storage and efficient retrieval across long time horizons.
Personalization and Preference Learning
9 papers
Methods for learning, storing, and applying user-specific preferences and communication styles, including parametric fine-tuning, causal modeling, and embedding-based approaches.
Retrieval-Augmented Dialogue Memory
3 papers
Approaches that enhance multi-turn dialogue by dynamically retrieving relevant context from conversation history, social interactions, or structured memory stores using tool-augmented or history-aware retrieval strategies.
Evaluation Benchmarks and Datasets
4 papers
Benchmarks and evaluation frameworks that measure LLM performance on long-term conversational memory, preference adherence, multi-turn instruction following, and cognitive reasoning over dialogue history.
💡 Key Insights
💡 OS-inspired memory hierarchies with self-directed paging enable unbounded conversation length without losing critical context.
💡 Graph-based memory structures outperform flat vector stores for multi-hop reasoning across interconnected user facts.
💡 Implicit user preferences are far harder to capture than explicit statements, with most models scoring below 10% zero-shot.
💡 Reinforcement learning can train compact memory summaries that outperform frontier models using 16x fewer tokens.
💡 Cognitive memory evaluation reveals that factual recall scores drastically overestimate true conversational understanding.
💡 Reflective memory that learns from its own retrieval successes adapts to individual users without human annotation.
📅 Timeline
Research progressed from pioneering OS-inspired memory architectures (MemGPT, 2023) through graph-based and reflective memory systems (2024-2025) to implicit personalization via reinforcement learning and cognitive evaluation frameworks that expose fundamental gaps between factual recall and true preference understanding (2025-2026).
- (MemGPT, 2023) pioneered OS-inspired virtual context management, achieving a +60% accuracy improvement on deep memory retrieval tasks
- (Pearl, 2023) introduced generation-calibrated retrieval for personalized writing, training retrievers whose scores correlate with downstream generation quality
- (DPeM, 2023) applied dual-process memory with working, short-term, and long-term tiers to medical assistant personalization
- (LoCoMo, 2024) established the first very long-term dialogue benchmark (300+ turns), revealing that LLMs lag behind humans by 56-73% on memory tasks
- (EMG, 2024) introduced editable memory graphs with RL-driven traversal, outperforming baselines by ~10.6% on QA after weeks of continuous edits
- (PrefEval, 2025) revealed that preference following accuracy falls below 10% for most models in zero-shot settings
- (RMM, 2025) introduced reflective memory with prospective topic decomposition and retrospective RL-trained reranking, gaining +10% on LongMemEval
- (A-Mem, 2025) applied Zettelkasten-inspired atomic notes with self-evolving links, improving F1 by 35% over LoCoMo baselines while reducing token usage by 85-93%
- Mem0 (Mem0, 2025) introduced dynamic dual-phase memory extraction with a graph variant, achieving a 26% improvement over baselines with a 91% latency reduction
- PersonaMem-v2 (PersonaMem-v2, 2025) demonstrated that RL-trained agentic memory outperforms GPT-5 on implicit personalization using 16x fewer tokens
- (SGMem, 2025) combined sentence-level graphs with joint indexing of raw dialogue and generated summaries, outperforming LightRAG on LongMemEval
- (MemoryOS, 2025) extended OS-inspired memory with segmented paging and heat-based eviction, achieving +49% F1 on LoCoMo
- (LoCoMo-Plus, 2026) exposed that cognitive memory performance collapses compared to factual recall, and that task disclosure artificially inflates scores
- (TA-Mem, 2026) transformed retrieval into an agentic tool-selection task, gaining +7 F1 over Mem0 on temporal QA while using 4x fewer tokens than full-context methods
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Virtual Context Management | Let the LLM manage its own memory like an operating system manages virtual memory, swapping data between limited fast context and unlimited external storage. | Fixed context window approaches that truncate or summarize old conversation history | MemGPT (2023), MemoryOS (2025), LLM-based (2023) |
| Graph-Based Memory Organization | Structure conversational memories as interconnected graphs to enable relational reasoning and precise multi-hop retrieval that flat vector stores cannot support. | Flat vector-based memory stores that retrieve isolated facts without relational context | Crafting Personalized Agents through Retrieval-Augmented... (2024), Mem0 (2025), SGMem (2025), Agentic Memory (2025) |
| Reflective and Adaptive Memory Management | Use the LLM's own memory usage patterns as feedback to train a reranker that adapts retrieval to specific user interaction styles without human labels. | Static retrieval methods that use fixed similarity thresholds regardless of user or query type | In Prospect and Retrospect: Reflective... (2025) |
| Tool-Augmented and Dynamic Retrieval | Expose multiple memory indices as callable tools, letting the LLM agent decide the retrieval strategy rather than forcing all queries through a single embedding-similarity pipeline. | Single-index top-k similarity retrieval that treats all query types identically | TA-Mem (2026), DH-RAG (2025), Social-RAG (2025) |
| Parametric Personalization via Fine-tuning | Encode user conversation history and preferences directly into model parameters using efficient fine-tuning, eliminating the need for runtime retrieval. | Retrieval-augmented generation which requires external storage management and adds retrieval latency | On the Way to LLM... (2024), Enabling On-Device Large Language Model... (2023), Latent Inter-User Difference Modeling for... (2025) |
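
As a rough illustration of the Virtual Context Management row, the sketch below keeps a bounded main context and pages overflow into an unbounded archive rather than discarding it. It is a toy under stated assumptions, not MemGPT's or MemoryOS's implementation: the class `VirtualContext` and its methods are hypothetical, and simple word overlap stands in for the LLM-issued memory calls those systems use to decide what to page back in.

```python
from collections import deque
from typing import Deque, List


class VirtualContext:
    """Bounded main context backed by an unbounded archive (toy sketch)."""

    def __init__(self, max_messages: int) -> None:
        self.max_messages = max_messages
        self.main: Deque[str] = deque()  # what would be sent to the LLM
        self.archive: List[str] = []     # external storage, never truncated

    def append(self, message: str) -> None:
        self.main.append(message)
        # Page the oldest messages out instead of dropping them.
        while len(self.main) > self.max_messages:
            self.archive.append(self.main.popleft())

    def recall(self, query: str, limit: int = 1) -> List[str]:
        """Page archived messages that share a word with the query back in."""
        words = set(query.lower().split())
        hits = [m for m in self.archive if words & set(m.lower().split())][:limit]
        for m in hits:
            self.archive.remove(m)
            self.append(m)  # may page out something older to make room
        return hits


ctx = VirtualContext(max_messages=2)
for msg in ["I stopped eating meat", "The weather is nice", "Any plans tonight?"]:
    ctx.append(msg)

recalled = ctx.recall("meat free dinner")
print(recalled, list(ctx.main), ctx.archive)
```

The design point is that eviction moves data down a tier instead of deleting it, so an old fact stays reachable once something triggers a recall.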
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | F1 / BLEU-1 | +49.11% F1 over baselines | MemoryOS (2025) |
| LongMemEval | Accuracy | 70.4% | In Prospect and Retrospect: Reflective... (2025) |
| PrefEval | Preference Following Accuracy | Significant improvement over zero-shot | Do LLMs Recognize Your Preferences?... (2025) |
⚠️ Known Limitations (5)
- Scalability of memory operations: As conversation histories grow to thousands of sessions, memory extraction, graph updates, and retrieval become computationally expensive, limiting real-time responsiveness. (affects: Graph-Based Memory Organization, Virtual Context Management, Reflective and Adaptive Memory Management)
Potential fix: Mem0 addresses latency with efficient dual-phase pipelines (91% p95 latency reduction), and SGMem uses lightweight NLTK-based graph construction instead of expensive LLM-based entity-relation extraction.
- Evaluation gaps for implicit and cognitive memory: Most benchmarks test explicit factual recall, which overestimates real-world performance where preferences are implicit and constraints require inference beyond lexical overlap. (affects: Causal and Implicit Preference Modeling, Parametric Personalization via Fine-tuning)
Potential fix: LoCoMo-Plus introduces constraint-consistency evaluation that measures behavioral adherence rather than string matching, providing a more realistic assessment of memory capabilities.
- Privacy and on-device constraints: Storing detailed user conversation histories raises privacy concerns, and on-device personalization is limited by storage capacity and the inability to offload data for cloud-based annotation. (affects: Parametric Personalization via Fine-tuning, Virtual Context Management)
Potential fix: SDSS proposes self-supervised data selection with entropy-based filtering and local synthetic data augmentation, avoiding cloud offloading while maintaining personalization quality.
- Memory staleness and conflict resolution: As user preferences evolve over time, stored memories can become outdated or contradictory, and most systems lack principled mechanisms to detect and resolve conflicts between old and new information. (affects: Graph-Based Memory Organization, Parametric Personalization via Fine-tuning, Tool-Augmented and Dynamic Retrieval)
Potential fix: A-Mem implements memory evolution where new experiences trigger rewrites of old memory contexts, and Mem0 uses an update/delete pipeline to manage changing facts.
- Degradation under adversarial and conflicting instructions: Models that perform well on standard memory retrieval show severe degradation when faced with adversarial questions or entangled multi-turn constraints, indicating brittle memory integration. (affects: Virtual Context Management, Causal and Implicit Preference Modeling)
Potential fix: MultiTurnInstruct identifies that stronger reasoning does not guarantee better conflict resolution, suggesting that dedicated training on constraint-conflict scenarios may be needed beyond general instruction tuning.
📚 View major papers in this topic (9)
- MemGPT: Towards LLMs as Operating Systems (2023-10) 9
- PersonaMem-v2: Implicit Personas (2025-12) 9
- Evaluating Very Long-Term Conversational Memory of LLM Agents (2024-02) 8
- LoCoMo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents (2026-02) 8
- Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs (2025-02) 8
- In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents (2025-03) 8
- Mem0: Memory-Centric Architecture for Large Language Models (2025-04) 7
- MemoryOS: A Comprehensive Memory Operating System for AI Agents (2025-11) 7
- TA-Mem: Tool-Augmented Autonomous Memory Retrieval for LLM in Long-Term Conversational QA (2026-03) 7
💡 The accumulation of user-specific knowledge across extended dialogues inevitably raises the challenge of Continual Learning and Catastrophic Forgetting—how models can sequentially absorb new information from evolving interactions without overwriting the knowledge they have already acquired.
Continual Learning and Catastrophic Forgetting
What: Continual learning studies how models can sequentially acquire new knowledge or skills from non-stationary data streams without losing previously learned information, a failure mode known as catastrophic forgetting.
Why: Real-world AI systems must adapt to evolving data, new tasks, and changing user needs over time. Without continual learning, models require costly full retraining or suffer degraded performance on earlier capabilities.
Baseline: The conventional approach is sequential fine-tuning, where a model is updated on new task data using standard gradient descent. This typically causes catastrophic forgetting because new parameter updates overwrite weights that encoded prior knowledge.
- Stability-plasticity dilemma: balancing the ability to learn new information (plasticity) while retaining old knowledge (stability)
- Scalability: maintaining performance as the number of sequential tasks grows into the hundreds or thousands without proportional growth in memory or parameters
- Task-agnostic inference: performing well without knowing which task a given input belongs to (class-incremental setting), which is far harder than task-incremental settings
- Evaluation beyond accuracy: measuring not just final performance but also backward transfer (forgetting), forward transfer, and computational overhead
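
The metrics named in the last challenge are usually computed from a matrix of per-task accuracies recorded after each training stage. The sketch below uses one common set of definitions (final average accuracy, backward transfer, average forgetting, forward transfer against an untrained baseline); individual papers vary in the details, and all numbers in the example are made up.

```python
from typing import Dict, Sequence


def continual_metrics(acc: Sequence[Sequence[float]],
                      baseline: Sequence[float]) -> Dict[str, float]:
    """acc[t][i]: accuracy on task i after training on tasks 0..t.

    baseline[i]: accuracy of an untrained model on task i.
    Requires at least two tasks.
    """
    T = len(acc)
    final = acc[T - 1]
    # Mean accuracy over all tasks once training has finished.
    avg_acc = sum(final) / T
    # Backward transfer: change on old tasks since they were first learned
    # (negative values indicate forgetting).
    bwt = sum(final[i] - acc[i][i] for i in range(T - 1)) / (T - 1)
    # Forgetting: drop from the best accuracy a task ever reached.
    forgetting = sum(
        max(acc[t][i] for t in range(i, T - 1)) - final[i] for i in range(T - 1)
    ) / (T - 1)
    # Forward transfer: gain on a task before it has been trained on.
    fwt = sum(acc[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)
    return {
        "final_average_accuracy": avg_acc,
        "backward_transfer": bwt,
        "forgetting": forgetting,
        "forward_transfer": fwt,
    }


# Three sequential tasks; rows are training stages, columns are tasks.
acc = [
    [0.92, 0.10, 0.10],
    [0.60, 0.90, 0.15],
    [0.38, 0.80, 0.88],
]
metrics = continual_metrics(acc, baseline=[0.10, 0.10, 0.10])
print(metrics)
```

Reporting backward transfer or forgetting alongside final accuracy separates a model that never learned a task well from one that learned it and then lost it.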
🧪 Running Example
Baseline: Standard fine-tuning on the furniture dataset causes the model to lose electronics-specific terminology, troubleshooting flows, and product knowledge. After training on furniture, accuracy on electronics queries drops from 92% to below 40%.
Challenge: The model has limited capacity to store both domains. Furniture training updates the same parameters that encoded electronics knowledge, and without access to the original electronics training data, there is no way to remind the model of what it once knew.
📈 Overall Progress
The field has shifted from treating forgetting as an unavoidable side effect to be mitigated, toward provably forgetting-free architectures guided by information-theoretic bounds and modular memory systems.
📂 Sub-topics
Theoretical Frameworks and Taxonomies
2 papers
Formal information-theoretic analyses and comprehensive surveys that explain why forgetting occurs and categorize the landscape of continual learning strategies.
Parameter-Efficient Continual Adaptation
5 papers
Methods that freeze the pretrained backbone and apply lightweight modifications (gating, representation interventions, classifier alignment) to adapt to new tasks while preserving old knowledge.
Dynamic Routing and Architecture Growth
3 papers
Approaches that expand or dynamically route through network components to accommodate new tasks, including energy-based routing, soft masking, and adaptive network growth.
Replay and Rehearsal Optimization
3 papers
Methods that store and selectively replay past experiences to mitigate forgetting, with innovations in replay scheduling, adversarial diversification, and dual-buffer strategies.
Continual Knowledge Editing for LLMs
4 papers
Techniques for injecting, updating, or correcting knowledge in large language models over long sequences of edits without degrading prior knowledge or general capabilities.
Agent Memory and Continual Adaptation
4 papers
Memory systems for autonomous agents that learn reusable workflows, curate episodic experiences, and adapt retrieval strategies without fine-tuning the underlying LLM.
💡 Key Insights
💡 Pre-trained backbones retain more knowledge than assumed; forgetting is often a classifier alignment problem, not a representation problem.
💡 Information-theoretic bounds prove that sequential state-based learners face an impossibility triangle between zero forgetting, online learning, and finite parameters.
💡 Modular frozen adapters with routing mechanisms scale to thousands of sequential edits without interference between updates.
💡 Representation-space interventions with orthogonality constraints outperform weight-space fine-tuning across all incremental learning settings.
💡 Agent memory systems that optimize retrieval rather than model parameters enable continual improvement without any fine-tuning.
💡 Adversarial diversification of replay buffers is more effective than increasing buffer size for combating memory overfitting.
📅 Timeline
Early work focused on regularization and simple replay, but pre-trained model dominance revealed that forgetting is often a classifier alignment issue rather than a representation problem. The latest wave leverages modular frozen adapters, energy-based routing, and representation-space interventions to decouple stability from plasticity, while agent memory systems extend continual learning beyond supervised settings into interactive, open-ended environments.
- SEQ* (SEQ*, 2023) revealed that pre-trained language model backbones retain knowledge through sequential training, challenging the assumption that catastrophic forgetting is inevitable and showing that simple freezing strategies outperform complex methods
- (Lifelong Learning Primer, 2024) established a unified taxonomy categorizing strategies into regularization, memory, and architecture families with formal metrics for forgetting and intransigence
- (ADRM, 2024) applied adversarial perturbations to replay buffers, achieving a +32.35% robustness improvement on corrupted data compared to standard rehearsal methods
- (ICU, 2024) introduced iterative contrastive unlearning for selective knowledge removal without model collapse, reducing extraction likelihood from 0.40 to 0.04
- (AWM, 2024) pioneered workflow induction for agents, enabling a +51.1% relative improvement on WebArena through reusable parameterized routines
- (Agent S, 2024) combined narrative and episodic memory in a hierarchical planning framework, achieving an 83.6% relative improvement on OS-level task automation
- (MEGa, 2025) introduced per-memory LoRA adapters with context-key gating, maintaining >90% recall after 50 sequential knowledge injection tasks where baselines collapsed to <10%
- (Memento, 2025) formalized agent learning as a Memory-augmented MDP with online RL-based case retrieval, achieving 87.88% Pass@3 on GAIA without any LLM fine-tuning
- (MEMOIR, 2025) scaled lifelong model editing to 15,000 sequential edits on LLaMA-3-8B using sparse residual memory with TopHash retrieval
- (CCC, 2026) proved the Impossibility Triangle: zero forgetting, online learning, and finite parameters cannot coexist for sequential state-based learners, establishing information-theoretic lower bounds for the field
- (RwF, 2026) introduced energy-based Hopfield routing for online continual learning, achieving 74.09% accuracy on Split-ImageNet-R with only 2.1% additional parameters
- (Panini, 2026) replaced text-chunk retrieval with structured semantic workspaces and reasoning chain retrieval, reducing token usage by 2-30x while improving QA accuracy by 5-7%
- (CoRe, 2026) shifted fine-tuning from weight space to representation space with orthogonality constraints, achieving state-of-the-art results across task-, domain-, and class-incremental settings
- (LCA, 2026) solved the classifier-backbone mismatch problem through Gaussian-based synthetic sample generation and incremental PEFT merging, leading on 7 benchmark datasets
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Information-Theoretic Forgetting Analysis | Zero forgetting requires the architecture's context channel capacity to equal or exceed the entropy of the task distribution. | Empirical intuitions about why some methods forget and others do not, replacing ad-hoc explanations with provable information-theoretic bounds | Context Channel Capacity (2026), An Introduction to Lifelong Supervised... (2024) |
| Frozen Backbone with Lightweight Adaptation | Keep the pretrained backbone frozen and learn only lightweight modifiers that steer existing features toward new tasks without overwriting shared representations. | Full fine-tuning and traditional parameter-efficient methods (LoRA, adapters, prompts) that update backbone parameters and suffer from representation drift | LCA (2026), Representation Finetuning for Continual Learning (2026), Gated Adaptation for Continual Learning... (2026), Learn or Recall? Revisiting Incremental... (2023) |
| Dynamic Routing and Architecture Growth | Decouple the instant routing decision (which subnetwork or prompt to use) from the slow gradient-based parameter updates, enabling immediate adaptation to distribution shifts. | Static prompt pools and fixed architecture methods that cannot adapt quickly enough for online learning or grow unnecessarily | Routing without Forgetting (2026), Don't Look Back in Anger:... (2026), Causally Sufficient and Necessary Feature... (2026) |
| Optimized Replay and Rehearsal | Schedule replay based on estimated forgetting risk per sample rather than fixed intervals or random selection, and diversify the limited replay buffer to prevent overfitting. | Fixed-interval random replay and simple experience replay buffers that waste compute on already-remembered examples or overfit to stored samples | MSSR (2026), Adversarially Diversified Rehearsal Memory (ADRM) (2024), ARROW (2026) |
| Modular Knowledge Editing for LLMs | Treat each knowledge edit as a separate frozen module retrieved by input similarity, so edits never interfere with each other and can be individually added or removed. | Global parameter editing methods (ROME, MEMIT) that degrade rapidly after hundreds of edits due to parameter interference | MEMOIR (2025), MEGa (2025), Reversible Lifelong Model Editing via... (2026) |
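
The modular editing row can be pictured as a keyed store of independent edits sitting in front of an unchanged base model. The sketch below is a deliberate simplification and not the mechanism of MEMOIR or MEGa: a stored answer string stands in for a frozen parameter module, cosine similarity with a fixed threshold stands in for the learned gating or hashing those methods use, and every name (`EditMemory`, `answer`, `base_model`) is hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class EditMemory:
    """Each edit is an isolated (key, payload) pair; none shares parameters."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.edits: Dict[str, Tuple[List[float], str]] = {}

    def add(self, edit_id: str, key: Sequence[float], payload: str) -> None:
        self.edits[edit_id] = (list(key), payload)

    def remove(self, edit_id: str) -> None:
        self.edits.pop(edit_id, None)  # removing one edit leaves the rest intact

    def lookup(self, query: Sequence[float]) -> Optional[str]:
        """Return the payload of the most similar edit above the threshold."""
        best_id, best_sim = None, self.threshold
        for edit_id, (key, _) in self.edits.items():
            sim = cosine(query, key)
            if sim >= best_sim:
                best_id, best_sim = edit_id, sim
        return self.edits[best_id][1] if best_id is not None else None


def answer(query: Sequence[float], memory: EditMemory,
           base_model: Callable[[Sequence[float]], str]) -> str:
    """Use a matching edit if one exists; otherwise fall back to the base model."""
    edited = memory.lookup(query)
    return edited if edited is not None else base_model(query)


mem = EditMemory(threshold=0.9)
mem.add("edit-a", [1.0, 0.0], "edited answer A")
mem.add("edit-b", [0.0, 1.0], "edited answer B")
base = lambda q: "base model answer"

print(answer([0.95, 0.05], mem, base))  # close to edit-a's key
print(answer([0.7, 0.7], mem, base))    # matches neither edit closely
```

Because unrelated queries never reach an edit and each edit lives in its own slot, adding the thousandth edit cannot overwrite the first, which is the property the table contrasts with global parameter editing.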
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| Split-ImageNet-R | Final Average Accuracy | 74.09% | Routing without Forgetting (2026) |
| Split-MNIST (Continual Learning) | Accuracy / Forgetting Rate | 98.8% accuracy, ~0% forgetting | Context Channel Capacity (2026) |
| WebArena (Agent Task Completion) | Success Rate | +51.1% relative improvement | Agent Workflow Memory (2024) |
⚠️ Known Limitations (5)
- Most continual learning methods are evaluated on relatively short task sequences (5-20 tasks), making it unclear how they perform at the scale of hundreds or thousands of tasks encountered in real deployment scenarios. (affects: Frozen Backbone with Lightweight Adaptation, Dynamic Routing and Architecture Growth, Optimized Replay and Rehearsal)
Potential fix: MEMOIR demonstrates scaling to 15,000 edits using sparse residual memory, suggesting that modular approaches with efficient retrieval may overcome this limitation.
- Routing and modular methods require storing a growing number of modules (LoRA adapters, keys, masks), creating a linear memory overhead that may become prohibitive for resource-constrained environments like edge devices. (affects: Modular Knowledge Editing for LLMs, Dynamic Routing and Architecture Growth, Agent Workflow and Episodic Memory)
Potential fix: Module compression, periodic consolidation of similar modules, and sparse activation patterns (as in MEMOIR's TopHash) can reduce storage overhead.
- Class-incremental learning (where task identity is unknown at inference) remains significantly harder than task-incremental settings, with performance gaps of 20+ percentage points, yet is the most realistic deployment scenario. (affects: Frozen Backbone with Lightweight Adaptation, Dynamic Routing and Architecture Growth)
Potential fix: Context Channel Capacity analysis suggests that architectures with sufficient capacity (HyperNetworks) can close this gap; the Gradient Context Encoder reduced the gap from 23.3pp to 0.7pp on CIFAR-10.
- Agent memory systems rely on the quality of self-evaluation and workflow abstraction, which can propagate errors if the agent incorrectly assesses its own success or extracts misleading patterns from limited experience. (affects: Agent Workflow and Episodic Memory)
Potential fix: Memento's RL-based case retrieval optimization provides a principled way to learn which experiences are actually useful, rather than relying on heuristic self-evaluation.
- Most methods are benchmarked in controlled settings with clearly delineated task boundaries, whereas real-world data streams often have gradual, overlapping distribution shifts without explicit task demarcations. (affects: Optimized Replay and Rehearsal, Frozen Backbone with Lightweight Adaptation, Information-Theoretic Forgetting Analysis)
Potential fix: MAGIC Net's drift detection and adaptive strategy selection addresses this by operating on continuous streams without requiring task boundaries.
📚 View major papers in this topic (10)
- Context Channel Capacity: An Information-Theoretic Framework for Understanding Catastrophic Forgetting (2026-03) 9
- Memento: A Novel Learning Paradigm for Adaptive LLM Agents without Fine-tuning (2025-09) 9
- Routing without Forgetting (2026-03) 8
- LCA: Local Classifier Alignment for Continual Learning (2026-03) 8
- Agent Workflow Memory (2024-09) 8
- MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention (2025-12) 8
- Panini: Continual Learning in Token Space via Structured Memory (2026-02) 8
- Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language (2026-02) 8
- An Introduction to Lifelong Supervised Learning (2024-05) 8
- Agent S: An Open Agentic Framework that Uses Computers Like a Human (2024-10) 8
💡 The stability-plasticity dilemma central to continual learning has deep roots in cognitive science, where human memory naturally balances retention and forgetting through consolidation, spreading activation, and episodic-semantic separation—principles that Cognitive and Human-like Memory research translates into practical AI architectures.
Cognitive and Human-like Memory
What: Research that draws on cognitive science—particularly models of working memory, episodic/semantic memory, attention, and memory consolidation—to design memory systems for LLMs and AI agents.
Why: Standard LLMs process each input statelessly or within a fixed context window, lacking the persistent, structured memory that enables humans to accumulate experience, maintain identity, and reason over long histories.
Baseline: Conventional approaches either expand the raw context window (brute-force token concatenation) or use flat vector-store retrieval (RAG) with no cognitive structure, treating all memories uniformly regardless of type, recency, or relevance.
- Bridging the gap between finite context windows and the need for persistent, long-term memory across sessions
- Retrieving relevant memories when the current query has no surface-level semantic overlap with stored information (cue-trigger disconnect)
- Balancing memory retention and forgetting to prevent redundancy, context drift, and hallucination from stale or conflicting memories
- Designing memory architectures that support identity persistence and continuity when the underlying model is upgraded or replaced
🧪 Running Example
Baseline: A standard LLM with flat vector retrieval finds no semantic match between 'Netflix' and 'medical exams,' so it returns generic trending recommendations with no awareness of the user's stress or study schedule.
Challenge: The relevant memory ('stressed about exams') has zero lexical overlap with the current query ('Netflix tonight'), so keyword and embedding-based retrieval both fail. Moreover, the memory is months old and may have been lost to context window limits.
📈 Overall Progress
The field evolved from simple differentiable memory lookups to sophisticated cognitive architectures with dual-memory stores, active consolidation, and controllable memory mirroring human cognition.
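The "differentiable memory lookup" at the start of this progression can be sketched as soft attention over memory slots, repeated for several hops. This is a minimal illustration rather than the MemN2N implementation: the function names `soft_read` and `multi_hop` are hypothetical, the vectors are toy values, and real models learn separate input and output embedding matrices instead of taking keys and values as given.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_read(query, keys, values):
    """One hop: attention weights over every slot, then a weighted sum of values."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights

def multi_hop(query, keys, values, hops=2):
    """Each hop adds the read vector to the query state (u <- u + o)."""
    u = list(query)
    for _ in range(hops):
        o, _ = soft_read(u, keys, values)
        u = [a + b for a, b in zip(u, o)]
    return u
```

Because every step is a smooth function of the inputs, gradients can flow through the lookup, which is what removes the need for per-step supervision of memory access.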
📂 Sub-topics
Dual-Memory and Consolidation Architectures
7 papers
Systems that explicitly separate memory into episodic and semantic stores, often incorporating sleep-like consolidation, working memory gating, and forgetting mechanisms inspired by hippocampal and neocortical processes.
Associative Memory Networks
3 papers
Architectures that replace standard Transformer attention with associative memory units, achieving compositional reasoning through transparent memory operations and supporting multi-level (short-term, long-term, persistent) memory stores.
Attention and Cognitive Load Optimization
4 papers
Methods that improve how models allocate attention over long contexts, drawing on cognitive load theory and working memory constraints to make attention more efficient and context-aware.
Cognitive Memory Benchmarks and Taxonomies
4 papers
Surveys, benchmarks, and evaluation frameworks that assess memory capabilities from a cognitive science perspective, moving beyond simple factual recall to test implicit constraint adherence and cognitive memory types.
Cognitive Models for Agent Behavior
3 papers
Research applying cognitive memory models to agent decision-making, including navigation under memory constraints, sentence processing with finite particles, and detecting cognitive degradation in autonomous agents.
💡 Key Insights
💡 Separating memory into episodic and semantic stores consistently outperforms flat retrieval for personalization and long-horizon tasks.
💡 Active forgetting and sleep-like consolidation are essential to prevent memory bloat, context drift, and hallucination from stale information.
💡 Current memory benchmarks dramatically overestimate model capabilities by testing only explicit factual recall, not implicit constraint adherence.
💡 Associative memory networks offer a transparent, scalable alternative to Transformers with superior context extrapolation.
💡 Treating position assignment as a cognitive load optimization problem yields substantial improvements on long-context reasoning tasks.
💡 Making neural memory controllable via natural language instructions transforms memory from passive recording to active knowledge management.
📅 Timeline
Early work (2015) established differentiable memory access; 2023-2024 introduced efficient and transparent alternatives to Transformer attention; 2025 saw an explosion of cognitive-science-inspired architectures applied to robotics, personalization, and long-context tasks; 2026 has shifted focus to evaluation reform, memory governance, and user-controllable memory systems.
- MemN2N (2015) introduced end-to-end trainable memory with multi-hop soft attention, achieving 3.2% mean error on bAbI QA and establishing the paradigm of differentiable memory access
- CMANP (2023) achieved constant-memory attention via rolling log-sum-exp updates, enabling Neural Processes on resource-constrained devices
- Memory Mosaics (2024) replaced Transformer attention with associative memory units, matching Transformer perplexity while achieving transparent predictive disentanglement
- RePo (2025) applied Cognitive Load Theory to learn content-dependent token positions, improving RULER benchmark scores by 11 points over fixed position encoding
- Focus Directions (2025) identified sparse contextual heads and steerable attention vectors, boosting multi-doc QA by +7.7% EM without any training
- Two major surveys (3D-8Q Taxonomy, 2025; Cognitive Memory Taxonomy, 2025) mapped human memory types to LLM components, establishing shared vocabulary for the field
- PRIME (2025) formalized episodic-semantic memory for LLM personalization with self-distilled reasoning traces, demonstrating that semantic memory outperforms episodic memory for capturing user traits
- MemoryVLA (2025) brought dual-stream perceptual-cognitive memory to robotics, achieving a +26% improvement on real-world long-horizon tasks over the CogACT baseline
- Memory Mosaics v2 (2025) scaled associative memory to 10B parameters and 1T training tokens, outperforming Transformers by 12-15% on multi-document QA
- Memory Bear (2025) introduced sleep-based consolidation and Ebbinghaus forgetting curves, reducing inference token usage by ~90%
- QSAF (2025) defined a six-stage cognitive degradation lifecycle for AI agents, identifying critical memory drift and planner entrapment vulnerabilities
- LoCoMo-Plus (2026) revealed that cognitive memory collapses across all models when implicit constraints are tested, fundamentally challenging existing memory evaluation
- Tell Me What To Learn (2026) made neural memory controllable via natural language instructions, letting users specify what to remember or ignore
- CMA (2026) proposed that memory constitutes the agent's identity, introducing constitutional governance and inheritance protocols for persistent digital citizens
- POMDP (2026) modeled web navigation as a decision process under memory constraints, replicating human backtracking and partial scanning behaviors
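The "rolling log-sum-exp" update credited to CMANP in the timeline can be illustrated with the general online-softmax trick: attention over a stream is computed one key/value pair at a time while storing only a running maximum, a running normalizer, and a running weighted sum. The sketch below shows that generic trick under simple assumptions (scalar scores supplied by the caller, a hypothetical `StreamingAttention` class); it is not the CMANP block itself.

```python
import math

class StreamingAttention:
    """Softmax-weighted average over a stream, using O(dim) state."""

    def __init__(self, dim):
        self.m = float("-inf")   # running maximum of the scores seen so far
        self.s = 0.0             # running sum of exp(score - m)
        self.acc = [0.0] * dim   # running sum of exp(score - m) * value

    def update(self, score, value):
        new_m = max(self.m, score)
        # Rescale the old statistics to the new maximum (nothing to rescale at first).
        scale = math.exp(self.m - new_m) if self.s > 0.0 else 0.0
        w = math.exp(score - new_m)
        self.s = self.s * scale + w
        self.acc = [a * scale + w * v for a, v in zip(self.acc, value)]
        self.m = new_m

    def read(self):
        # Only valid after at least one update.
        return [a / self.s for a in self.acc]

def batch_attention(scores, values):
    """Reference implementation that holds every score in memory at once."""
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    total = sum(ws)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(ws, values)) / total for i in range(dim)]
```

The streaming and batch versions agree up to floating-point error, but the streaming state does not grow with the number of items processed.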
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Dual-Memory Systems | Splitting AI memory into experience-specific recall and abstract knowledge stores, with brain-inspired consolidation bridging the two. | Flat vector-store retrieval (RAG) that treats all memories uniformly without distinguishing episodic experiences from generalized knowledge | PRIME (2025), MemoryVLA (2025), Cognitive algorithms and systems of... (2026) |
| Cognitively-Grounded Memory Orchestration | Memory systems that actively consolidate, forget, and reorganize themselves—mimicking human sleep and forgetting—rather than passively accumulating information. | Static memory stores that grow without bound, causing redundancy, context drift, and increased hallucination risk | Memory Bear (2025), Memory as Ontology (2026) |
| Associative Memory Networks | Replacing opaque Transformer attention with transparent associative memory units that naturally decompose complex prediction tasks into interpretable sub-components. | Standard Transformer attention, which is opaque and degrades with many in-context examples or extreme context lengths | Memory Mosaics (2024), Memory Mosaics at scale (2025) |
| End-to-End Memory Networks | Replacing hard, supervised memory lookups with soft attention over an external memory, enabling end-to-end training with multi-hop reasoning. | Original Memory Networks that required strong supervision for each memory access step | End-To-End (2015) |
| Cognitive Load-Aware Attention | Treating an LLM's attention budget as analogous to human working memory capacity, and optimizing how that budget is spent on relevant versus irrelevant context. | Fixed linear position encoding (e.g., RoPE) that treats all tokens as equally positioned regardless of relevance | RePo (2025), Eliciting Attention on Relevant Contexts... (2025), Summarize Before You Speak with... (2026) |
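
The forgetting side of cognitively-grounded orchestration can be made concrete with an Ebbinghaus-style retention curve, R = exp(-t / S), where t is the time since last access and S is a per-memory stability. The sketch below is an assumption-laden illustration: the names (`MemoryItem`, `reinforce`, `consolidate`), the 24-hour default stability, the doubling on reinforcement, and the 0.2 pruning threshold are all hypothetical and are not taken from Memory Bear.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    last_access: float        # time of last access, in hours
    strength: float = 24.0    # stability S in hours; larger values decay more slowly

def retention(item, now):
    """Ebbinghaus-style retention R = exp(-elapsed / S)."""
    elapsed = max(0.0, now - item.last_access)
    return math.exp(-elapsed / item.strength)

def reinforce(item, now, boost=2.0):
    """Accessing a memory resets its clock and makes it decay more slowly."""
    item.strength *= boost
    item.last_access = now

def consolidate(items, now, threshold=0.2):
    """Keep only memories whose retention is still at or above the threshold."""
    return [it for it in items if retention(it, now) >= threshold]
```

Running `consolidate` periodically bounds the store: memories that are never revisited fall below the threshold and drop out, while reinforced ones persist.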
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| RULER (Long-Context QA) | Average Accuracy / Exact Match | 12.3-14.8% higher than Transformers at 32k context | Memory Mosaics at scale (2025) |
| LIBERO (Robotic Manipulation Simulation) | Success Rate (%) | 96.5% | MemoryVLA (2025) |
| bAbI QA Tasks | Mean Error Rate (%) | 3.2% mean error | End-To-End (2015) |
⚠️ Known Limitations (5)
- Cognitive memory architectures add significant engineering complexity: maintaining dual stores, consolidation pipelines, and forgetting curves requires careful tuning and may not generalize across domains without adaptation. (affects: Dual-Memory Systems, Cognitively-Grounded Memory Orchestration)
Potential fix: Automated hyperparameter tuning for consolidation schedules and decay rates; meta-learning approaches that adapt memory parameters to domain characteristics.
- Evaluation of cognitive memory remains inadequate: most benchmarks still rely on factual recall, and even LoCoMo-Plus covers only limited types of implicit constraints, leaving many aspects of cognitive memory untested. (affects: Constraint-Consistency Evaluation, Dual-Memory Systems)
Potential fix: Developing richer benchmarks that test procedural memory, emotional memory, and cross-modal memory transfer, as called for by multiple survey papers.
- Associative memory networks (Memory Mosaics) match Transformers on standard benchmarks but have not yet been validated on the full range of downstream tasks where Transformers dominate, limiting confidence in their generality. (affects: Associative Memory Networks (Memory Mosaics))
Potential fix: Broader evaluation on instruction-following, code generation, and multi-turn dialogue tasks; hybrid architectures combining associative memory with Transformer layers.
- Memory governance and identity persistence (Memory-as-Ontology) remain purely conceptual with no quantitative evaluation, making it unclear whether constitutional memory can be implemented efficiently at scale. (affects: Cognitively-Grounded Memory Orchestration)
Potential fix: Developing prototype implementations with measurable identity consistency metrics and formal verification of governance constraints.
- Cognitive degradation in long-running agents (memory starvation, planner recursion) is identified, but defenses are reactive rather than preventive and have only been demonstrated on a limited set of models. (affects: Cognitive Degradation Lifecycle (QSAF))
Potential fix: Proactive memory health monitoring integrated into agent training, and standardized stress-test benchmarks for long-running agent deployments.
📚 View major papers in this topic (10)
- End-To-End Memory Networks (2015-03) 9
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation (2025-08) 8
- PRIME: Large Language Model Personalization with Cognitive Dual-Memory and Personalized Thought Process (2025-07) 8
- Memory Mosaics at scale (2025-07) 8
- Memory Efficient Neural Processes via Constant Memory Attention Block (2023-05) 8
- Tell Me What To Learn: Generalizing Neural Memory to be Controllable in Natural Language (2026-02) 8
- LoCoMo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents (2026-02) 8
- RePo: Reducing Extraneous Load in Context via Token Re-positioning (2025-01) 8
- Memory Mosaics (2024-05) 7
- Memory Bear: A Human-like Long-term Memory Architecture for Large Language Models (2025-12) 7
💡 Cognitive memory principles face their most demanding test in Embodied and Robotic Memory, where physical agents must maintain spatial maps, manipulation histories, and navigation context while operating under real-time constraints that demand tight integration of perception, memory, and action.
Embodied and Robotic Memory
What: This topic covers memory architectures for embodied agents and robots that must maintain, retrieve, and act upon information gathered from physical interactions over time, including navigation history, manipulation experience, spatial maps, and temporal state tracking.
Why: Embodied agents operating in real-world environments encounter fundamentally non-Markovian tasks—cooking a multi-step recipe, navigating back to a previously visited room, or correcting a failed grasp—where the current observation alone is insufficient. Effective memory systems bridge the gap between perception and long-horizon planning.
Baseline: Most conventional robotic policies and world models treat each observation independently (the Markov assumption), feeding only the current frame or a short fixed-length window of recent frames into the policy network, with no explicit mechanism to recall earlier events or spatial context.
- Balancing short-term reactivity (fast motor control) with long-term recall (tracking task progress across minutes or hours)
- Compressing massive, redundant sensory streams (video, point clouds, proprioception) into bounded memory without losing decision-critical information
- Maintaining spatial consistency when revisiting previously observed environments, especially under perceptual drift in generative world models
- Handling asynchronous information streams where visual perception updates slowly relative to high-frequency action control
🧪 Running Example
Baseline: A standard VLA policy observing only the current camera frame cannot remember which counters were already wiped. It may re-clean the same counter, skip the dishwasher step entirely because it forgot the instruction sequence, or fail to navigate back to the trash can because it has no spatial map of prior movements.
Challenge: The robot must track semantic progress (which sub-tasks are done), maintain spatial awareness (where the trash can was seen 10 minutes ago), and handle occlusions (items hidden behind doors). The full video history is too large to fit in a context window.
📈 Overall Progress
Embodied memory evolved from simple observation stacking to biologically inspired dual-stream architectures with explicit 3D grounding, enabling robots to sustain coherent behavior over 15-minute horizons.
📂 Sub-topics
Memory for Robotic Manipulation
5 papers
Memory architectures integrated into vision-language-action (VLA) models that enable robots to condition manipulation actions on past observations, overcoming the Markov assumption for multi-step tasks.
Geometry-Grounded Spatial Memory
4 papers
Memory systems that store and retrieve 3D geometric information—point clouds, depth maps, or spatial coordinates—enabling consistent reconstruction and scene generation during revisits.
Memory in World Models
2 papers
Techniques for extending the effective memory span of learned world models used in reinforcement learning and video generation, addressing catastrophic forgetting and perceptual drift.
Retrieval-Augmented Embodied Memory
3 papers
Systems that structure an embodied agent's experience into retrievable databases—hierarchical semantic forests or progressive trajectory stores—enabling query-driven recall for navigation and task planning.
💡 Key Insights
💡 The Markov assumption is a critical bottleneck: memory-augmented policies outperform memoryless baselines by 26–39% on temporal tasks.
💡 Explicit 3D geometry outperforms appearance-based retrieval for spatial memory, offering O(1) lookup and 98% storage reduction.
💡 Biological memory models (working vs. episodic vs. semantic) transfer effectively to robot architecture design.
💡 Multi-scale memory—dense visual tokens for short-term, compressed text summaries for long-term—enables 15-minute robot tasks.
💡 Self-generated experience progressively builds better retrieval databases, eliminating the need for expert demonstrations.
💡 Decoupling action frequency from perception frequency via hybrid caching resolves a fundamental VLA design bottleneck.
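The constant-time lookup claimed for geometry-indexed memory can be pictured as a hash map keyed by quantized 3D coordinates. The sketch below is a generic voxel hash, offered as an assumption about how such indexing might work; `VoxelMemory`, its 0.5 m voxel size, and the string payloads are hypothetical and do not reproduce Point3R or Memory Forcing.

```python
import math
from collections import defaultdict

class VoxelMemory:
    """Stores observations under the voxel that contains their 3D position."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.cells = defaultdict(list)

    def _key(self, xyz):
        # Quantize each coordinate; floor keeps negative coordinates consistent.
        return tuple(math.floor(c / self.voxel_size) for c in xyz)

    def store(self, xyz, payload):
        self.cells[self._key(xyz)].append(payload)

    def lookup(self, xyz):
        # Average-case O(1) dictionary access, independent of how much is stored.
        return list(self.cells.get(self._key(xyz), []))
```

A revisit to roughly the same place hits the same cell regardless of how the scene looks, which is why position-keyed retrieval avoids appearance-based drift.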
📅 Timeline
Early work (2024) established foundational memory primitives—spatial reconstruction memory and retrieval-augmented experience stores. By mid-2025, memory-augmented VLA models demonstrated dramatic gains on manipulation benchmarks, while 2026 brought scaling to long-horizon real-world tasks and principled solutions for asynchronous multi-modal memory streams.
- Spann3R (2024) introduced spatial memory with working and long-term components for real-time 3D reconstruction at 50+ FPS
- KARMA (2024) integrated long-term and short-term memory into LLM-based embodied planning via memory-augmented prompting
- Embodied-RAG (2024) built hierarchical semantic forests for kilometer-scale navigation retrieval, 7.38x faster than GraphRAG
- P-RAG (2024) introduced progressive self-experience retrieval for embodied task planning without ground-truth demonstrations
- SAM2Act (2025) set a new state of the art on RLBench (86.8%) and dominated memory-dependent tasks with 94.3% on MemoryBench, 39.3% above the next best baseline
- Point3R (2025) replaced implicit memory with explicit 3D spatial pointers and 3D RoPE, generalizing across 14 diverse reconstruction datasets
- MemoryVLA (2025) introduced perceptual-cognitive consolidation, improving by 26% over CogACT on real-world temporal manipulation tasks
- Memory Forcing (2025) introduced chained forward training on model rollouts with geometry-indexed retrieval, achieving 98.2% memory reduction for consistent scene generation
- MEM (2026) combined factorized video encoding with LLM-managed text summaries to enable 15-minute robot tasks like full kitchen cleaning
- AR-VLA (2026) proposed a hybrid key-value cache with dynamic temporal re-anchoring to resolve the frequency mismatch between fast control and slow perception
- ARROW (2026) achieved 4x less forgetting in continual RL through dual replay buffers with reservoir sampling in world models
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Dual-Stream Memory Banks | Separate fast-decaying perceptual memory from slow-consolidating semantic memory, mimicking the biological distinction between working memory and hippocampal long-term storage. | Fixed-window observation stacking, which discards all history beyond a short horizon and cannot track multi-step task progress. | SAM2Act (2025), MemoryVLA (2025), MEM (2026), KARMA (2024) |
| Geometry-Grounded Spatial Memory | Anchor memory to physical 3D coordinates so that spatial proximity governs storage, retrieval, and fusion, eliminating appearance-based drift. | Implicit neural memories with fixed capacity that lose information from earlier frames and require expensive global optimization for alignment. | 3D Reconstruction with Spatial Memory (2024), Point3R (2025), Memory Forcing (2025), Video World Models with Long-term... (2025) |
| Autoregressive Action Memory with Hybrid Caching | Maintain a rolling action history as a causal sequence with modality-specific caching strategies that respect the different update frequencies of vision and proprioception. | Standard VLA models that treat each observation independently ('Markovian amnesia'), resetting temporal context at every control step. | AR-VLA (2026) |
| Retrieval-Augmented Embodied Memory | Index embodied experience hierarchically and retrieve relevant subsets on demand, extending text-based RAG paradigms to handle spatial, visual, and trajectory data. | Naive approaches that either dump all history into the context window (creating noise and latency) or discard it entirely. | Embodied-RAG (2024), Progressive Retrieval Augmented Generation for... (2024), EmBARDiment (2024) |
| Augmented Experience Replay for World Models | Combine short-term plasticity and long-term stability buffers with reservoir sampling and episode splicing to enable continual learning in world models without growing memory. | Standard experience replay with a single buffer, which either forgets old tasks or becomes prohibitively large. | ARROW (2026) |
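
The reservoir sampling named in the last row keeps a fixed-size, uniformly random sample of everything seen so far, which is how a long-term buffer can stay bounded without favoring recent data. Below is a sketch of the classic algorithm paired with a small FIFO window; `DualReplay` is a hypothetical name and a loose reading of the dual-buffer idea, and episode splicing is not modeled.

```python
import random
from collections import deque

class ReservoirBuffer:
    """Fixed-capacity buffer holding a uniform random sample of the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

class DualReplay:
    """Recent FIFO window (plasticity) plus a long-term reservoir (stability)."""

    def __init__(self, recent_size, long_term_size, seed=0):
        self.recent = deque(maxlen=recent_size)
        self.long_term = ReservoirBuffer(long_term_size, seed)

    def add(self, transition):
        self.recent.append(transition)
        self.long_term.add(transition)

    def sample(self, n_recent, n_old):
        rng = self.long_term.rng
        recent = rng.sample(list(self.recent), min(n_recent, len(self.recent)))
        old = rng.sample(self.long_term.items, min(n_old, len(self.long_term.items)))
        return recent + old
```

Total memory is fixed at `recent_size + long_term_size` no matter how long training runs, while old tasks remain represented in every mixed batch.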
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MemoryBench | Success Rate (%) | 94.3% | SAM2Act (2025) |
| RLBench (18 tasks) | Average Success Rate (%) | 86.8% | SAM2Act (2025) |
| LIBERO | Success Rate (%) | 96.5% | MemoryVLA (2025) |
⚠️ Known Limitations (5)
- Memory consolidation heuristics are hand-designed (e.g., fixed similarity thresholds for merging entries), which may not generalize across task domains or time scales. (affects: Dual-Stream Memory Banks, Perceptual-Cognitive Memory Bank (MemoryVLA))
Potential fix: Learnable consolidation policies that adapt merging thresholds based on task context or prediction error.
- Spatial memory methods assume predominantly static environments; dynamic objects (moving people, shifting furniture) can corrupt the stored 3D representation and produce incorrect retrievals. (affects: Geometry-Grounded Spatial Memory, Explicit Spatial Pointer Memory)
Potential fix: Decoupling static and dynamic scene components (as begun in Video World Models) and maintaining separate update schedules for each.
- Most memory-augmented VLA evaluations are conducted in simulation or controlled lab settings; transfer to unstructured real-world environments with diverse lighting, clutter, and task variability remains underexplored. (affects: Dual-Stream Memory Banks, Autoregressive Action Memory with Hybrid Caching)
Potential fix: Scaling real-world evaluation datasets and incorporating domain randomization during memory system training.
- Retrieval-augmented embodied memory incurs latency during retrieval and may retrieve irrelevant experiences when the index is large or the query is ambiguous. (affects: Retrieval-Augmented Embodied Memory, Semantic Forest Memory)
Potential fix: Adaptive retrieval budgets that scale with task complexity and learned relevance scoring to filter low-quality matches.
- Neural weight-based memory mechanisms (e.g., Titans) consistently underperform cache-based and SSM-based alternatives in world model settings, collapsing under long-horizon imagination. (affects: Memory Encoding Taxonomies)
Potential fix: Hybrid approaches combining neural weight memory for abstract summaries with explicit caches for detailed recall.
📚 View major papers in this topic (9)
- SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation (2025-01) 9
- MEM: Multi-Scale Embodied Memory for Vision Language Action Models (2026-03) 8
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation (2025-08) 8
- AR-VLA: True Autoregressive Action Expert for Vision–Language–Action Models (2026-03) 8
- Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft (2025-10) 8
- Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory (2025-07) 8
- 3D Reconstruction with Spatial Memory (2024-08) 8
- Embodied-RAG: General Non-parametric Representation for Embodied Agents (2024-09) 7
- ARROW: Augmented Replay for RObust World models (2026-03) 7
💡 With memory systems deployed across textual, cognitive, and embodied domains, the Analysis theme systematically evaluates their capabilities and limitations through cognitive benchmarks, safety audits, mechanistic interpretability studies, and hardware profiling to identify where current approaches fall short.
Analysis
What: Research focused on evaluating, benchmarking, and analyzing memory systems in LLM-based agents and neural architectures, spanning cognitive benchmarks, safety audits, mechanistic interpretability, and hardware profiling.
Why: Without rigorous evaluation, memory-augmented agents may appear capable on simple recall tasks while failing on real-world demands like dynamic updates, implicit reasoning, and adversarial robustness—stalling meaningful progress.
Baseline: Standard evaluation relies on static retrieval benchmarks (e.g., needle-in-a-haystack) and single-session QA, which test simple factual lookup but miss complex capabilities like state tracking, temporal reasoning, and cross-session knowledge transfer.
- Bridging the gap between high retrieval recall (~90%) and low generation faithfulness (~60%) in memory-augmented systems
- Evaluating dynamic memory capabilities (updating, forgetting, conflict resolution) rather than static factual recall
- Ensuring memory systems are robust against adversarial manipulation (memory injection, intent legitimation) while maintaining utility
- Reconciling fragmented terminology and inconsistent evaluation protocols across the rapidly growing agent memory field
🧪 Running Example
Baseline: A static retrieval system might recall the user likes Italian food from session 5 and recommend a steakhouse near the old address, failing to track dietary evolution or the location change because it matches based on keyword similarity rather than temporal state.
Challenge: This requires integrating implicit constraints (vegetarian + nut allergy), updating stale facts (old city to new city), and reasoning about preferences that evolved across sessions—none of which is captured by standard factual recall benchmarks.
📈 Overall Progress
Memory evaluation shifted from simple factual recall to demanding cognitive reasoning, dynamic updates, and adversarial robustness—revealing current systems are far less capable than retrieval metrics suggest.
📂 Sub-topics
Memory Benchmark Design
18 papers
Papers creating evaluation frameworks and benchmarks that measure agent memory capabilities beyond simple factual retrieval, including cognitive memory, structural organization, and streaming evaluation.
Survey & Taxonomy Analysis
12 papers
Comprehensive surveys that organize the fragmented agent memory landscape into unified taxonomies, defining memory forms, functions, operations, and evaluation protocols.
Safety & Adversarial Analysis
8 papers
Research evaluating security vulnerabilities in memory systems, including memory injection attacks, intent legitimation through personalization, hidden state poisoning, and prompt interference detection.
Mechanistic & Theoretical Analysis
10 papers
Papers investigating how neural networks internally store, retrieve, and process information, including position bias theory, feed-forward layers as memory, and latent learning mechanisms.
Hardware & Infrastructure Analysis
6 papers
Analysis of physical memory bottlenecks in AI hardware, including the memory wall problem, GPU profiling for LLM inference, DRAM vulnerabilities, and processing-in-memory architectures.
Personalization & User Modeling Analysis
11 papers
Evaluation of how memory systems support personalization, including benchmarks for dynamic user profiling, inter-user difference modeling, and long-term preference tracking.
💡 Key Insights
💡 Agents with near-perfect static memory scores fail catastrophically on tasks requiring active memory-guided decisions across sessions.
💡 The 'lost in the middle' retrieval bias is a geometric property of causal attention present at initialization, not a learned artifact.
💡 Memory persistence creates novel attack surfaces where dormant injections bypass all traditional input-level safety defenses.
💡 Retrieval recall exceeding 90% masks generation faithfulness dropping to ~60%, creating a dangerous illusion of capability.
💡 Frontier models achieve only ~50% on dynamic personalization, succeeding at static facts but failing on evolving user states.
💡 Hardware memory bandwidth—not compute—is the primary LLM inference bottleneck, with a widening scaling gap per generation.
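The gap between retrieval recall and generation faithfulness is easier to see when the two are computed side by side. The definitions below are simplified assumptions for illustration (set overlap on memory IDs, and the fraction of answer claims that the evidence supports); the benchmarks discussed here may define both metrics differently, and the data is a toy example.

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of gold memory entries that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        return 1.0
    return len(gold & set(retrieved_ids)) / len(gold)

def faithfulness(answer_claims, supported_claims):
    """Fraction of claims in the generated answer that the evidence supports."""
    if not answer_claims:
        return 1.0
    supported = set(supported_claims)
    return sum(1 for c in answer_claims if c in supported) / len(answer_claims)

# Toy case: every gold memory is retrieved, yet the answer still asserts a stale fact.
recall = retrieval_recall(["m1", "m2", "m3"], ["m1", "m2"])
faith = faithfulness(
    ["vegetarian", "nut allergy", "lives in old city"],
    ["vegetarian", "nut allergy", "lives in new city"],
)
```

Here recall is perfect while one of three claims is unsupported, so a system scored on retrieval alone would look fully capable despite giving a partly wrong answer.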
📅 Timeline
The field evolved from foundational mechanism analysis (2021-2023) through the first long-term benchmarks and taxonomy proposals (2024-2025) to sophisticated cognitive evaluations and safety audits (2025-2026) that expose the critical gap between retrieval recall and genuine understanding.
- Geva et al. (FFN-as-KV, 2021) demonstrated that transformer feed-forward layers function as key-value memory stores with compositional retrieval across layers
- Kim et al. (RowHammer, 2023) provided a comprehensive retrospective on DRAM read disturbance, showing >80% of commodity modules are vulnerable with worsening trends over a decade
- MemGPT (2023) pioneered virtual context management treating the context window as RAM, achieving +60.4% accuracy on deep memory retrieval
- Talk2Drive (2023) demonstrated the first LLM-based personalization system controlling a real autonomous vehicle, reducing driver takeover by 75.9%
- LoCoMo (2024) created the first very long-term conversational memory benchmark (300+ turns), revealing models lag behind humans by 56-73% on memory tasks
- MemWall (2024) quantified the fundamental 20-year divergence between compute (3.0×/2yr) and bandwidth (1.6×/2yr) scaling, establishing memory as the primary AI bottleneck
- Zhang et al. (AgentMemSurvey, 2024) proposed a unified taxonomy for agent memory organized by sources, forms, and operations
- MemSim (2024) introduced Bayesian-causal data synthesis for reliable memory evaluation with >99% ground truth correctness
- Agent S (2024) introduced experience-augmented hierarchical planning with narrative and episodic memory, achieving 83.6% relative improvement on OSWorld
- CM-MI (2025) demonstrated >80% attack success via memory injection in DeFi agents, showing traditional prompt defenses fail against persistent memory attacks
- PersonaMem (2025) showed frontier models achieve only ~50% on dynamic personalization despite strong static recall (60-70%)
- OpTaxonomy (2025) defined six atomic memory operations and identified KV cache optimization as a rapidly emerging research hotspot via Relative Citation Index analysis
- MemQuadruple (2025) proposed a unified four-part memory definition linking mechanism, evaluation, and governance
- GenAgents1K (2025) validated agent memory at scale with 1,000 agents achieving 0.85 correlation with human survey responses
- MemoryArena (2026) showed that agents with saturated static memory scores fail on interdependent multi-session tasks requiring active memory-guided decisions
- LoCoMo-Plus (2026) showed cognitive memory collapses across all models when implicit constraints lack lexical overlap with queries
- LitM-Birth (2026) mathematically proved the U-shaped position bias exists at initialization before any training, achieving 0.99 Spearman correlation with empirical data
- AMA-Bench (2026) demonstrated existing memory systems significantly underperform long-context baselines in agentic scenarios due to lossy compression
- Forms-Functions (2026) unified the field by clearly distinguishing agent memory from RAG and context engineering
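The scale of the compute/bandwidth divergence quoted in the MemWall entry follows from compounding the two rates. The short calculation below assumes both rates hold uniformly over the full 20 years, which is a simplification of the reported trend; `growth` is a hypothetical helper.

```python
def growth(rate_per_period, years, period_years=2):
    """Total growth factor after compounding `rate_per_period` every `period_years`."""
    return rate_per_period ** (years / period_years)

compute = growth(3.0, 20)     # 3.0x every 2 years -> 3.0 ** 10, about 59,049x
bandwidth = growth(1.6, 20)   # 1.6x every 2 years -> 1.6 ** 10, about 110x
gap = compute / bandwidth     # about 537x more compute growth than bandwidth growth
```

Under these assumptions, a workload that was balanced between compute and memory traffic 20 years ago would now be bandwidth-bound by a factor of several hundred.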
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Cognitive-Science-Grounded Benchmarking | Memory evaluation should test cognitive capabilities—inference, constraint adherence, temporal reasoning—rather than just factual retrieval accuracy. | Static needle-in-a-haystack and single-turn QA benchmarks that conflate retrieval with understanding | Evaluating Very Long-Term Conversational Memory... (2024), LoCoMo-Plus (2026), MemBench (2025), Evaluating Memory in LLM Agents... (2025) |
| Streaming Evaluation with Interdependent Tasks | True memory capability is demonstrated not by recall accuracy but by using past information to improve future task completion in evolving environments. | Static benchmarks that evaluate memorization and action in isolation | Benchmarking Agent Memory in Interdependent... (2026), AMA-Bench (2026), Evo-Memory (2025) |
| Unified Memory Taxonomies | Agent memory must be analyzed along multiple orthogonal dimensions—form, function, and lifecycle operations—to enable meaningful comparison across systems. | Ad hoc, application-specific descriptions of memory that conflate RAG, context engineering, and true agent memory | Memory in the Age of... (2026), Rethinking Memory in LLM based... (2025), Memory for Autonomous LLM Agents:... (2026), Memory in Large Language Models:... (2025) |
| Adversarial Memory Safety Testing | Memory persistence creates a new attack surface where adversaries can plant long-lived malicious context that evades input-level safety filters. | Traditional prompt injection defenses (spotlighting, delimiting) that only protect the immediate input, not persistent memory | Real AI Agents with Fake... (2025), When Personalization Legitimizes Risks: Uncovering... (2026), Arbiter (2026) |
| Mechanistic Interpretability of Memory | Understanding how neural networks physically implement memory reveals fundamental architectural constraints that training alone cannot overcome. | Treating neural networks as black boxes and attributing memory failures to insufficient training data or model size | Transformer Feed-Forward Layers Are Key-Value... (2021), Lost in the Middle at... (2026), Implicit Statistical Inference in Transformers:... (2026) |
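
The "Streaming Evaluation with Interdependent Tasks" row can be illustrated with a toy harness. This is a minimal sketch under stated assumptions, not the protocol of MemoryArena, AMA-Bench, or Evo-Memory: the `Session` structure, `run_stream`, `lookup_agent`, and the example facts are all hypothetical names introduced here. Each session reveals facts and may pose a task that depends on a fact from an earlier session, so the score is a task completion rate rather than recall accuracy.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Session:
    """One step of a task stream (hypothetical structure, for illustration only)."""
    reveals: dict = field(default_factory=dict)  # facts observed during this session
    task_key: Optional[str] = None               # fact the session's task depends on
    expected: Optional[str] = None               # value that counts as success


def run_stream(sessions, act: Callable[[dict, str], Optional[str]], keep_memory: bool) -> float:
    """Return the task completion rate over an ordered stream of sessions."""
    memory: dict = {}
    solved = attempted = 0
    for s in sessions:
        if s.task_key is not None:
            attempted += 1
            # The task can only be solved with information from earlier sessions.
            if act(memory, s.task_key) == s.expected:
                solved += 1
        if keep_memory:
            memory.update(s.reveals)  # persist observations for later sessions
    return solved / attempted if attempted else 0.0


def lookup_agent(memory: dict, key: str) -> Optional[str]:
    """Toy agent: answers from memory, or gives up if the fact is missing."""
    return memory.get(key)


# Assumed example stream: later tasks depend on facts revealed earlier.
STREAM = [
    Session(reveals={"staging_db": "replica-2"}),
    Session(reveals={"deploy_region": "eu-west"}, task_key="staging_db", expected="replica-2"),
    Session(task_key="deploy_region", expected="eu-west"),
]

with_memory = run_stream(STREAM, lookup_agent, keep_memory=True)
without_memory = run_stream(STREAM, lookup_agent, keep_memory=False)
print(f"completion with memory: {with_memory:.2f}, without: {without_memory:.2f}")
```

The same agent is scored twice so that any difference in completion rate comes from memory persistence alone, which is the property this style of evaluation tries to isolate.
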
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| MemoryArena | Task Completion Rate | ~80% | Benchmarking Agent Memory in Interdependent... (2026) |
| LoCoMo | Memory QA Accuracy (relative to human) | ~44% of human performance | Evaluating Very Long-Term Conversational Memory... (2024) |
| PersonaMem | Multiple-choice Personalization Accuracy | ~50% | Know Me, Respond to Me:... (2025) |
⚠️ Known Limitations (5)
- Current benchmarks overwhelmingly focus on English text-based interactions, lacking coverage of multilingual, multi-modal, and non-textual memory scenarios, limiting the generalizability of findings to diverse real-world deployments. (affects: Cognitive-Science-Grounded Benchmarking, Streaming Evaluation with Interdependent Tasks)
Potential fix: Expanding benchmark coverage to multilingual interactions and multi-modal memory, as demonstrated by LoCoMo's image-sharing capabilities and XPersona's cross-lingual efforts
- Automated evaluation judges suffer from position bias, order bias, and self-preference bias, causing 'spurious significance' that may invalidate memory evaluation results and lead to overconfident conclusions. (affects: Unified Memory Taxonomies, Personalization Memory Evaluation)
Potential fix: Using constraint-based evaluation that checks behavioral boundaries rather than matching reference answers, combined with human-in-the-loop validation as proposed in the memory quadruple framework
- Most memory benchmarks use synthetic or controlled scenarios that lack the noise, ambiguity, and scale of real-world deployed agent interactions, potentially overstating system capabilities. (affects: Cognitive-Science-Grounded Benchmarking, Personalization Memory Evaluation)
Potential fix: AMA-Bench's approach of combining expert-annotated real-world agent logs with scalable synthetic environments provides a template for bridging this realism gap
- Memory safety evaluations are category-specific, testing aligned memory-query pairs (e.g., financial memories + financial crimes), making it unclear how attacks generalize across diverse domains and memory architectures. (affects: Adversarial Memory Safety Testing)
Potential fix: Developing standardized cross-domain adversarial memory test suites covering diverse memory architectures, attack vectors, and deployment contexts
- Multiple competing taxonomies (3D, operational, quadruple, forms-functions-dynamics) have been proposed without convergence on shared vocabulary, potentially perpetuating the fragmentation they aim to resolve. (affects: Unified Memory Taxonomies)
Potential fix: Community adoption of a minimal shared ontology defining core terms (episodic, semantic, procedural) with extensible dimensions for domain-specific applications
📚 View major papers in this topic (10)
- Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks (2026-02) 9
- Memory in the Age of AI Agents: A Survey (2025-12) 9
- Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias (2026-03) 9
- Real AI Agents with Fake Memories: Fatal Context Manipulation Attacks on Web3 Agents (2025-03) 9
- MemGPT: Towards LLMs as Operating Systems (2023-10) 9
- Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics (2025-05) 9
- AI and Memory Wall (2024-03) 8
- Evaluating Very Long-Term Conversational Memory of LLM Agents (2024-02) 8
- Transformer Feed-Forward Layers Are Key-Value Memories (2021-11) 8
- When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents (2026-01) 8
💡 Analysis reveals the gaps and failure modes of current memory systems, which in turn motivates the creation of standardized Benchmarks—new datasets, evaluation frameworks, and metrics specifically designed to measure dynamic memory capabilities like temporal reasoning, preference tracking, and multi-session knowledge transfer.
Benchmark
What: This topic covers papers that introduce new benchmark datasets, evaluation frameworks, and metrics specifically designed to measure the memory capabilities of LLM-based agents and personalized assistants.
Why: Without rigorous, standardized benchmarks, it is impossible to meaningfully compare memory systems or identify which capabilities (e.g., temporal reasoning, preference tracking, memory updates) remain unsolved, slowing progress in building truly persistent AI agents.
Baseline: The conventional approach evaluates memory using static, single-session question-answering tasks or simple needle-in-a-haystack retrieval tests, which fail to capture dynamic memory operations like updates, forgetting, and multi-session reasoning.
- Designing benchmarks that test dynamic memory operations (updates, overwrites, forgetting) rather than just static retrieval
- Creating realistic long-horizon evaluation scenarios that reflect how users and agents interact over weeks or months
- Distinguishing genuine memory capability from superficial pattern matching or recency bias in long contexts
- Building evaluation metrics that capture implicit reasoning (e.g., inferring intent from preferences) beyond surface-level factual recall
🧪 Running Example
Baseline: A static retrieval system might return the older 'vegetarian' preference because it was mentioned more frequently, recommending a margherita pizza while ignoring the recent dietary change to pescatarian.
Challenge: The assistant must (1) retrieve relevant dietary preferences from hundreds of past sessions, (2) recognize the temporal ordering to prioritize the most recent preference, and (3) apply this updated preference to generate a contextually appropriate restaurant recommendation.
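
The contrast between the baseline and the challenge can be made concrete with a small sketch. The `PreferenceMention` record, the session indices, and both resolver functions are illustrative assumptions rather than part of any benchmark described here. The frequency-based resolver mimics the static baseline, while the recency-based resolver applies the temporal ordering that the challenge calls for.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceMention:
    """A dietary preference stated in some past session (hypothetical record)."""
    session: int   # larger index = more recent session
    value: str


def resolve_by_frequency(mentions):
    """Static baseline: pick the preference mentioned most often."""
    counts = Counter(m.value for m in mentions)
    return counts.most_common(1)[0][0]


def resolve_by_recency(mentions):
    """Temporal resolution: pick the preference from the latest session."""
    return max(mentions, key=lambda m: m.session).value


# Assumed history: an old preference repeated often, then one recent change.
HISTORY = [
    PreferenceMention(session=3, value="vegetarian"),
    PreferenceMention(session=41, value="vegetarian"),
    PreferenceMention(session=97, value="vegetarian"),
    PreferenceMention(session=212, value="pescatarian"),
]

print("frequency-based:", resolve_by_frequency(HISTORY))
print("recency-based:", resolve_by_recency(HISTORY))
```

A real system would also need to decide whether a new statement updates an old preference or merely adds to it; this sketch treats every later mention as an overwrite.
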
📈 Overall Progress
Memory benchmarks evolved from static single-session recall tests to dynamic, multi-session evaluations that test temporal reasoning, preference evolution, agentic task completion, and cognitive constraint adherence.
📂 Sub-topics
Conversational Memory Benchmarks
8 papers
Benchmarks evaluating long-term memory in multi-session dialogue settings, testing retrieval, temporal reasoning, and consistency over extended conversation histories.
Personalization & User Profiling Benchmarks
10 papers
Benchmarks measuring how well LLMs track, internalize, and apply individual user preferences and evolving personas across interactions.
Agentic & Task-Based Memory Benchmarks
8 papers
Benchmarks that evaluate memory in autonomous agent settings where agents must accumulate experience across sequential tasks and use it to guide future decisions.
Structural & Cognitive Memory Evaluation
5 papers
Benchmarks testing whether agents can organize knowledge into necessary hierarchies, track mutable states, perform memory rewrites, and handle composite reasoning operations.
Safety & Adversarial Memory Benchmarks
2 papers
Benchmarks evaluating security vulnerabilities and safety risks that arise when agents rely on persistent memory, including memory injection attacks and intent legitimation.
Surveys & Unified Evaluation Frameworks
6 papers
Survey papers and meta-analyses that propose unified taxonomies, evaluation protocols, and conceptual frameworks for understanding and assessing memory in AI systems.
💡 Key Insights
💡 Frontier models achieve only ~50% accuracy on evolving persona tracking, barely above random chance on challenging distractors.
💡 Static benchmark saturation does not transfer: agents excelling at factual recall fail on action-dependent memory tasks.
💡 Cognitive memory collapses when implicit constraints have no lexical overlap with trigger queries.
💡 Persistent memory creates novel attack surfaces, increasing safety violations by up to 243% through intent legitimation.
💡 Temporal reasoning remains the weakest memory capability, with models lagging humans by 73% on causal dynamics.
💡 Automated benchmark generation via Bayesian-causal synthesis achieves >99% correctness while maintaining diversity.
📅 Timeline
The field progressed through three waves: foundational personalization benchmarks (2023-2024), sophisticated long-context conversational and preference evaluation (2024-2025), and agentic task-stream benchmarks testing active memory usage in autonomous decision-making (2025-2026). Each wave revealed that state-of-the-art models dramatically underperform on increasingly realistic memory scenarios.
- (LaMP, 2023) established the first comprehensive personalization benchmark with 7 diverse tasks, demonstrating +23.5% improvement from retrieval-augmented personalization over generic baselines
- (Personalized Dialogue Survey, 2024) cataloged 22 datasets and identified PersonaChat as the dominant but limited benchmark, highlighting severe multilingual data scarcity
- (LoCoMo, 2024) introduced 300+ turn dialogues grounded in Temporal Event Graphs, revealing that long-context LLMs lag behind humans by 56-73% on memory tasks
- (PerLTQA, 2024) unified semantic and episodic memory in a three-stage evaluation framework with 141 synthetic characters
- (LongMemEval, 2024) defined five core memory abilities and showed commercial systems suffer 30-60% accuracy drops vs. oracle retrieval
- (MemSim, 2024) introduced Bayesian-causal data synthesis achieving >99% ground truth correctness for automated benchmark generation
- (AI Persona, 2024) proposed dynamic learnable user dictionaries and PersonaBench for life-long personalization evaluation
- (PrefEval, 2025) showed preference following drops below 10% in zero-shot settings across 3,000 manually curated preference-query pairs
- (Memory Framework, 2025) decomposed memory into atomic capabilities, showing GPT-4o drops to ~0.45 accuracy on composite Theory of Mind tasks
- (PersonaMem, 2025) demonstrated frontier models achieve only ~50% accuracy on evolving persona tracking with up to 1M token histories
- (CrAIBench, 2025) exposed memory injection attacks achieving >80% success on frontier models in DeFi tasks
- (ETAPP, 2025) introduced proactivity as a core metric for evaluating personalized tool-augmented agents
- (Evo-Memory, 2025) introduced streaming evaluation for test-time learning, with the ReMem agent achieving 0.92 success rate on navigation benchmarks
- (Memory in LLMs, 2025; Agent Memory Survey, 2025) proposed unified taxonomies distinguishing agent memory from RAG and context engineering
- (ATOD, 2026) introduced dependency-aware goal completion metrics and dual-store evaluation for multi-goal dialogue agents, achieving 25-30% higher accuracy than LLM judges
- (PS-Bench, 2026) identified intent legitimation as a novel safety failure, showing personalization increases attack success by up to 243.7%
- (RPEval, 2026) revealed an 'inverse scaling' effect where more capable models are worse at ignoring irrelevant preferences, and showed that pragmatic reasoning yields a ~35% improvement
- (MemoryArena, 2026) shifted evaluation to interdependent multi-session tasks, revealing that agents saturating static benchmarks fail on action-dependent memory
- (AMA-Bench, 2026) addressed agent-specific memory challenges, with its causality-graph-based AMA-Agent outperforming memory baselines by 11.16%
- (StructMemEval, 2026) exposed that modern LLMs cannot spontaneously organize knowledge into required hierarchical structures
- (LoCoMo-Plus, 2026) revealed cognitive memory collapse when testing implicit constraints with semantic disconnect from surface queries
- (LifeSim, 2026) modeled users as BDI cognitive agents, showing GPT-5 drops 27.3 points from explicit to implicit intent recognition
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Long-Context Conversational Memory Evaluation | Embed answer-critical evidence within realistic, extended multi-session dialogue histories and measure recall, temporal reasoning, and update capabilities. | Static single-session QA benchmarks and simple needle-in-a-haystack retrieval tests | Evaluating Very Long-Term Conversational Memory... (2024), LongMemEval (2024), LoCoMo-Plus (2026), PerLTQA (2024) |
| Dynamic User Profile Benchmarking | Evaluate personalization by testing whether models prioritize the most recent user state over outdated historical information when both exist in context. | Static user profile datasets like PersonaChat where personas never change | LaMP (2023), Know Me, Respond to Me:... (2025), Do LLMs Recognize Your Preferences?... (2025), LifeSim (2026) |
| Agentic Task-Stream Memory Evaluation | Shift evaluation from passive recall accuracy to active task completion rate, where success depends on correctly leveraging memories from prior sessions. | Dialogue-centric memory benchmarks that ignore machine-generated action logs and environment interactions | Benchmarking Agent Memory in Interdependent... (2026), AMA-Bench (2026), Evo-Memory (2025) |
| Structural & Cognitive Memory Testing | Decompose memory into atomic capabilities (search, edit, state tracking, forgetting) and test each independently before combining them into composite evaluations. | Benchmarks that test only unstructured retrieval, which can be solved by simple similarity search without genuine memory organization | Evaluating Memory Structure in LLM... (2026), How Effectively Can AI Assistants... (2025), Memory Retention Is Not Enough... (2026) |
| Automated Benchmark Generation | Separate structured truth generation from text generation to prevent hallucination in benchmark datasets while maintaining diversity and scalability. | Manually curated benchmarks that are static, expensive to create, and susceptible to contamination | MemSim (2024), LifeSim (2026), How Effectively Can AI Assistants... (2025) |
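
The "Automated Benchmark Generation" row separates structured truth from text, which can be sketched as follows. This is a simplified illustration, not MemSim's Bayesian-causal synthesis: the `Update` record, the `DIETS` and `TEMPLATES` lists, and `make_item` are hypothetical, and fixed templates stand in for whatever text generator a real pipeline would use. The gold answer is read from the sampled structure, never from the rendered text, and a seed makes each instance reproducible while different seeds give different instances.

```python
import random
from dataclasses import dataclass

# Assumed vocabulary and surface templates, for illustration only.
DIETS = ["vegetarian", "pescatarian", "vegan", "omnivore"]
TEMPLATES = [
    "Just so you know, my diet is {value} now.",
    "Quick update: I have switched to a {value} diet.",
]


@dataclass(frozen=True)
class Update:
    """A structured ground-truth event: the user's diet as of a given session."""
    session: int
    value: str


def sample_truth(rng: random.Random, n_updates: int = 3):
    """Step 1: sample the structured timeline (this is the ground truth)."""
    sessions = sorted(rng.sample(range(1, 200), n_updates))
    values = rng.sample(DIETS, n_updates)  # distinct values, so every update is a change
    return [Update(s, v) for s, v in zip(sessions, values)]


def render(rng: random.Random, truth):
    """Step 2: turn each structured event into dialogue text."""
    return [
        f"[session {u.session}] " + rng.choice(TEMPLATES).format(value=u.value)
        for u in truth
    ]


def make_item(seed: int) -> dict:
    """Build one benchmark item; the gold label comes from the structure."""
    rng = random.Random(seed)
    truth = sample_truth(rng)
    return {
        "dialogue": render(rng, truth),
        "question": "What is the user's current diet?",
        "gold": truth[-1].value,
    }


item = make_item(seed=7)
print("\n".join(item["dialogue"]))
print(item["question"], "->", item["gold"])
```

Because the label is derived before any text exists, a rendering error can make a dialogue awkward but cannot silently change the correct answer.
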
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LongMemEval | QA Accuracy | +5.4% QA accuracy, +9.4% recall | LongMemEval (2024) |
| LoCoMo | Accuracy (human-relative) | 44% of human performance on memory QA | Evaluating Very Long-Term Conversational Memory... (2024) |
| AMA-Bench | Average Accuracy | 57.22% | AMA-Bench (2026) |
⚠️ Known Limitations (5)
- Most benchmarks rely on synthetic data that may not capture the full complexity and messiness of real-world user interactions, limiting ecological validity of results. (affects: PersonaMem, MemSim, LifeSim, MemoryArena)
Potential fix: Hybrid pipelines combining LLM generation with human annotation and grounding in real behavioral data (as attempted by LoCoMo and LifeSim) can improve realism.
- Evaluation metrics often reduce complex memory behaviors to single accuracy scores, missing nuanced failure modes like partial recall, outdated information retrieval, or correct reasoning with wrong evidence. (affects: LongMemEval, LoCoMo, LaMP)
Potential fix: Constraint-consistency evaluation (as proposed by LoCoMo-Plus) and decomposed atomic capability testing (as in the programmable memory framework) offer more diagnostic alternatives.
- Benchmark contamination risk is high since many test scenarios can be memorized during pre-training, making it unclear whether models genuinely reason or simply recall training data. (affects: LaMP, PrefEval, PerLTQA)
Potential fix: Parametric, randomized test generation (as in the programmable memory framework) prevents overfitting by producing unique instances for each evaluation run.
- Most benchmarks evaluate memory in isolation from the full agent loop, testing retrieval accuracy rather than downstream task performance, which can overestimate practical utility. (affects: LongMemEval, PerLTQA, StructMemEval)
Potential fix: MemoryArena and AMA-Bench address this by evaluating memory through end-to-end task completion in interactive environments.
- Cross-benchmark comparison is difficult due to inconsistent terminology, different memory type definitions, and varying evaluation protocols across papers. (affects: All benchmark methods)
Potential fix: Unified taxonomies like the Memory Quadruple framework and Forms-Functions-Dynamics classification aim to standardize definitions and enable fair cross-benchmark comparison.
📚 View major papers in this topic (10)
- LaMP: When Large Language Models Meet Personalization (2023-04) 9
- Evaluating Very Long-Term Conversational Memory of LLM Agents (2024-02) 8
- LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory (2024-10) 8
- Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks (2026-02) 9
- Memory in Large Language Models: Mechanisms, Evaluation and Evolution (2025-09) 9
- Memory in the Age of AI Agents: A Survey (2025-12) 9
- Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs (2025-02) 8
- LoCoMo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents (2026-02) 8
- How Effectively Can AI Assistants Utilize Their Memory? A Framework for Extensive Evaluation (2025-03) 8
- AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications (2026-02) 8
💡 Benchmarks quantify memory capabilities in controlled settings, but the real test comes in Application, where memory techniques—persistent storage, adaptive retrieval, and experience reuse—are deployed in demanding real-world domains from autonomous driving to code synthesis and multi-agent coordination.
Application
What: This topic covers papers that apply memory techniques—persistent storage, adaptive retrieval, caching, and experience reuse—to specific domains such as autonomous driving, travel planning, code synthesis, mathematical discovery, and multi-agent coordination.
Why: As LLM-based agents move from general chatbots to domain-specific autonomous systems, memory becomes the critical enabler for personalization, multi-session consistency, and learning from experience without retraining.
Baseline: The conventional approach uses a fixed-context LLM that treats each interaction independently, relying on in-context examples or fine-tuning rather than dynamic, persistent memory across sessions.
- Bridging the gap between generic memory mechanisms and domain-specific requirements (e.g., causal constraints in agent workflows, real-time driving decisions)
- Scaling memory systems to handle long-horizon, machine-generated interaction logs rather than short human dialogues
- Ensuring memory security against adversarial manipulation while maintaining retrieval effectiveness
- Balancing memory overhead with inference efficiency on resource-constrained devices
🧪 Running Example
Baseline: A standard LLM-based driving system would either ignore the abstract command entirely or interpret it literally each time without remembering that this user prefers aggressive acceleration and highway routes when expressing urgency, leading to repeated unsatisfactory experiences.
Challenge: The system must (1) interpret an abstract verbal command as concrete driving parameters, (2) remember past interactions and user feedback to personalize future responses, and (3) adapt to changing conditions while respecting learned preferences—all in a safety-critical real-time domain.
📈 Overall Progress
Memory in LLM applications evolved from simple interaction logging to adaptive, value-driven retrieval systems that learn what to remember from environmental feedback.
📂 Sub-topics
Personalized Domain-Specific Agents
4 papers
Papers applying memory to build agents personalized to specific domains including autonomous driving, travel planning, and survey editing, where memory enables adaptation to individual user preferences over time.
Autonomous Agent Memory Systems
6 papers
Papers developing memory architectures for autonomous agents tackling complex multi-step tasks in code synthesis, mathematical discovery, and multi-agent coordination.
Efficient Memory Access and Caching
3 papers
Papers optimizing memory access patterns and caching strategies for efficient inference, including cross-layer index reuse, predictive caching for mobile devices, and processing-in-memory hardware.
Memory Security and Theoretical Foundations
4 papers
Papers addressing adversarial vulnerabilities of memory-augmented agents and theoretical frameworks for understanding memory in neural and biological systems.
💡 Key Insights
💡 Memory-augmented agents dramatically outperform stateless baselines in domain-specific tasks, with gains of 65–75% in user satisfaction.
💡 Learned retrieval utility (Q-values) vastly outperforms semantic similarity for selecting relevant memories in agentic workflows.
💡 Existing memory systems designed for dialogue fail on autonomous agent tasks due to machine-generated, symbol-heavy interaction logs.
💡 Memory systems introduce new attack surfaces: embedding-space poisoning can hijack agent behavior with under 0.1% data contamination.
💡 Biological hippocampal memory and Transformer self-attention have been formalized as mathematically equivalent, suggesting principled paths for future memory design.
💡 Proactive cache population during idle time dramatically improves hit rates for mobile and resource-constrained deployments.
📅 Timeline
Early work (2023–2024) demonstrated memory's value in specific domains like driving and travel. By 2025, scalable architectures (Mem0, PerCache) and standardized protocols (MCP) emerged. In 2026, the field shifted toward adaptive memory with learned retrieval policies (Q-Memory) and causality-aware storage (AMA-Agent), while benchmarks revealed that existing memory systems still fall far short on autonomous agent tasks.
- (Talk2Drive, 2023) demonstrated the first LLM-based personalization system on a real autonomous vehicle, using memory of past interactions to reduce driver takeover rates by 75.9%
- (BB-LDPC, 2023) achieved a 10x reduction in quantum memory overhead by encoding 12 logical qubits in only 288 physical qubits
- (AgentPoison, 2024) revealed that RAG-based agent memory can be hijacked through embedding-space poisoning, achieving 80%+ attack success across driving, QA, and healthcare agents
- (TravelAgent, 2024) introduced modular agent architecture with dedicated memory for constraint-aware travel planning, achieving 90% rationality vs. 50% for GPT-4
- (PIM-Opt, 2024) demonstrated that processing-in-memory hardware achieves 3.19x speedup over GPU for ML training by minimizing data movement
- (Mem0, 2025) introduced dynamic memory management with graph enhancements, achieving a 26% relative improvement over the OpenAI baseline on LoCoMo (LLM-as-Judge score) while reducing latency by 91%
- (PerCache, 2025) pioneered predictive hierarchical caching for mobile RAG, reducing latency by 34.4% through proactive cache population
- (KV-Brain, 2025) formalized the mathematical equivalence between biological hippocampal memory and Transformer self-attention
- (MCP, 2025) proposed a standardized protocol for shared context across multi-agent systems, acting as a universal connector for AI memory
- (EvoKernel, 2026) introduced Q-value-driven memory retrieval for NPU kernel synthesis, boosting correctness from 11% to 83% without fine-tuning
- (AMA-Bench, 2026) established the first benchmark for long-horizon agent memory, revealing that existing memory systems significantly underperform on agentic tasks
- (NeurosymCollab, 2026) used progressive disclosure persistent memory to enable multi-session mathematical discovery, proving new combinatorial bounds
- (IndexCache, 2026) achieved 1.82x prefill speedup by reusing token selection indices across transformer layers
- (BAO, 2026) formulated proactive agent training as multi-objective optimization with retrospective memory and prospective planning
🔬 Key Methods
| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Persistent Agent Memory | Store interaction triples (input, action, feedback) persistently and retrieve them to personalize future agent behavior without updating model weights. | Stateless LLM inference that treats each session independently, losing all context between interactions. | Personalized Autonomous Driving with Large... (2023), Mem0 (2025), Agentic Neurosymbolic Collaboration for Mathematical... (2026) |
| Value-Driven Memory Retrieval | Learn Q-values for memory items so the agent retrieves memories based on predicted utility rather than surface-level semantic similarity. | Semantic similarity-based retrieval (e.g., cosine similarity over embeddings) that ignores task-specific utility. | Towards Cold-Start Drafting and Continual... (2026) |
| Causality-Aware Agent Memory | Replace similarity-based memory storage with a causality graph that preserves state transitions, enabling retrieval of causally relevant rather than textually similar experiences. | Vector-based RAG and semantic similarity retrieval that lose causal structure when compressing agent interaction logs. | AMA-Bench (2026) |
| Cross-Layer Index Reuse | Important tokens remain stable across adjacent transformer layers, so token selection indices can be shared, eliminating 75% of indexer computations. | Standard sparse attention (e.g., DeepSeek Sparse Attention) where every layer independently runs a quadratic-cost token indexer. | IndexCache (2026) |
| Predictive Hierarchical Caching | Proactively predict and cache future queries at multiple levels of the RAG pipeline during device idle time, rather than reactively caching after queries arrive. | Reactive single-level caches (KV cache or semantic cache) that achieve low hit rates under sparse mobile query patterns. | PerCache (2025) |
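
The "Value-Driven Memory Retrieval" row can be sketched as a two-stage lookup. The code below is an assumption-laden illustration, not the method of the cited paper: `MemoryItem`, `retrieve_by_value`, `update_q`, the candidate `pool` size, the learning rate `alpha`, and the toy 2-D embeddings are all hypothetical. Similarity is used only to shortlist candidates; the final ranking uses a learned utility estimate (a Q-value) that is nudged toward the observed reward after each use.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    """A stored experience with an embedding and a learned utility estimate."""
    text: str
    embedding: tuple
    q: float = 0.0  # estimated usefulness, updated from task feedback


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_by_similarity(store, query, k=1):
    """Baseline: rank purely by cosine similarity to the query."""
    return sorted(store, key=lambda m: cosine(m.embedding, query), reverse=True)[:k]


def retrieve_by_value(store, query, k=1, pool=2):
    """Shortlist `pool` items by similarity, then rank the shortlist by Q-value."""
    candidates = retrieve_by_similarity(store, query, k=pool)
    return sorted(candidates, key=lambda m: m.q, reverse=True)[:k]


def update_q(item: MemoryItem, reward: float, alpha: float = 0.5) -> None:
    """Incremental update: move the estimate toward the observed reward."""
    item.q += alpha * (reward - item.q)


# Toy store with made-up embeddings: the closest item is not the most useful one.
store = [
    MemoryItem("tiling draft that failed to compile", (1.0, 0.0)),
    MemoryItem("double-buffered copy that passed tests", (0.8, 0.6)),
    MemoryItem("unrelated note", (0.0, 1.0)),
]
query = (1.0, 0.1)

update_q(store[0], reward=0.0)  # using this memory led to failure
update_q(store[1], reward=1.0)  # using this memory led to success

print("by similarity:", retrieve_by_similarity(store, query)[0].text)
print("by value:     ", retrieve_by_value(store, query)[0].text)
```

Keeping a similarity shortlist means an item with a high Q-value from an unrelated task cannot crowd out relevant memories; how large that shortlist should be is a tuning choice this sketch does not address.
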
📊 Benchmark Results
| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| AMA-Bench | Accuracy (%) | 72.26% | AMA-Bench (2026) |
| KernelBench (Ascend C NPU) | Pass@k (%) | 83.0% | Towards Cold-Start Drafting and Continual... (2026) |
| LoCoMo | LLM-as-Judge Score | 26% relative improvement | Mem0 (2025) |
⚠️ Known Limitations (4)
- Memory security is largely unaddressed: agents relying on external memory or RAG are vulnerable to poisoning attacks that manipulate behavior without detection, posing serious risks in safety-critical domains like autonomous driving and healthcare. (affects: Persistent Agent Memory, Modular Agent Architecture with Memory)
Potential fix: Embedding-space anomaly detection, provenance tracking for memory items, and adversarial training of retrieval models.
- Existing memory systems lose causal structure when compressing agent interaction logs, leading to retrieval of semantically similar but causally irrelevant experiences that mislead agent decision-making. (affects: Persistent Agent Memory, Modular Agent Architecture with Memory)
Potential fix: Causality graphs (as in AMA-Agent) and structured memory schemas that preserve state transitions and dependencies.
- Domain-specific memory applications are validated on narrow benchmarks and small-scale deployments, making it unclear how well they generalize across domains or scale to millions of users. (affects: Persistent Agent Memory, Value-Driven Memory Retrieval, Predictive Hierarchical Caching)
Potential fix: Cross-domain transfer studies, standardized evaluation frameworks like AMA-Bench, and large-scale deployment experiments.
- Memory overhead grows with interaction history, creating tension between comprehensive memory and inference efficiency, especially on mobile and edge devices where compute and storage are severely constrained. (affects: Persistent Agent Memory, Predictive Hierarchical Caching, Cross-Layer Index Reuse)
Potential fix: Adaptive memory pruning, hierarchical storage tiers, and resource-aware scheduling that dynamically manages memory capacity.
📚 View major papers in this topic (9)
- Agentic Neurosymbolic Collaboration for Mathematical Discovery: A Case Study in Combinatorial Design (2026-03) 9
- Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis (2026-03) 9
- High-threshold and low-overhead fault-tolerant quantum memory (2023-08) 9
- Personalized Autonomous Driving with Large Language Models: Field Experiments (2023-12) 8
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases (2024-07) 8
- AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications (2026-02) 8
- Key-value memory in the brain (2025-01) 8
- Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization (2026-02) 8
- Mem0: Memory-Centric Architecture for Large Language Models (2025-04) 7
💡 As memory applications proliferate across domains, Survey papers provide the essential synthesis—unifying fragmented terminology, establishing comprehensive taxonomies, and mapping the open challenges and future directions that span the entire memory research ecosystem.
Survey
- A Survey on the Memory Mechanism of Large Language Model based Agents (2024-04) 7
- A Survey of Personalized Large Language Models: Progress and Future Directions (2025-02) 7
- From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs (2025-04) 7
- Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics (2025-05) 9
- MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents (2025-06) 7
- Memory in Large Language Models: Mechanisms, Evaluation and Evolution (2025-09) 9
- Memory in the Age of AI Agents: A Survey (2025-12) 9
- Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers (2026-03) 9
🎯 Practical Recommendations
| Priority | Recommendation | Evidence |
|---|---|---|
| High | Use reinforcement learning to train memory management policies rather than hand-crafting heuristics for what to store and discard. RL-trained policies consistently outperform static rules and can generalize to context lengths 10–400× beyond their training, making them more robust across diverse deployment scenarios. | Memory-R1 achieved +28.5% F1 with only 152 training examples, Mem-α generalized from 30K training to 400K+ tokens, and MemAgent extrapolated from 8K to 3.5M tokens with <5% loss. |
| High | Adopt layered memory architectures that separate episodic (event-specific) from semantic (abstracted knowledge) memory stores, with active consolidation and forgetting mechanisms. This cognitive-science-inspired approach consistently outperforms flat vector retrieval for both personalization and long-horizon tasks. | Synapse achieved 40.5 F1 on LoCoMo with 95% fewer tokens using episodic-semantic separation. PRIME demonstrated that semantic memory outperforms episodic for capturing user traits. LightMem reduced tokens by 38× while improving accuracy by 29.3% through sensory filtering and sleep-time consolidation. |
| High | Treat the LLM context window as a scarce cache resource (like CPU L1 cache) rather than unlimited storage. Apply OS-inspired demand paging to evict stale content to backing stores and page it back in on demand, dramatically reducing context waste. | Pichay reduced context consumption by 93% in production sessions with only 0.025% page fault rate. MemoryOS achieved +49% F1 on LoCoMo with three-tier memory hierarchy and heat-based eviction. |
| High | Use graph-based memory structures with spreading activation or multi-graph architectures for tasks requiring multi-hop reasoning. Graph retrieval surfaces structurally connected but semantically distant memories that standard vector similarity search misses entirely. | HippoRAG improved multi-hop QA by 20% at 10–20× lower cost. MAGMA with four disentangled graph layers (semantic, temporal, causal, entity) outperformed MemoRAG and Hi-Mem. AssoMem improved recall by 24.93% in similarity-dense scenarios. |
| High | Implement memory security governance for any agent with persistent memory, including verification protocols, ground-truth anchoring against immutable observation ledgers, and monitoring for intent legitimation attacks. Memory injection attacks succeed at 98% rates through normal queries alone. | MINJA demonstrated 98.2% memory injection success via query-only interaction. PS-Bench showed benign memories increase attack success rates by up to 243.7%. SSGM proposes decoupling memory evolution from governance with verification protocols. |
| Medium | Combine parametric (LoRA-based) and non-parametric (retrieval-based) memory for personalization. Per-user LoRA adapters capture implicit behavioral patterns while retrieval provides up-to-date factual context, and the combination consistently outperforms either approach alone. | OPPU achieved state-of-the-art across all 7 LaMP tasks by combining both approaches. Comparing RAG and PEFT confirmed they are complementary with +1.06% gain from combination. |
| Medium | Evaluate memory systems using action-coupled benchmarks that test whether recalled information improves downstream task completion, not just retrieval accuracy. Static recall scores dramatically overestimate real-world capability. | MemoryArena showed agents with saturated static scores fail on interdependent tasks. LoCoMo-Plus revealed cognitive memory collapses across all models when implicit constraints are tested. GPT-4o drops to ~45% on composite Theory of Mind tasks. |
| Medium | For edge and mobile deployments, persist agent KV caches in quantized (4-bit) format to disk and use predictive pre-fetching during idle time. This transforms multi-agent edge deployment from infeasible to practical with 136× latency reduction for agent switching. | Persistent Q4 KV Cache reduced time-to-first-token by 136× on Apple M4 Pro. PerCache reduced mobile RAG latency by 34.4% through proactive query prediction during idle time. |
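The 4-bit persistence idea in the last row can be illustrated with a minimal Python sketch. It is not the Persistent Q4 KV Cache implementation: the function names `quantize_q4` and `dequantize_q4`, the single per-block scale, and the min/max affine scheme are assumptions chosen for brevity. It only shows how a block of cache values can be packed two per byte and restored with bounded error.

```python
from typing import List, Tuple


def quantize_q4(values: List[float]) -> Tuple[bytes, float, float]:
    """Map floats to 4-bit codes (0..15) and pack two codes per byte."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant block
    codes = [round((v - lo) / scale) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes pair up evenly
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_q4(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack the first n 4-bit codes and map them back to floats."""
    codes: List[int] = []
    for b in packed:
        codes.extend((b >> 4, b & 0x0F))
    return [lo + c * scale for c in codes[:n]]
```

The reconstruction error per value is at most half of `scale`, which is the trade-off accepted in exchange for roughly 8× smaller storage than 32-bit floats.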
🔑 Key Takeaways
Memory Is Now a Learnable Skill
Memory management has shifted from a passive engineering problem to an active cognitive capability that agents can learn through reinforcement learning. RL-trained memory policies that learn what to store, update, and delete from task outcomes consistently outperform hand-crafted heuristics. These policies also generalize far beyond their training lengths: Mem-α, trained on 30K tokens, performs well at 400K+ tokens, and MemAgent extrapolates from an 8K training context to 3.5M tokens.
Agents that learn to forget outperform agents that remember everything.
OS Concepts Power AI Memory
Operating system memory management principles—virtual memory paging, demand loading, cache hierarchies, and context switching—transfer remarkably well to LLM memory. PagedAttention eliminated 60–80% KV cache waste and became the serving standard, while Pichay's demand paging reduces agent context consumption by 93%. This OS-to-AI translation has become one of the most productive paradigms in the field.
The best AI memory systems are built like operating systems, not databases.
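The demand-paging idea can be sketched in a few lines of Python. This is a toy illustration, not the Pichay or MemoryOS implementation: the `PagedContext` class, its least-recently-used eviction policy, and the in-memory `backing` dict (standing in for disk or a database) are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Dict


class PagedContext:
    """Toy context window holding at most `capacity` chunks, with LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.resident: "OrderedDict[str, str]" = OrderedDict()  # chunks in context
        self.backing: Dict[str, str] = {}  # evicted chunks (hypothetical store)
        self.page_faults = 0

    def write(self, key: str, text: str) -> None:
        """Place a chunk in context, evicting the stalest chunks if needed."""
        self.resident[key] = text
        self.resident.move_to_end(key)
        self._evict()

    def read(self, key: str) -> str:
        """Return a chunk, paging it back in from the backing store on a miss."""
        if key not in self.resident:
            self.page_faults += 1
            self.resident[key] = self.backing.pop(key)
        self.resident.move_to_end(key)
        self._evict()
        return self.resident[key]

    def _evict(self) -> None:
        # Move least-recently-used chunks out of context until we fit.
        while len(self.resident) > self.capacity:
            old_key, old_text = self.resident.popitem(last=False)
            self.backing[old_key] = old_text
```

With `capacity=2`, writing chunks `a`, `b`, `c` evicts `a`; a later `read("a")` counts one page fault, restores `a`, and evicts `b` in its place. The ratio of `page_faults` to reads is the kind of page-fault rate the paragraph above refers to.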
Persistent Memory Creates New Attack Surfaces
As memory systems become more capable, they introduce novel security vulnerabilities that traditional prompt-level defenses cannot address. Memory injection attacks succeed about 98% of the time through normal queries alone. Even benign personal memories can bypass safety filters through 'intent legitimation,' increasing attack success rates by up to 243.7%. Memory governance with verification protocols and ground-truth anchoring is essential for safe deployment.
Every memory an agent stores is a potential weapon an adversary can exploit.
Static Benchmarks Mask Real Failures
Agents scoring near-perfectly on standard memory recall benchmarks fail dramatically when memory must actively guide multi-session decisions. Frontier models achieve only ~50% on dynamic personalization tasks and lag humans by 56–73% on long-term conversational memory. The gap between retrieval recall (90%+) and generation faithfulness (~60%) creates a dangerous illusion of capability that action-coupled evaluation frameworks are now exposing.
High retrieval accuracy hides the fact that models cannot actually use what they remember.
Cognitive Science Inspires Best Architectures
The most effective memory systems consistently draw from cognitive science—separating episodic from semantic memory, implementing Ebbinghaus forgetting curves for active decay, using hippocampal-inspired graph indexing for associative retrieval, and applying spreading activation for multi-hop reasoning. These biologically-grounded designs outperform engineered alternatives, with Synapse achieving 95% token reduction while improving accuracy through cognitive-inspired dual-layer graph dynamics.
The brain's memory blueprint remains the most reliable guide for building AI memory.
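As a rough illustration of forgetting-curve decay, the sketch below uses the common exponential form of the Ebbinghaus curve, R = exp(-t / S), where t is time since last access and S is a stability term. The `MemoryItem` fields, the hour-based units, the doubling `boost`, and the 0.2 pruning threshold are assumptions for this example, not values taken from any system named above.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    text: str
    strength: float     # stability S in hours (assumed unit); larger decays slower
    last_access: float  # time of last access, in hours


def retention(item: MemoryItem, now: float) -> float:
    """Exponential forgetting curve: R = exp(-elapsed / strength)."""
    return math.exp(-(now - item.last_access) / item.strength)


def recall(item: MemoryItem, now: float, boost: float = 2.0) -> None:
    """Accessing a memory resets its clock and makes it decay more slowly."""
    item.strength *= boost
    item.last_access = now


def forget(items: List[MemoryItem], now: float, threshold: float = 0.2) -> List[MemoryItem]:
    """Keep only memories whose retention is still at or above the threshold."""
    return [m for m in items if retention(m, now) >= threshold]
```

Memories that are recalled often accumulate strength and survive pruning, while untouched ones fade below the threshold and are dropped.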
Small Models Beat Giants with Smart Memory
Well-designed memory architectures consistently enable small models to outperform much larger ones. A 4B model with RL-trained memory outperforms GPT-5 on personalization using 16× fewer tokens. External procedural memory built in 56 seconds outperforms models 10× larger. A 7B model with dual memory surpasses GPT-4 on tool use. Memory architecture matters more than model scale for persistent tasks.
The right memory can make a small model smarter than a giant one.
🚀 Emerging Trends
Self-context engineering is emerging as a new paradigm where models actively curate their own working memory by reading information, distilling it into compact notes, and deleting raw source material. This 'Pensieve' paradigm enables unlimited effective context through a sawtooth memory profile, achieving 10× improvements on deep research tasks where standard LLMs fail completely.
StateLM introduced models that use deleteContext tools to self-manage their context, achieving 52% on BrowseComp-Plus versus 5% for standard LLMs. SUPO made summarization a learnable RL action, enabling test-time scaling to 23 summary steps. Agentic Context Engineering decomposed context into structured bullets with role-separated adaptation.
📄 StateLM: To the Rescue of Long-Horizon Reasoning with Recursive Memory (2026), Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management (2025), Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models (2025)
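The read, distill, delete loop and its sawtooth memory profile can be traced with a small Python sketch. It is a simplified illustration rather than StateLM's method: `run_with_notes` and the caller-supplied `distill` function are hypothetical, and context size is measured in characters instead of tokens.

```python
from typing import Callable, List


def run_with_notes(documents: List[str],
                   distill: Callable[[str], str]) -> List[int]:
    """Read each document, keep a distilled note, then drop the raw text.

    Returns the context size (in characters) after every read and every
    delete, which traces the sawtooth: up on read, down on delete.
    """
    notes: List[str] = []
    sizes: List[int] = []
    for doc in documents:
        context = notes + [doc]          # read: raw text enters the context
        sizes.append(sum(len(c) for c in context))
        notes.append(distill(doc))       # distill: compress into a compact note
        context = list(notes)            # delete: raw text leaves the context
        sizes.append(sum(len(c) for c in context))
    return sizes
```

With two documents of 100 and 200 characters and a `distill` that keeps the first 10 characters, the sizes are `[100, 10, 210, 20]`: peaks track the largest single document, while the floor grows only with the notes.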
Memory governance and constitutional memory architectures are emerging as critical infrastructure for deploying persistent AI agents. Researchers are proposing formal frameworks where core identity memories are immutable, memory updates require verification protocols, and ground-truth anchoring periodically reconciles evolved memory against observation ledgers to prevent semantic drift.
SSGM formalized memory governance as a POMDP with decay modeling and ground-truth anchoring. Memory-as-Ontology proposed that memory constitutes agent identity with constitutional governance layers. The Turn language made memory isolation compiler-enforced rather than relying on fragile conventions.
📄 Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the SSGM Framework (2026), Memory as Ontology: A Constitutional Memory Architecture for Persistent Digital Citizens (2026), Turn: A Language for Agentic Computation (2026)
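Ground-truth anchoring against an immutable observation ledger can be sketched as follows. This is an illustrative toy, not the SSGM framework: the `ObservationLedger` class, the SHA-256 hash chain used to detect tampering, and the `reconcile` function that resets drifted keys are assumptions, and real evolved memories are rarely directly comparable key by key as they are here.

```python
import hashlib
from typing import Dict, List, Tuple


class ObservationLedger:
    """Append-only log; each entry's hash also covers the previous hash."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str, str]] = []  # (key, value, chain hash)

    @staticmethod
    def _digest(prev: str, key: str, value: str) -> str:
        return hashlib.sha256(f"{prev}|{key}|{value}".encode()).hexdigest()

    def append(self, key: str, value: str) -> None:
        prev = self.entries[-1][2] if self.entries else ""
        self.entries.append((key, value, self._digest(prev, key, value)))

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = ""
        for key, value, digest in self.entries:
            if digest != self._digest(prev, key, value):
                return False
            prev = digest
        return True

    def latest(self) -> Dict[str, str]:
        """Most recent observed value for each key."""
        return {key: value for key, value, _ in self.entries}


def reconcile(memory: Dict[str, str], ledger: ObservationLedger) -> List[str]:
    """Reset keys that drifted from the ledger; return the corrected keys."""
    if not ledger.verify():
        raise ValueError("ledger hash chain is broken")
    truth = ledger.latest()
    drifted = [k for k, v in memory.items() if k in truth and truth[k] != v]
    for k in drifted:
        memory[k] = truth[k]
    return drifted
```

Running `reconcile` periodically bounds how far evolved memory can drift from what was actually observed, while keys with no ledger entry (for example, derived notes) are left untouched.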
Multi-graph disentangled memory architectures that maintain parallel relationship layers (semantic, temporal, causal, entity) are replacing monolithic knowledge graphs and flat vector stores. These architectures enable intent-aware retrieval where the system selects which relationship type to prioritize based on query type—'why' queries use causal edges, 'when' queries use temporal edges.
MAGMA introduced four parallel graphs with policy-guided traversal and dual-stream evolution. Synapse unified episodic-semantic memory with spreading activation and lateral inhibition, reducing tokens by 95%. AssoMem fused graph-based importance with temporal alignment using adaptive mutual information weighting.
📄 MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents (2026), Synapse: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation (2026), AssoMem: An Associative Memory Framework for Context-Aware Memory Recall (2025)
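Intent-aware retrieval over parallel relationship layers can be illustrated with a short Python sketch. It is not MAGMA's policy-guided traversal: the keyword-based `pick_layer` router, the `retrieve` function, and the breadth-first two-hop expansion are simplifying assumptions. Only the mapping of 'why' to causal edges and 'when' to temporal edges comes from the description above; the semantic fallback is an assumption.

```python
from typing import Dict, List, Set

Graph = Dict[str, List[str]]  # adjacency list for one relationship layer

# 'why' and 'when' follow the text; everything else falls back to semantic.
INTENT_TO_LAYER = {"why": "causal", "when": "temporal"}


def pick_layer(query: str) -> str:
    """Choose a relationship layer from the query's leading question word."""
    words = query.strip().lower().split()
    return INTENT_TO_LAYER.get(words[0], "semantic") if words else "semantic"


def retrieve(query: str, start: str, layers: Dict[str, Graph], hops: int = 2) -> List[str]:
    """Breadth-first expansion from `start` along the selected layer only."""
    graph = layers[pick_layer(query)]
    seen: Set[str] = {start}
    frontier = [start]
    found: List[str] = []
    for _ in range(hops):
        next_frontier: List[str] = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    found.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found
```

Because each layer is traversed separately, a 'why' query starting from the same memory node reaches causes that a 'when' query never touches, which is the benefit of keeping the layers disentangled.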
Associative memory networks based on Hopfield network theory are scaling as principled alternatives to Transformer attention, offering transparent compositional reasoning and superior long-context performance. At 10B parameters and 1T training tokens, these architectures outperform standard Transformers by 12–15% on multi-document QA while providing interpretable memory heads.
Memory Mosaics at scale outperformed Transformers on multi-document QA at 32K context. The mathematical equivalence between self-attention and Hopfield update rules provides theoretical grounding. Routing without Forgetting applied Hopfield Pooling for instant per-sample adaptation in continual learning.
📄 Memory Mosaics at scale (2025), In-Context Exemplars as Clues to Retrieving from Large Associative Memory (2023), Routing without Forgetting (2026)
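The equivalence between self-attention and the Hopfield update rule can be checked numerically. The sketch below uses the modern (continuous) Hopfield retrieval step, in which the new state is a softmax-weighted mix of stored patterns with inverse temperature β, and compares it with single-query dot-product attention where keys and values are both set to the stored patterns. The function names and the pure-Python list representation are assumptions for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def hopfield_update(patterns: List[Vector], query: Vector, beta: float) -> Vector:
    """One retrieval step: softmax(beta * patterns . query) mixes the patterns."""
    weights = softmax([beta * dot(p, query) for p in patterns])
    return [sum(w * p[i] for w, p in zip(weights, patterns)) for i in range(len(query))]


def attention(query: Vector, keys: List[Vector], values: List[Vector], scale: float) -> Vector:
    """Single-query dot-product attention with an explicit scale factor."""
    weights = softmax([scale * dot(k, query) for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
```

Calling `attention(q, patterns, patterns, beta)` returns the same vector as `hopfield_update(patterns, q, beta)`. In a Transformer, β corresponds to the 1/√d scaling, and learned projections produce the queries, keys, and values.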
🔭 Research Opportunities
Unified memory evaluation frameworks that test dynamic operations (updating, forgetting, conflict resolution) across conversational, agentic, and multi-modal settings, rather than the fragmented static recall benchmarks that currently dominate the field.
Current benchmarks are scattered across different tasks, metrics, and LLM backends, making cross-method comparison nearly impossible. Static recall scores overestimate capability by 30–40%, and most benchmarks ignore critical operations like memory rewriting and selective forgetting that are essential for real-world deployment.
Difficulty: Medium; Impact: High
Privacy-preserving memory architectures that provide provable data minimization guarantees while maintaining personalization quality, addressing the growing tension between memory capability and user privacy as agents store increasingly sensitive personal information.
Current memory systems have no established mechanisms for memory governance, user consent, or secure inheritance when models are upgraded. Memory extraction attacks can recover private information stored in agent memory, and the field lacks formal privacy frameworks for persistent memory.
Difficulty: High; Impact: High
Multi-modal memory recall systems that can retrieve and reason over visual, audio, and video memories from past interactions, extending beyond the text-only focus of current memory architectures to match how humans naturally encode and recall experiences.
Most memory evaluation benchmarks focus exclusively on text-based recall, yet real-world personal assistants capture vast streams of multi-modal data (photos, videos, audio). Only Pensieve has addressed multi-modal memory QA, with 14% improvement over standard approaches, suggesting substantial untapped potential.
Difficulty: High; Impact: High
Memory consistency protocols for multi-agent systems, analogous to cache coherence in multiprocessor hardware, that guarantee agents see up-to-date and non-contradictory shared information when reading and writing concurrently.
As multi-agent deployments grow, the lack of formal consistency guarantees means agents can act on stale or contradictory information. Computer architecture has decades of cache coherence protocols that could be adapted, but no existing agent system provides formal consistency guarantees.
Difficulty: High; Impact: High
Implicit preference extraction and cognitive memory that can capture user preferences expressed through behavior rather than explicit statements, closing the current 27-point gap between explicit and implicit intent recognition observed in frontier models.
Frontier models achieve only ~50% on dynamic personalization tasks requiring evolving user tracking, and cognitive memory collapses across all models when implicit constraints lack lexical overlap with queries. RL-trained memory shows promise (PersonaMem-v2 outperforms GPT-5), but the general problem remains far from solved.
Difficulty: Medium; Impact: High
Transferable memory policies that generalize across different LLM architectures, domains, and deployment environments without requiring retraining, addressing the current limitation that RL-based memory management is tightly coupled to specific models and task distributions.
While RL-trained memory policies show impressive within-distribution generalization (8K training to 3.5M tokens), cross-architecture and cross-domain transfer remains largely unexplored. Universal memory models like NAMMs that operate on attention patterns rather than token embeddings offer a promising direction.
Difficulty: Medium; Impact: Medium
🏆 Benchmark Leaderboard
LoCoMo
Long-term conversational memory across 300+ turn multi-session dialogues, testing factual recall, temporal reasoning, multi-hop inference, and adversarial robustness (Metric: F1 Score)
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 🥇 | MemoryOS (segmented paging + heat eviction) | +49.11% average F1 improvement over baselines, using GPT-4o-mini | MemoryOS (2025) | 2025 |
| 🥈 | Synapse (spreading activation + lateral inhibition) | 40.5 Weighted F1 — +21.6% over A-Mem with 95% fewer tokens | Synapse (2026) | 2026 |
| 🥉 | Memory-R1 (GRPO-based RL) | +28.5% F1 over MemoryOS baseline, using only 152 training examples | Memory-R1 (2025) | 2025 |
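For readers unfamiliar with the metric, the sketch below shows the token-overlap F1 commonly used for question answering. It is an assumption that LoCoMo scores answers in roughly this way; the benchmark's exact normalization, and the "Weighted F1" variant reported for Synapse, are not specified here. Note also that the table mixes relative improvements with absolute scores, so the rows are not directly comparable.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (case-insensitive)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A verbose but correct answer such as "in paris france" against the reference "paris" scores 0.5, which is why F1 rewards concise recall rather than mere inclusion of the right token.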
AMA-Bench (Agent Memory)
Long-horizon agent memory over machine-generated interaction logs spanning SQL queries, web navigation, and programmatic environments (Metric: Average Accuracy)
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 🥇 | GPT-5.2 (long-context, upper bound) | 72.26% — Frontier model still far from perfect, indicating significant room for improvement | AMA-Bench (2026) | 2026 |
| 🥈 | AMA-Agent (causality graph + tool-augmented retrieval) | 57.22% — +11.16% over strongest memory system baseline | AMA-Bench (2026) | 2026 |
ALFWorld (Household Tasks)
Interactive household task completion requiring procedural memory and multi-step planning in text-based environments (Metric: Success Rate)
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 🥇 | MACLA (contrastive procedural memory) | 90.3% — +3.1% positive generalization gap on unseen tasks | MACLA (2025) | 2025 |
| 🥈 | UMEM (unified extraction + management) | 82.84% — Monotonic performance growth during continuous evolution | UMEM (2026) | 2026 |
GAIA (General AI Assistants)
General AI assistant capabilities requiring multi-step reasoning, tool use, and web browsing with persistent memory (Metric: Pass@3)
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 🥇 | Memento (memory-augmented MDP) | 87.88% Pass@3 — +4.7 to +9.6% over baselines on out-of-distribution tasks | Memento (2025) | 2025 |
| 🥈 | AGENTKB + smolagents | 73.9% — +18.7pp over smolagents baseline (55.2%) | AGENTKB (2025) | 2025 |
PersonaMem (Dynamic Personalization)
Tracking evolving user personas over long interaction histories up to 1M tokens with both explicit and implicit preference signals (Metric: Multiple-Choice Accuracy)
| Rank | Method | Score | Paper | Year |
|---|---|---|---|---|
| 🥇 | PersonaMem-v2 (RL agentic memory) | 55% on implicit personalization — Outperforms GPT-5 (~40-48%) with 16× fewer tokens | PersonaMem-v2 (2025) | 2025 |
| 🥈 | Frontier models (GPT-4.5, Gemini-1.5, o1) | ~50% on PersonaMem (only ~25 percentage points above the 25% random baseline) | Know Me, Respond to Me:... (2025) | 2025 |