📖 What are Memory-Augmented LLMs?

Memory in LLMs addresses how AI systems store, organize, retrieve, and evolve information across interactions to enable personalization, long-term coherence, and learning from experience.

💡 Why it Matters

As LLMs evolve from stateless single-turn tools to persistent agents managing long-running tasks, effective memory becomes critical for maintaining coherence across sessions, personalizing responses to individual users, and enabling agents to learn from accumulated experience without costly retraining.

🎯 Key Paradigms

Memory Organization

Research on structuring and storing past interactions and knowledge for LLM agents, covering linear buffers with learned eviction, multi-layered hierarchies inspired by human cognition, tree/graph-based associative structures, parameter internalization via LoRA adapters, and compression through summarization and consolidation.

Memory Recall

Methods for retrieving information from past interactions and logs to answer users' recall questions and help them remember multi-modal memories, spanning sparse and dense memory QA, conversational memory retrieval, multi-modal recall, and temporal-episodic reconstruction.

Memory for Agentic Systems

Memory systems designed specifically for LLM-based agents, enabling persistent state across sessions, experience replay and reflection for continuous improvement, memory-augmented planning, coordinated memory sharing across multi-agent teams, and formal evaluation of agent memory effectiveness.

📅 Field Evolution Timeline

2015-03 to 2023-12 Foundations

Foundational memory-augmented architectures, OS-inspired KV cache management, and early cognitive memory models

  • End-to-End Memory Networks (MemN2N, 2015) introduced fully differentiable multi-hop attention over external memory, establishing the paradigm of differentiable memory access that influenced all subsequent memory-augmented architectures.
  • Transformer Feed-Forward as Key-Value Memory (FFN-as-KV, 2021) reinterpreted two-thirds of Transformer parameters as key-value memory stores, providing the theoretical foundation for knowledge editing and memory manipulation research.
  • PagedAttention (vLLM, 2023) revolutionized KV cache management by applying OS virtual memory paging concepts to GPU memory, achieving 2–4× throughput improvement and becoming the standard for production LLM serving.
  • MemGPT (MemGPT, 2023) pioneered treating the LLM context window as RAM with external storage as disk, enabling self-directed memory paging and inspiring the OS-inspired memory paradigm adopted by subsequent systems.
Memory shifted from static external stores to differentiable, self-managed systems integrated into the model's reasoning loop.
2024-01 to 2024-12 Systematization

Systematic taxonomies for agent memory, neuroscience-inspired retrieval, memory-efficient training breakthroughs, and first long-term benchmarks

  • The first comprehensive agent memory survey (Memory Survey, 2024) established a unified taxonomy deconstructing memory into Sources, Forms, and Operations, providing shared vocabulary for the fragmented field.
  • HippoRAG (HippoRAG, 2024) introduced neurobiologically-inspired graph retrieval using knowledge graphs with Personalized PageRank, achieving 20% gains on multi-hop QA at 10–20× lower cost than iterative methods.
  • GaLore (GaLore, 2024) democratized LLM pre-training by projecting gradients into low-rank subspaces, reducing optimizer memory by 65% and enabling 7B model training on a single 24GB consumer GPU.
  • LoCoMo (LoCoMo, 2024) established the first very long-term conversational memory benchmark with 300+ turn dialogues, revealing that even frontier LLMs lag behind humans by 56–73% on memory tasks.
Memory research transitioned from ad-hoc implementations to systematic taxonomies and rigorous long-term evaluation.
2025-01 to 2025-12 RL-Driven Memory

Reinforcement learning emerges as the dominant training paradigm for memory management, with breakthroughs in experience-based self-evolution, implicit personalization, and cross-framework knowledge sharing

  • MACLA (MACLA, 2025) demonstrated that external procedural memory built in 56 seconds can outperform models 10× larger, achieving 78.1% average across four benchmarks through contrastive refinement of success/failure pairs.
  • Memory-R1 (Memory-R1, 2025) was the first to apply GRPO-based RL to memory ADD/UPDATE/DELETE operations, achieving +28.5% F1 improvement on LoCoMo with only 152 training examples.
  • PersonaMem-v2 (PersonaMem-v2, 2025) showed that RL-trained agentic memory enables a 4B model to outperform GPT-5 on implicit personalization while using 16× fewer tokens, establishing a new paradigm for efficient user modeling.
  • AGENTKB (AGENTKB, 2025) created a universal cross-framework memory layer enabling knowledge transfer across incompatible agent architectures, with +18.7pp improvement on GAIA and +17.0pp on SWE-bench Lite.
Memory management shifted from static heuristics to learned RL policies that autonomously decide what to store, update, and forget based on task-outcome rewards.
2026-01 to 2026-03 Active Self-Management

Models learn to actively manage their own context, formal memory governance emerges, and evaluation reveals critical gaps between static recall and active memory-guided decisions

  • StateLM (StateLM, 2026) introduced the Pensieve paradigm where models self-manage context via read-note-delete cycles, achieving 52% on deep research tasks where standard LLMs score only 5%.
  • Pichay (Pichay, 2026) applied OS demand-paging principles to LLM context, reducing context consumption by 93% in production with only 0.025% page fault rate.
  • MemoryArena (MemoryArena, 2026) revealed that agents with near-perfect static memory scores fail dramatically on interdependent multi-session tasks, fundamentally redefining how memory should be evaluated.
  • Context Channel Capacity (CCC, 2026) proved an Impossibility Triangle: zero forgetting, online learning, and finite parameters cannot coexist for sequential state-based learners.
The field converged on agents that actively curate their own context as a cognitive skill, with formal information-theoretic foundations explaining why some architectures fundamentally cannot avoid forgetting.
🔧

Memory Organization

What: Research on structuring, storing, and retrieving past interactions and knowledge for LLM-based agents, covering both inference-level memory management (KV cache) and cognitive-level memory design for recall QA and personalized conversation.

Why: As LLMs evolve from stateless single-turn tools to persistent agents with long-term interactions, effective memory organization becomes critical for maintaining coherence, personalization, and the ability to learn from experience without retraining.

Baseline: Conventional approaches either feed the entire conversation history into the prompt (computationally expensive and unscalable beyond context limits) or use flat vector similarity search over stored text chunks (shallow retrieval that misses implicit preferences and dispersed context).

  • Deciding what to store, update, or delete without explicit supervision—most user preferences are expressed implicitly across many sessions
  • Scaling retrieval precision as memory grows, since larger memory banks introduce more noise and irrelevant matches
  • Balancing memory persistence with adaptability—agents must retain useful knowledge while overwriting outdated information when circumstances change
  • Bridging the gap between low-level inference memory (KV cache management) and high-level cognitive memory (experiences, preferences, procedures)

🧪 Running Example

❓ A user asks their AI assistant: 'What restaurant did I mention wanting to try for our anniversary?' The relevant preference was casually mentioned 3 weeks ago during a conversation about weekend plans.

Baseline: A standard RAG system performs vector similarity search for 'restaurant anniversary' and retrieves recent mentions of restaurants from unrelated conversations (e.g., a lunch recommendation), missing the actual preference buried in a weekend-planning session from weeks ago.

Challenge: The preference was stated implicitly ('that new Italian place looked amazing for a special occasion'), never labeled as a preference, and surrounded by unrelated discussion topics. Simple similarity search lacks the depth to reconstruct this episodic memory.

✅ Recollection–Familiarity Adaptive Retrieval (RF-Mem): First performs a fast familiarity check, detects ambiguity (multiple restaurant mentions), then triggers deeper recollection that clusters and re-queries to find the specific anniversary-related mention
✅ Segment-Level Memory with Compression (SeCom): Segments the conversation history by topic rather than by turn, so the weekend-planning segment containing the restaurant preference is retrieved as a coherent unit rather than individual scattered turns
✅ RL-Optimized Memory Management (Memory-R1): The RL-trained memory manager would have identified the restaurant preference as worth storing when it was first mentioned, creating an explicit memory entry that can be directly retrieved later
✅ Active Context Engineering (StateLM): StateLM would have distilled the key preference into a persistent note during the original conversation, then deleted the raw chat, leaving a clean retrievable fact about the user's restaurant wish
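
The dual-process retrieval described for RF-Mem in the running example can be sketched roughly as follows. This is a minimal illustration, not RF-Mem's actual algorithm: token-overlap scoring stands in for an embedding model, the `margin` threshold is an arbitrary choice, and the `requeries` argument is a hypothetical placeholder for the re-queries a real system would derive by clustering candidate memories.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    session: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding model."""
    return {w.strip(".,?!'\"").lower() for w in text.split()} - {""}


def score(query: str, memory: Memory) -> float:
    """Jaccard overlap between query tokens and memory tokens."""
    q, m = tokens(query), tokens(memory.text)
    return len(q & m) / len(q | m) if (q | m) else 0.0


def retrieve(query: str, store: list[Memory], requeries: tuple[str, ...] = (),
             margin: float = 0.05) -> tuple[Memory, str]:
    """Assumes a non-empty store. Returns the chosen memory and the pass used."""
    # Familiarity pass: a single cheap ranking over all memories.
    ranked = sorted(store, key=lambda m: score(query, m), reverse=True)
    if len(ranked) < 2 or score(query, ranked[0]) - score(query, ranked[1]) >= margin:
        return ranked[0], "familiarity"
    # Recollection pass: the top candidates are too close to call, so score
    # every memory against the original query plus the supplied re-queries.
    probes = (query, *requeries)
    best = max(store, key=lambda m: max(score(p, m) for p in probes))
    return best, "recollection"


store = [
    Memory("lunch", "Try the sushi restaurant for lunch"),
    Memory("dinner", "Try the taco restaurant for dinner"),
    Memory("weekend", "That new Italian place looked amazing for a special occasion"),
]
query = "What restaurant did I mention wanting to try for our anniversary?"
hit, mode = retrieve(query, store, requeries=("Italian place for a special occasion",))
print(mode, "->", hit.session)
```

In this toy store the two restaurant mentions tie on the first pass, which is the ambiguity signal. The second pass then resolves to the weekend-planning session, even though that memory never contains the word "restaurant".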

📈 Overall Progress

Memory has evolved from a passive storage problem (KV cache paging) to an active cognitive capability where agents learn to manage their own memory through reinforcement learning and self-context engineering.

📂 Sub-topics

KV Cache & Inference Memory Management

10 papers

Methods for efficiently allocating, compressing, and retrieving Key-Value cache data during LLM inference, including paged memory, hierarchical indexing, and learned eviction policies.

PagedAttention, vAttention, LycheeCluster, DapQ

Agent Memory Architecture & Taxonomies

8 papers

Surveys, taxonomies, and systematic evaluations of memory structures for LLM-based agents, including classification by form, function, and dynamics.

Forms-Functions-Dynamics Taxonomy, 3D-8Q Taxonomy, Mixed Memory Architecture, Memory-as-Ontology

Experience-Driven Memory & Self-Evolution

12 papers

Methods that enable agents to accumulate, refine, and reuse experience across tasks through external memory, including RL-optimized memory management, procedural memory learning, and active context engineering.

Dynamic Cheatsheet, Memory-R1, Mem-α, MACLA

Personalized & Conversational Memory

8 papers

Memory systems designed for long-term dialogue and user personalization, including dual-process retrieval, segment-level compression, persona management, and structured data selection.

RF-Mem, SeCom, LD-Agent, PersonaMem-v2

Embodied & Domain-Specific Memory

8 papers

Memory architectures tailored for robotics, vision-language navigation, video generation, and clinical AI, where memory must encode spatial, temporal, or procedural information beyond text.

SAM2Act+, JanusVLN, MEM, VMem

Parametric & Scalable Memory

5 papers

Approaches that embed memory directly into model parameters or use learned memory layers, including product-key memory, memory distillation, and continuous abstraction mechanisms.

Scalable Product-Key Memory Layers, Parametric Memory Distillation, Semantic Level of Detail

💡 Key Insights

💡 Memory management is shifting from a passive storage problem to an active cognitive skill that agents can learn through reinforcement learning.

💡 Positional encoding matters more than semantic content for KV cache importance scoring—where a token appears outweighs what it contains.

💡 RL-trained memory policies generalize dramatically: models trained on 30k tokens perform well at 400k+ tokens without degradation.

💡 External procedural memory built in seconds can outperform models 10x larger by decoupling reasoning from adaptation.

💡 Segment-level memory granularity consistently outperforms both turn-level and session-level retrieval for long conversations.

💡 The field is converging on memory as the locus of agent identity—models are replaceable vessels, but memory persists and defines the self.

📅 Timeline

Early work (2020-2023) focused on efficient memory allocation for inference. 2024 brought systematic taxonomies and learned compression methods. 2025 saw the rise of RL-optimized memory management and experience-based self-evolution. By early 2026, the field converges on agents that actively curate their own context, with deterministic retrieval replacing fuzzy search and identity-preserving architectures emerging.

2020-08 to 2023-11 Foundational memory architectures for inference and interaction
  • PagedAttention (PagedAttention, 2023) revolutionized KV cache management by applying OS virtual memory concepts to GPU memory, achieving 2-4x throughput improvement and near-zero waste
  • UMGR (UMGR, 2020) pioneered memory graph reasoning for conversational recommendation, unifying offline user history and online dialog state in a single heterogeneous graph
  • Coop (Coop, 2023) co-optimized tensor allocation and rematerialization, reducing memory fragmentation to under 5% for large model training
  • InRecAgent (InRecAgent, 2023) introduced a shared Candidate Bus memory for recommendation agents, replacing expensive in-context item lists with external storage
2024-01 to 2024-12 Systematization of agent memory and advanced KV cache techniques
  • The agent memory survey (Memory Survey, 2024) established a unified taxonomy deconstructing memory into Sources, Forms, and Operations
  • vAttention (vAttention, 2024) replaced PagedAttention's non-contiguous design with GPU virtual memory mapping, improving throughput by up to 1.99x over vLLM
  • NAMMs (NAMMs, 2024) evolved neural memory models that outperformed full-context Llama-3-8B by +11% on LongBench while reducing cache size
  • Memory Layers (Memory Layers, 2024) proved parametric memory viable at 128 billion parameters, doubling factual accuracy over dense baselines
  • MemoRAG (MemoRAG, 2024) introduced dual-system global memory augmented retrieval, outperforming GPT-4-128k on summarization tasks by a large margin
2025-01 to 2025-12 RL-driven memory management and experience-based self-evolution
  • MACLA (MACLA, 2025) demonstrated that external procedural memory built in 56 seconds can outperform models 10x larger, achieving 78.1% average across four benchmarks
  • Dynamic Cheatsheet (DC, 2025) enabled test-time learning where GPT-4o improved from 10% to 99% on Game of 24 by curating a persistent memory buffer
  • Memory-R1 (Memory-R1, 2025) applied GRPO-based RL to memory operations, achieving +28.5% F1 improvement on LoCoMo with only 152 training examples
  • PersonaMem-v2 (PersonaMem-v2, 2025) showed that RL-trained agentic memory enables a 4B model to outperform GPT-5 on implicit personalization using 16x fewer tokens
  • SeCom (SeCom, 2025) established segment-level memory as superior to both turn-level and session-level retrieval for long conversations
  • SAM2Act (SAM2Act, 2025) achieved 94.3% on memory-dependent robotic tasks by integrating explicit memory banks into visual-motor policies
2026-01 to 2026-03 Active self-context engineering, deterministic adaptation, and memory as identity
  • StateLM (StateLM, 2026) introduced the Pensieve paradigm where models self-manage context via read-note-delete cycles, achieving 52% on deep research tasks versus 5% for standard LLMs
  • PRECEPT (PRECEPT, 2026) replaced fuzzy natural language retrieval with deterministic exact-match rule lookup and Bayesian conflict resolution, gaining +41pp over Reflexion on hard tasks
  • Arbiter (Arbiter, 2026) revealed that agent system prompts contain critical memory-related bugs (including data loss in Gemini CLI) detectable via formal analysis for just $0.27
  • DxEvolve (DxEvolve, 2026) demonstrated self-evolving clinical diagnosis surpassing human experts (90.4% vs 88.8%) by accumulating diagnostic cognition primitives as memory
  • LycheeCluster (LycheeCluster, 2026) achieved 3.6x inference speedup through structure-aware hierarchical KV indexing with mathematical safety guarantees

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Paged & Virtual Memory for KV Cache | Treat GPU KV cache like OS virtual memory: allocate physical pages on demand and map them to contiguous virtual addresses, eliminating fragmentation without modifying attention kernels. | Static contiguous allocation that wastes 60-80% of KV cache memory due to over-provisioning for unknown sequence lengths | Efficient Memory Management for Large... (2023), vAttention: Dynamic Memory Management for... (2024), MemServe (2024)
Learned & Hierarchical KV Cache Compression | Replace hand-designed cache eviction heuristics with learned models or hierarchical indices that mathematically bound attention scores to safely prune irrelevant cache entries. | Fixed-window or attention-score-based heuristics (like H2O or SnapKV) that miss tokens critical for future decoding steps | LycheeCluster (2026), Where Matters More Than What:... (2026), An Evolved Universal Transformer Memory (2024)
RL-Optimized Memory Management | Optimize memory management decisions directly against final answer correctness using RL, letting the agent learn what to store and when to update without explicit supervision. | Static heuristic-based or prompt-instructed memory management that fails to adapt to diverse interaction patterns | Memory-R1 (2025), Mem-α: Training LLMs to Manage... (2025), PersonaMem-v2 (2025)
Active Context Engineering | Give the model a deleteContext tool so it can actively curate its working memory, distilling raw input into notes and freeing space for new information. | Standard LLMs that monotonically accumulate context until hitting length limits, then either truncate or fail | StateLM (2026), Distilling Feedback into Memory-as-a-Tool (2026)
External Procedural Memory with Contrastive Refinement | Decouple reasoning from learning by storing reusable procedures externally and refining them via contrastive analysis of success/failure pairs. | Parameter fine-tuning approaches that are expensive, entangle reasoning with adaptation, and risk catastrophic forgetting | MACLA (2025), Dynamic Cheatsheet (2025), RetroAgent (2026)
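
The on-demand page allocation in the first method above can be sketched as a toy block table. This is an illustrative model only, not vLLM's or vAttention's actual API: the class name `PagedKVCache`, its methods, and the block sizes are all assumptions, and no attention computation or GPU memory is involved.

```python
class PagedKVCache:
    """Toy block table mapping each sequence to its physical KV blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # physical block ids
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> block ids
        self.lengths: dict[str, int] = {}              # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            # The last block is full (or none exists yet): allocate on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots across all sequences."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.lengths.values())


cache = PagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("request-a")
print(cache.block_tables["request-a"], cache.wasted_slots())
```

Because blocks are claimed only when a sequence actually grows into them, waste is bounded by one partially filled block per sequence rather than by a worst-case length reserved up front.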

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
LoCoMo | F1 Score | +28.5% F1 over MemoryOS baseline | Memory-R1 (2025)
ALFWorld | Success Rate | 90.3% | MACLA (2025)
Needle-in-a-Haystack (NIAH) | Accuracy | 99.5% with 3% KV budget (256 tokens) | Where Matters More Than What:... (2026)

⚠️ Known Limitations (5)

  • Evaluation fragmentation: benchmarks are scattered across conversational QA, agentic tasks, and inference efficiency with no unified evaluation framework, making cross-method comparison difficult and rewarding methods optimized for narrow metrics. (affects: RL-Optimized Memory Management, External Procedural Memory with Contrastive Refinement, Segment-Level Memory with Compression)
    Potential fix: Unified evaluation suites like Evo-Memory and ATOD that test memory across multiple dimensions (retention, rewriting, generalization) in a single framework.
  • Memory rewriting remains underexplored: most systems excel at retaining information but struggle to selectively overwrite outdated content when circumstances change, leading to stale or contradictory memory states. (affects: RL-Optimized Memory Management, Dual-Process Adaptive Retrieval, Segment-Level Memory with Compression)
    Potential fix: Diagnostic benchmarks like Endless T-Maze that explicitly test overwrite capabilities, and Bayesian conflict detection mechanisms (as in PRECEPT) to identify and resolve stale knowledge.
  • Scalability of learned memory policies: RL-based memory management requires significant training infrastructure and may not transfer across different LLM architectures or domains without retraining. (affects: RL-Optimized Memory Management, Learned & Hierarchical KV Cache Compression)
    Potential fix: Universal memory models like NAMMs that transfer zero-shot across architectures by operating on attention patterns rather than token embeddings.
  • Privacy and governance gaps: as memory systems become more persistent and personal, there are no established mechanisms for memory governance, user consent, data minimization, or secure inheritance when models are upgraded. (affects: Dual-Process Adaptive Retrieval, RL-Optimized Memory Management, Segment-Level Memory with Compression)
    Potential fix: Constitutional memory architectures with immutable identity layers and formal inheritance protocols, combined with matroid-based data minimization for provable privacy guarantees.
  • Implicit preference capture: most memory systems are designed for explicit fact storage but struggle to identify and extract user preferences expressed indirectly through behavior patterns rather than explicit statements. (affects: Segment-Level Memory with Compression, Active Context Engineering (Pensieve Paradigm))
    Potential fix: RL-trained memory creation (PersonaMem-v2) that learns to detect and store implicit preferences, and adaptive retrieval (RF-Mem) that uses iterative reconstruction for ambiguous queries.

💡 Having established the broad landscape of memory organization challenges, we begin with the most fundamental approach: Linear Memory, which maintains a sequential, bounded buffer of experiences and learns when to discard or update entries to keep the agent's working memory compact and relevant.

🎯

Linear Memory

What: Linear memory organizes an agent's accumulated experience as a sequential, bounded buffer where entries are created, updated, or discarded over time, rather than stored in graphs or unbounded logs.

Why: As LLM agents tackle long-horizon, multi-turn tasks, unbounded context growth causes quadratic compute costs, performance degradation beyond training lengths, and hallucination from irrelevant history. Linear memory provides a principled way to maintain a compact, relevant working memory.

Baseline: The standard approach is full-context prompting, which appends all prior turns and observations to the prompt. This works for short interactions but fails as history grows, causing context overflow, increased latency, and degraded reasoning quality.

  • Deciding what to discard: identifying which memories are no longer relevant without losing information needed for future reasoning
  • Balancing compression and fidelity: summarizing or overwriting memory inevitably loses detail, risking the loss of critical facts
  • Training memory policies: supervising memory management decisions is difficult because ground-truth labels for 'what to remember' rarely exist
  • Generalization across tasks: memory strategies learned on one task distribution often fail to transfer to new domains or longer horizons

🧪 Running Example

❓ An LLM agent is helping a user with a week-long research project involving 200+ turns of conversation. The user asks: 'What was the key limitation of the protein-folding paper we discussed on Monday, and how does it relate to the dataset issue from Wednesday?'

Baseline: A full-context system attempts to include all 200+ turns in the prompt. The context exceeds the model's window, so early turns (including Monday's discussion) are truncated. The agent either hallucinates an answer or fails to connect the two discussions.

Challenge: The relevant information spans two specific turns separated by hundreds of irrelevant exchanges. The agent must have retained both pieces of information despite processing many unrelated turns in between, and must be able to retrieve and synthesize them on demand.

✅ RL-Optimized Memory Overwrite (MemAgent): MemAgent reads each turn and uses an RL-trained policy to decide whether to overwrite existing memory slots or keep them. It retains the Monday and Wednesday insights in its fixed-size buffer because they were identified as answer-critical, while discarding small-talk turns.
✅ One-Step Memory Consolidation (MEM1): MEM1 maintains a single evolving Internal State that is updated at every turn. By the time the user asks the question, the key findings from Monday and Wednesday have been consolidated into a compact state, and all intermediate context has been pruned.
✅ Summarization-Based Context Management (SUPO): SUPO periodically compresses the interaction history into summaries. Monday's and Wednesday's discussions are preserved in their respective summaries, keeping the working context within the model's window while retaining critical information.
✅ Post-Thinking Memory (TiM): Think-in-Memory stores pre-computed 'thoughts' (conclusions like 'Paper X has limitation Y') rather than raw conversation text. When the query arrives, it retrieves the relevant thoughts directly without needing to re-reason over the original dialogue.
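
The fixed-size overwrite buffer described for MemAgent above can be sketched as follows. This is a rough illustration under loud assumptions: `BoundedMemory` and `keyword_value` are hypothetical names, and the hand-written keyword scorer merely stands in for the RL-trained policy, which this sketch does not implement.

```python
from typing import Callable


class BoundedMemory:
    """Fixed-capacity buffer; a value function decides what gets overwritten."""

    def __init__(self, capacity: int, value_fn: Callable[[str], float]):
        self.capacity = capacity
        self.value_fn = value_fn      # stand-in for a learned overwrite policy
        self.slots: list[str] = []

    def observe(self, turn: str) -> None:
        if len(self.slots) < self.capacity:
            self.slots.append(turn)
            return
        # Buffer is full: find the least valuable slot and overwrite it only
        # if the incoming turn is judged more valuable.
        weakest = min(range(self.capacity), key=lambda i: self.value_fn(self.slots[i]))
        if self.value_fn(turn) > self.value_fn(self.slots[weakest]):
            self.slots[weakest] = turn


def keyword_value(turn: str) -> float:
    """Hypothetical heuristic: count task-related keywords in the turn."""
    keywords = ("limitation", "dataset", "paper")
    return float(sum(k in turn.lower() for k in keywords))


memory = BoundedMemory(capacity=2, value_fn=keyword_value)
for turn in [
    "Monday: the protein-folding paper has a key limitation in its training set",
    "Nice weather today",
    "Wednesday: the dataset has label noise",
    "Thanks, talk soon",
]:
    memory.observe(turn)
print(memory.slots)
```

With a capacity of two, the small-talk turns are either overwritten or never stored, so the Monday and Wednesday insights are the only entries left when the question arrives.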

📈 Overall Progress

The field shifted from static heuristic memory (append or FIFO eviction) to RL-trained policies that autonomously learn what to remember and forget, achieving 400x context extrapolation.

📂 Sub-topics

RL-Driven Memory Policies

7 papers

Using reinforcement learning to train agents to autonomously decide when to create, update, or discard memory entries, treating memory management as a sequential decision-making problem optimized through outcome rewards.

RL-Optimized Memory Overwrite, Atomic Memory Operations via GRPO, Self-Memory Policy Optimization

Memory Compression and Consolidation

5 papers

Techniques that compress accumulated interaction history into bounded, information-dense representations through summarization, thought extraction, or learned consolidation, maintaining constant memory usage regardless of input length.

Summarization-Based Context Management, Post-Thinking Memory Consolidation, Generative Semantic Workspace

Experience-Based Self-Improvement

5 papers

Systems that mine past agent execution trajectories to extract reusable procedural knowledge (strategies, recovery tips, rules), building a growing memory bank of actionable lessons from experience.

Trajectory Mining for Procedural Memory, Progressive Retrieval Augmented Generation

Neural Memory Architectures

6 papers

Architecture-level designs that augment Transformer models with explicit, differentiable memory modules using gated read/write mechanisms inspired by LSTMs or Hadamard operations.

Gated Neural Memory Banks, Hadamard Memory Framework, Language-Controlled Neural Memory

💡 Key Insights

💡 RL-trained memory policies consistently outperform static heuristics, enabling models to learn what to forget purely from task outcomes.

💡 Fixed-size memory buffers with learned overwrite policies can extrapolate from 8K training contexts to millions of tokens.

💡 Jointly optimizing memory extraction and management prevents noise accumulation that degrades performance over time.

💡 Storing pre-computed conclusions rather than raw text eliminates redundant re-reasoning and reduces retrieval costs.

💡 Dual-memory architectures (fast episodic + slow parametric) combine rapid adaptation with long-term generalization.

💡 Decomposing memory into atomic CRUD operations provides a flexible, learnable framework that scales to unseen context lengths.
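
To make the atomic-operation framing concrete, here is a minimal sketch of a memory store driven by a list of operations. The `MemoryOp` type, the `apply_ops` helper, and the `NOOP` kind are illustrative assumptions rather than the interface of Memory-R1 or AtomMem. In those systems a trained policy would emit the operations, whereas here they are written by hand.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class MemoryOp:
    kind: Literal["ADD", "UPDATE", "DELETE", "NOOP"]
    key: Optional[str] = None
    value: Optional[str] = None


def apply_ops(memory: dict[str, str], ops: list[MemoryOp]) -> dict[str, str]:
    """Return a new memory dict after applying the operations in order."""
    result = dict(memory)  # copy, so the caller's memory is not mutated
    for op in ops:
        if op.kind == "ADD" and op.key not in result:
            result[op.key] = op.value
        elif op.kind == "UPDATE" and op.key in result:
            result[op.key] = op.value
        elif op.kind == "DELETE":
            result.pop(op.key, None)
        # NOOP, or an ADD/UPDATE whose precondition fails, changes nothing.
    return result


before = {"city": "Paris", "old_job": "barista"}
after = apply_ops(before, [
    MemoryOp("ADD", "anniversary_restaurant", "new Italian place"),
    MemoryOp("UPDATE", "city", "Berlin"),
    MemoryOp("DELETE", "old_job"),
    MemoryOp("NOOP"),
])
print(after)
```

Keeping each operation small and explicit is what makes the action space learnable: a policy only has to choose an operation kind and its arguments at each step, independent of how long the underlying context is.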

📅 Timeline

Early work (2023) established dual-memory architectures and thought-based storage as alternatives to raw-text memory. By 2025, reinforcement learning emerged as the dominant training paradigm for memory management, enabling models to learn discard/update policies purely from task-outcome rewards. The latest work (2026) focuses on unifying memory extraction and management into jointly optimized frameworks with atomic operations.

2023-04 to 2023-12 Early foundations in episodic memory, dual-memory architectures, and personalization through interaction history
  • Two-Memory (Two-Memory, 2023) introduced a dual-system architecture combining fast episodic control with slow parametric RL, demonstrating that data sharing between the two systems accelerates learning
  • TAI (TAI, 2023) developed a teachable AI framework that learns user preferences from cold start through multi-turn seeker-provider interactions, achieving 97.4% turn-level accuracy
  • Think-in-Memory (TiM, 2023) pioneered storing pre-computed thoughts instead of raw text, with insert/forget/merge operations for memory maintenance
  • Talk2Drive (Talk2Drive, 2023) demonstrated the first LLM-based personalization memory in a real-world autonomous vehicle, reducing driver takeover by 75.9%
2024-01 to 2024-12 Architectural innovations in stable memory modules and experience-driven learning for embodied agents
  • HMF (HMF, 2024) introduced element-wise Hadamard products for numerically stable memory updates, achieving O(log t) processing via parallel prefix scan
  • P-RAG (P-RAG, 2024) demonstrated progressive self-improvement in embodied tasks by building a dynamic experience database from the agent's own interaction history
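
The O(log t) claim for Hadamard-style memory updates rests on the fact that an element-wise recurrence of the form m_t = a_t ⊙ m_{t-1} + b_t composes associatively, so all prefixes can be computed in logarithmically many parallel rounds. The sketch below illustrates that generic recurrence only. It is an assumption that HMF's update fits this form, and the function names are hypothetical.

```python
def combine(left, right):
    """Compose two steps of m -> a * m + b, applying `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    # right(left(m)) = a2 * (a1 * m + b1) + b2 = (a1 * a2) * m + (a2 * b1 + b2)
    return ([x * y for x, y in zip(a1, a2)],
            [y * p + q for y, p, q in zip(a2, b1, b2)])


def prefix_scan(steps):
    """Inclusive scan in ceil(log2 t) rounds (Hillis-Steele style)."""
    out = list(steps)
    offset = 1
    while offset < len(out):
        # Within one round every position is independent of the others,
        # so parallel hardware could update all of them at once.
        out = [combine(out[i - offset], out[i]) if i >= offset else out[i]
               for i in range(len(out))]
        offset *= 2
    return out


def states_via_scan(steps, m0):
    """All memory states m_1..m_t, computed from the scanned (a, b) pairs."""
    return [[a * m + b for a, m, b in zip(A, m0, B)] for A, B in prefix_scan(steps)]


def states_sequential(steps, m0):
    """Reference implementation: apply the recurrence one step at a time."""
    m, states = list(m0), []
    for a, b in steps:
        m = [ai * mi + bi for ai, mi, bi in zip(a, m, b)]
        states.append(m)
    return states


steps = [([0.5, 1.0], [1.0, 2.0]), ([2.0, 0.5], [0.0, 1.0]),
         ([1.0, 1.0], [3.0, -1.0]), ([0.5, 2.0], [1.0, 0.0]),
         ([1.0, 0.25], [2.0, 2.0])]
print(states_via_scan(steps, [4.0, 8.0]) == states_sequential(steps, [4.0, 8.0]))
```

The scan performs more total work than the sequential loop, but its depth grows with log t rather than t, which is the property that matters on parallel accelerators.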
2025-01 to 2025-12 RL-driven memory management emerges as the dominant paradigm, with breakthroughs in memory overwrite, consolidation, and summarization
  • LM2 (LM2, 2025) introduced a dual-stream memory Transformer with gated updates, outperforming RMT by 37.1% on BABILong while improving general reasoning on MMLU by 5.0%
  • MEM1 (MEM1, 2025) unified reasoning and memory consolidation into a single RL-trained step, achieving 3.5x performance gain with 3.7x memory reduction
  • MemAgent (MemAgent, 2025) extrapolated from 8K training context to 3.5M-token tasks with <5% loss using RL-trained memory overwrite
  • SUPO (SUPO, 2025) made summarization a learnable action within the RL training loop, achieving +14% success rate on BrowseComp-Plus with test-time scaling to 23 summaries
  • GSW (GSW, 2025) introduced a neuro-inspired generative semantic workspace that models memory as evolving probabilistic state spaces, outperforming HippoRAG2 by up to 20% in recall
2026-01 to 2026-03 Unified frameworks that jointly optimize memory extraction, management, and atomic operations, with emphasis on generalizability
  • AtomMem (AtomMem, 2026) decomposed memory management into atomic CRUD operations optimized via GRPO, scaling robustly to 800 documents (4x training size)
  • UMEM (UMEM, 2026) jointly optimized memory extraction and management using Semantic Neighborhood Modeling, achieving 82.84% Success Rate on ALFWorld with monotonic performance growth
  • MemPO (MemPO, 2026) introduced dual-reward RL for self-memory policy optimization, gaining +25.98% F1 while cutting token usage by 67.58%
  • Trajectory-Informed (Trajectory-Informed, 2026) achieved 149% relative improvement on AppWorld by extracting typed procedural knowledge from execution logs

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
| --- | --- | --- | --- |
| RL-Optimized Memory Overwrite | Treat memory management as a sequential decision-making problem where an RL-trained policy learns to overwrite a fixed-size buffer, retaining only task-critical information. | Full-context prompting and static memory heuristics (e.g., FIFO eviction or fixed summarization intervals) | MemAgent (2025), AtomMem (2026), MemPO (2026), MEM1 (2025) |
| Summarization-Based Context Management | Periodically compress interaction history into learnable summaries that preserve critical state information while keeping working context within model limits. | Naive context truncation (which loses early information) and full-context prompting (which exceeds window limits) | Scaling LLM Multi-turn RL with... (2025), Think-in-Memory (2023) |
| Trajectory Mining for Procedural Memory | Parse agent execution trajectories to extract typed procedural knowledge (strategies, error recoveries, optimizations) that is reused in similar future situations. | Stateless LLM agents that repeat the same errors and cannot reuse successful strategies across sessions | Trajectory-Informed (2026), Progressive Retrieval Augmented Generation for... (2024), CMMR-VLN (2026) |
| Gated Neural Memory Modules | Add an explicit memory matrix to the Transformer with learnable input/forget/output gates that control what information persists across long sequences. | Standard Transformer attention (which degrades over long contexts) and prior memory-augmented models like Recurrent Memory Transformer (RMT) | LM2 (2025), Stable Hadamard Memory (2024), Tell Me What To Learn:... (2026) |
| Unified Memory Extraction and Management | Jointly train memory extraction and management using semantic neighborhood modeling, ensuring each memory generalizes across similar future queries rather than overfitting to one instance. | Static memory extraction pipelines that treat summarization as a fixed preprocessing step, leading to noise accumulation | UMEM (2026), Enabling On-Device Large Language Model... (2023) |
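
The RL-Optimized Memory Overwrite entry above describes a fixed-size buffer whose contents are chosen by a learned policy. The following is a minimal Python sketch of that buffer mechanic only. The class `FixedSizeMemory` and the function `keyword_score` are hypothetical names, and the hand-written keyword heuristic is a stand-in assumption for the RL-trained policy; none of the cited papers is claimed to work this way in detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FixedSizeMemory:
    """Fixed-capacity buffer; `score` stands in for a learned overwrite policy."""
    capacity: int
    score: Callable[[str], float]
    slots: List[str] = field(default_factory=list)

    def write(self, entry: str) -> None:
        # Fill free slots first.
        if len(self.slots) < self.capacity:
            self.slots.append(entry)
            return
        # Buffer is full: locate the lowest-scoring slot and overwrite it
        # only when the new entry scores strictly higher.
        weakest = min(range(len(self.slots)),
                      key=lambda i: self.score(self.slots[i]))
        if self.score(entry) > self.score(self.slots[weakest]):
            self.slots[weakest] = entry


def keyword_score(entry: str) -> float:
    # Hypothetical heuristic (assumption): count task-relevant keywords.
    keywords = ("deadline", "password", "address")
    return float(sum(word in entry.lower() for word in keywords))


memory = FixedSizeMemory(capacity=2, score=keyword_score)
for chunk in [
    "User said hello.",
    "The deadline is Friday.",
    "User likes tea.",
    "Shipping address is 5 Elm St.",
]:
    memory.write(chunk)

print(memory.slots)
```

With this toy scorer the buffer ends up holding the address and deadline entries, whereas a FIFO policy of the same capacity would have evicted the deadline. An RL-trained policy would replace `keyword_score` with a scoring or overwrite decision learned from task reward.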

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
| --- | --- | --- | --- |
| RULER (512K tokens) | Accuracy | >95% | MemAgent (2025) |
| BABILong | Average Accuracy | +37.1% over RMT baseline | LM2 (2025) |
| ALFWorld | Success Rate | 82.84% | UMEM (2026) |

⚠️ Known Limitations (5)

  • Information loss from compression is irreversible — once a memory entry is overwritten or discarded, the original detail cannot be recovered, which can be catastrophic when the discarded information turns out to be relevant later. (affects: RL-Optimized Memory Overwrite, Summarization-Based Context Management, One-Step Memory Consolidation)
    Potential fix: Hierarchical memory with multiple compression levels (recent detail + older summaries) or reversible memory operations like SoLA's key-deletion approach
  • RL training for memory policies requires expensive rollouts over long sequences. The reward signal is typically sparse (only final task outcome), making credit assignment to individual memory decisions difficult. (affects: RL-Optimized Memory Overwrite, Atomic Memory Operations via GRPO, Self-Memory Policy Optimization)
    Potential fix: MemPO's dense step-level reward (measuring how much memory increases probability of correct answer) and SUPO's sub-trajectory gradient splitting both address sparse reward issues
  • Memory strategies learned on one task distribution often fail to generalize to new domains or significantly different context lengths, requiring retraining or domain-specific tuning. (affects: RL-Optimized Memory Overwrite, Trajectory Mining for Procedural Memory)
    Potential fix: UMEM's Semantic Neighborhood Modeling enforces generalization by evaluating memory quality across clusters of similar queries; AtomMem demonstrates robust scaling to 4x training context lengths
  • Most approaches are evaluated in single-agent, single-task settings. Scaling linear memory to multi-agent systems or concurrent tasks where memory must be shared or partitioned remains largely unexplored. (affects: RL-Optimized Memory Overwrite, Gated Neural Memory Modules, Feedback-Driven Personalization Memory)
    Potential fix: Chow-Liu ordering optimizes shared memory access order in multi-agent chains; future work could extend atomic memory operations to support concurrent read/write from multiple agents
  • Evaluation benchmarks for memory quality are limited — most papers evaluate end-task performance rather than directly measuring whether the memory content is optimal, making it hard to diagnose memory failures. (affects: RL-Optimized Memory Overwrite, Unified Memory Extraction and Management, Summarization-Based Context Management)
    Potential fix: MemPO's step-level memory quality reward provides a proxy for memory evaluation; future work could develop dedicated memory quality benchmarks

💡 While linear buffers provide a clean starting point, their single-tier design forces all memories to compete for the same limited capacity, motivating Layered Memory architectures that separate information into distinct tiers—such as core, semantic, and episodic—each with its own retention and retrieval granularity.

🔄 Layered Memory

What: Layered Memory research explores multi-tiered memory architectures for LLM agents that separate information into distinct layers—such as core/profile, semantic/factual, and episodic/temporal—with retrieval operating at session, turn, or topic granularity.

Why: Without structured memory layers, LLM agents lose coherence across extended interactions, cannot personalize responses to individual users, and fail to distinguish between recent context and long-term knowledge.

Baseline: The conventional approach uses a flat retrieval-augmented generation (RAG) pipeline that stores all past interactions as undifferentiated text chunks and retrieves them via semantic similarity search, regardless of memory type or temporal context.

  • Balancing memory granularity: too fine-grained (turn-level) fragments semantic topics, while too coarse (session-level) loses important details
  • Preventing memory degradation over time: older but important memories get buried by newer, less relevant ones without active consolidation and forgetting mechanisms
  • Integrating episodic (event-specific, temporal) and semantic (factual, stable) memories in a unified system that supports cross-type reasoning
  • Scaling memory systems efficiently: maintaining low latency and token costs as conversation history grows to hundreds of sessions

🧪 Running Example

❓ What did Alex say about wanting to move to Seattle last month? And didn't he mention something about his wife starting a new job there?

Baseline: A flat RAG system retrieves the top-k chunks most similar to 'Alex move Seattle,' but misses the wife's job mention because it appeared in a different session with different keywords. It also cannot reason about 'last month' because all chunks lack temporal indexing.

Challenge: The answer spans two separate conversation sessions (Alex's relocation plans and his wife's career update), requires temporal filtering ('last month'), and demands connecting two semantically distinct but narratively linked memories through the shared entity 'Alex.'

✅ Virtual Context Management (MemGPT): Pages relevant conversation sessions from external 'disk' storage into the LLM's active 'RAM' context, allowing it to access both the relocation and job discussions without context window limits.
✅ Episodic-Semantic Graph with Spreading Activation (Synapse): Links the 'Alex + Seattle' episodic memory to the 'wife + new job' memory through shared entity nodes in its graph. Spreading activation from 'Alex' propagates to connected nodes, surfacing both memories even though they are semantically distinct.
✅ Reflective Memory Management (RMM): Decomposes sessions into atomic topics (e.g., 'Alex's relocation' and 'wife's career'), enabling topic-level retrieval that captures both relevant topics. Its trained reranker prioritizes temporally relevant memories from 'last month.'
✅ Three-Stage Cognitive Memory (LightMem): Filters noise through sensory memory, groups both conversations under a shared topic in short-term memory, and consolidates the connection between Alex's move and his wife's job into long-term memory for efficient future retrieval.
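
The running example fails under flat RAG because chunks carry no timestamp and no entity index. A minimal sketch of an episodic layer that adds both is shown below. The `Episode` record, the `recall` function, the session IDs, and all dates are hypothetical illustrations and are not taken from any of the systems named above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Episode:
    session_id: str
    day: date
    entities: frozenset  # entity names mentioned in the session
    text: str


# Hypothetical episodic store (all contents are made up for illustration).
EPISODES = [
    Episode("s1", date(2025, 5, 3), frozenset({"Alex", "Seattle"}),
            "Alex wants to move to Seattle."),
    Episode("s2", date(2025, 5, 20), frozenset({"Alex", "wife"}),
            "Alex's wife is starting a new job."),
    Episode("s3", date(2025, 2, 11), frozenset({"Alex"}),
            "Alex adopted a dog."),
    Episode("s4", date(2025, 5, 9), frozenset({"Priya", "Seattle"}),
            "Priya visited Seattle."),
]


def recall(entity: str, start: date, end: date) -> List[Episode]:
    """Return episodes mentioning `entity` within [start, end], oldest first."""
    hits = [e for e in EPISODES
            if entity in e.entities and start <= e.day <= end]
    return sorted(hits, key=lambda e: e.day)


# "Last month" is resolved to an explicit date range before retrieval.
last_month = recall("Alex", date(2025, 5, 1), date(2025, 5, 31))
for episode in last_month:
    print(episode.session_id, episode.text)
```

Filtering on the shared entity and the date range surfaces both the relocation session and the wife's job session, even though their wording barely overlaps. A fuller system would combine this filter with semantic ranking inside the filtered set.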

📈 Overall Progress

The field evolved from treating memory as flat text retrieval to structured, multi-layered cognitive architectures with active forgetting, RL-driven evolution, and graph-based associative reasoning.

📂 Sub-topics

Episodic-Semantic Memory Integration

12 papers

Systems that explicitly separate and coordinate episodic memory (event-specific, temporally grounded) and semantic memory (factual, stable knowledge), enabling cross-type reasoning.

Spreading Activation, Generative Semantic Workspace, Unified Semantic-Episodic Benchmark

OS & Hierarchy-Inspired Memory Management

14 papers

Memory architectures modeled after operating system concepts (RAM/disk paging, segmented memory) or explicit multi-tier hierarchies (sensory, short-term, long-term).

Virtual Context Management, Segmented Paging, Pyramidal Memory

Cognitively-Inspired Memory Architectures

12 papers

Memory systems drawing from cognitive science theories—hippocampal indexing, Ebbinghaus forgetting curves, constructivist learning, and event segmentation—to design biologically plausible agent memory.

Hippocampal Graph Retrieval, Ebbinghaus Forgetting, Constructivist Memory, Gist Memory

Self-Evolving & Agentic Memory

10 papers

Memory systems where agents actively manage, curate, and improve their own memory through reinforcement learning, self-reflection, or experience-based optimization.

Memory-augmented MDP, Unified Memory Extraction, Hierarchical Graph Memory

Memory Retrieval Optimization

10 papers

Methods that improve how memories are accessed—through tool-augmented retrieval, associative graphs, adaptive reranking, or just-in-time synthesis—moving beyond static top-k similarity search.

Tool-Augmented Retrieval, Associative Memory Graph, JIT Memory Compilation, Reflective Retrieval

💡 Key Insights

💡 Layered memory architectures consistently outperform flat retrieval by separating stable knowledge from temporal events.

💡 Active forgetting mechanisms (Ebbinghaus-inspired decay) are essential to prevent memory pollution from outdated information.

💡 Graph-based associative retrieval discovers connections that vector similarity misses, especially for multi-hop reasoning.

💡 RL-optimized memory curation outperforms static rules, enabling agents to learn what to remember without LLM fine-tuning.

💡 Efficiency gains are dramatic: sensory filtering and sleep-time updates reduce token usage by 38-100x with comparable accuracy.

💡 Comprehensive benchmarks reveal 30-60% accuracy gaps between current systems and oracle retrieval, indicating substantial room for improvement.
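
The Ebbinghaus-inspired decay mentioned in the insights above can be sketched with the classic exponential retention form R = exp(-t / S), where t is the time since the last recall and S is a memory strength. The reinforcement rule used here (each recall adds 1 to S and resets t) and the pruning threshold are illustrative assumptions, and `MemoryItem` and `prune` are hypothetical names rather than any paper's API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0           # S: grows each time the item is recalled
    days_since_recall: float = 0.0  # t: time elapsed since the last recall

    def retention(self) -> float:
        # Ebbinghaus-style retention R = exp(-t / S).
        return math.exp(-self.days_since_recall / self.strength)

    def recall(self) -> None:
        # Assumed reinforcement rule: strengthen and reset the clock.
        self.strength += 1.0
        self.days_since_recall = 0.0


def prune(items: List[MemoryItem], threshold: float) -> List[MemoryItem]:
    """Keep only items whose retention is still at or above the threshold."""
    return [item for item in items if item.retention() >= threshold]


stale = MemoryItem("Old hotel booking reference",
                   strength=1.0, days_since_recall=5.0)
reinforced = MemoryItem("User is vegetarian",
                        strength=4.0, days_since_recall=5.0)

kept = prune([stale, reinforced], threshold=0.2)
print([item.text for item in kept])
```

Both items are equally old, but the one that was recalled more often has a higher strength and survives pruning. This is the sense in which decay removes outdated entries while preserving frequently used ones.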


📅 Timeline

Research shifted from simply extending context windows (2023) to designing biologically inspired memory hierarchies with built-in consolidation and forgetting (2024), and most recently toward efficient, self-evolving systems that learn to manage their own memory through reinforcement learning and agentic retrieval (2025-2026).

2023-05 to 2023-12 Foundation era: establishing core paradigms for LLM long-term memory
  • (MemGPT, 2023) introduced the OS-inspired virtual context management paradigm, achieving +60.4% accuracy on deep memory retrieval by treating context as RAM and databases as disk
  • (LongMem, 2023) proposed a decoupled frozen-backbone + SideNet architecture for stable long-term memory retrieval, achieving state-of-the-art on ChapterBreak
  • (MemoryBank, 2023) pioneered Ebbinghaus forgetting curve-based memory decay for natural memory attenuation in LLMs
  • (TiM, 2023) introduced storing pre-computed 'thoughts' instead of raw text, decoupling reasoning from retrieval
2024-01 to 2024-12 Neuroscience-inspired designs and systematic benchmarking
  • (ReadAgent, 2024) demonstrated human-inspired gist memory extending effective context by 3.5-20x while outperforming full-context baselines
  • (HippoRAG, 2024) mapped hippocampal pattern completion to knowledge graph retrieval, outperforming single-step methods by up to 20% on multi-hop QA while being 10-20x cheaper
  • (xLSTM, 2024) revived LSTM architecture with matrix memory and exponential gating, outperforming Mamba and Llama at 400M parameters
  • (LongMemEval, 2024) established the first comprehensive benchmark for long-term chat memory, revealing 30-60% accuracy gaps in state-of-the-art commercial systems
  • (Talk2Drive, 2024) demonstrated layered memory for personalized autonomous driving, reducing driver takeover rates by 75.9% in real-world field experiments
2025-01 to 2025-12 Mature systems: RL-driven evolution, reflective management, and unified taxonomies
  • (Memento, 2025) formalized memory-augmented MDPs with neural case selection, achieving top-1 on the GAIA benchmark with 87.88% Pass@3
  • (RMM, 2025) introduced bidirectional reflection—prospective topic decomposition and retrospective citation-based reranker training—improving LongMemEval accuracy by 10%
  • Two major surveys (Memory in the Age of AI, 2025; Operational Taxonomy, 2025) unified fragmented terminology with Forms-Functions-Dynamics and six atomic operations frameworks
  • (G-Memory, 2025) introduced three-tier hierarchical graph memory for multi-agent systems, improving ALFWorld success rate by 20.89%
  • (PersonaAgent, 2025) enabled test-time persona optimization through textual gradient loops, improving personalization by 5.7% on LaMP benchmarks
2026-01 to 2026-03 Efficiency breakthroughs and specialized memory for multimodal and web agents
  • (Synapse, 2026) unified episodic-semantic memory with spreading activation and lateral inhibition, reducing token consumption by 95% while achieving 40.5 F1 on LoCoMo
  • (MM-Mem, 2026) applied Fuzzy-Trace Theory to create pyramidal multimodal memory, achieving state-of-the-art 63.8% on EgoSchema while outperforming Gemini 1.5 Pro
  • (UMEM, 2026) jointly optimized memory extraction and management using semantic neighborhood modeling, achieving 82.84% on ALFWorld
  • (TA-Mem, 2026) transformed retrieval into an agentic task with multi-indexed tool selection, improving temporal QA by +7.02 F1 over Mem0

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
| --- | --- | --- | --- |
| Virtual Context Management | Apply operating system memory management principles (paging, segmentation, eviction) to manage LLM context as a virtual address space. | Fixed context window approaches that truncate or summarize old conversations | MemGPT (2023), MemoryOS (2025), LightMem (2025) |
| Neurocognitive Memory Architectures | Map specific brain regions and cognitive theories onto computational memory components to achieve biologically plausible knowledge management. | Flat vector stores with no forgetting or consolidation mechanisms | HippoRAG (2024), Nemori (2025), MemoryBank (2023), A Miniature Brain Transformer (2026) |
| Graph-Based Associative Retrieval | Replace static similarity search with dynamic graph traversal that discovers associative connections between memories. | Standard top-k vector retrieval that misses structurally linked but semantically distant memories | Synapse (2026), AssoMem (2025), The Generative Semantic Workspace: A... (2025) |
| Pyramidal Multi-Resolution Memory | Store information at multiple abstraction levels and retrieve top-down, expanding details on demand rather than processing everything upfront. | Single-resolution memory that either stores raw data (expensive) or summaries (lossy) | From Verbatim to Gist: Distilling... (2026), A Human-Inspired Reading Agent with... (2024), Enhancing Web Agents with a... (2026) |
| Self-Evolving Memory with Reinforcement Learning | Train a memory retrieval and curation policy via RL rewards, allowing the agent to learn from experience without fine-tuning the LLM. | Static memory management rules and fixed retrieval heuristics that cannot adapt | Memento (2025), UMEM (2026), G-Memory (2025) |
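
The Virtual Context Management entry above borrows paging from operating systems. A minimal sketch of that idea follows, assuming a least-recently-used eviction rule and a page budget counted in whole sessions. The class `VirtualContext` and its methods are hypothetical and do not mirror the actual interface of MemGPT or any other system listed.

```python
from collections import OrderedDict
from typing import Dict, List


class VirtualContext:
    """Paging sketch: `main` plays the context window, `archive` the external store."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages
        # Ordered from least recently used to most recently used.
        self.main: "OrderedDict[str, str]" = OrderedDict()
        self.archive: Dict[str, str] = {}

    def _evict_if_full(self) -> None:
        while len(self.main) > self.max_pages:
            # Page out the least recently used entry to external storage.
            page_id, text = self.main.popitem(last=False)
            self.archive[page_id] = text

    def add(self, page_id: str, text: str) -> None:
        self.main[page_id] = text
        self.main.move_to_end(page_id)
        self._evict_if_full()

    def page_in(self, page_id: str) -> str:
        if page_id in self.main:
            self.main.move_to_end(page_id)  # mark as recently used
        else:
            # Raises KeyError if the page was never stored.
            self.add(page_id, self.archive.pop(page_id))
        return self.main[page_id]

    def prompt(self) -> List[str]:
        """Pages that would currently be placed in the model's context."""
        return list(self.main.values())


ctx = VirtualContext(max_pages=2)
ctx.add("relocation", "Alex plans to move to Seattle.")
ctx.add("weather", "It rained on Tuesday.")
ctx.add("job", "Alex's wife starts a new job.")  # pages out "relocation"
ctx.page_in("relocation")                        # pages out "weather"
print(ctx.prompt())
```

After the page-in, the active context holds the job and relocation sessions while the unrelated weather session sits in the archive. In a real system the decision of which page to fetch would come from the LLM issuing a retrieval call rather than from a hard-coded ID.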

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
| --- | --- | --- | --- |
| LoCoMo | F1 / BLEU-1 | +49.11% F1 improvement over baselines | MemoryOS (2025) |
| LongMemEval | Accuracy | 70.4% accuracy | In Prospect and Retrospect: Reflective... (2025) |
| GAIA (General AI Assistants) | Pass@3 / Accuracy | 87.88% Pass@3 (validation), 79.40% (test) | Memento (2025) |

⚠️ Known Limitations (5)

  • Benchmark coverage gaps: Most benchmarks focus on factual recall and multi-hop QA, neglecting dynamic memory operations like updating, forgetting, and conflict resolution that are critical for real-world use. (affects: Virtual Context Management, Graph-Based Associative Retrieval, Neurocognitive Memory Architectures)
    Potential fix: Design benchmarks that explicitly test memory update, conflict resolution, and forgetting operations over extended timelines, as proposed by StructMemEval and LongMemEval.
  • Memory poisoning from bad experiences: Agents that naively store all past interactions accumulate incorrect examples that degrade future performance through 'experience-following' behavior where agents blindly copy retrieved outputs. (affects: Self-Evolving Memory with Reinforcement Learning, Virtual Context Management)
    Potential fix: Use strict trajectory evaluators or fine-tuned quality judges to gate memory additions, as demonstrated by regulated memory management achieving +32.4% improvement over add-all baselines.
  • Scalability vs. precision trade-off: Graph-based and hierarchical memory systems provide better retrieval quality but introduce higher construction and maintenance costs as memory grows to hundreds of thousands of interactions. (affects: Graph-Based Associative Retrieval, Neurocognitive Memory Architectures, Pyramidal Multi-Resolution Memory)
    Potential fix: Hybrid approaches combining lightweight online updates with offline consolidation (sleep-time processing), as implemented by LightMem and Memory Bear.
  • Inability to spontaneously recognize needed memory structures: LLMs struggle to identify when and how to organize memory hierarchically without explicit hints, even when provided with memory tools. (affects: Virtual Context Management, Self-Evolving Memory with Reinforcement Learning)
    Potential fix: Provide memory organization hints or use meta-cognitive prompting to guide agents in recognizing structural requirements before task execution.
  • Fragmented evaluation: No standardized comparison framework exists across different memory architectures, making it difficult to compare approaches that use different benchmarks, metrics, and LLM backends. (affects: Virtual Context Management, Graph-Based Associative Retrieval, Neurocognitive Memory Architectures, Self-Evolving Memory with Reinforcement Learning)
    Potential fix: Adopt unified evaluation protocols with standardized benchmarks (LoCoMo, LongMemEval) and controlled LLM backends for fair cross-method comparison.

💡 Layered architectures organize memories by type, yet still retrieve within each layer via flat similarity search, which misses the associative links between related memories—a limitation addressed by Tree/Graph-based Memory, which connects knowledge through entity, temporal, and causal edges to enable multi-hop reasoning.

🔍 Tree/Graph-based Memory

What: Tree/graph-based memory systems organize an agent's long-term knowledge as interconnected nodes in graphs or hierarchical trees, enabling associative retrieval that mirrors how humans hop between related memories by topic, entity, time, or causality.

Why: Standard vector-store retrieval treats memories as isolated items ranked by semantic similarity, missing structural relationships such as causal chains, temporal sequences, and entity connections that are critical for complex reasoning tasks.

Baseline: The conventional approach is flat Retrieval-Augmented Generation (RAG), which encodes text passages as dense vectors and retrieves the top-k most semantically similar chunks to augment an LLM's context window.

  • Similarity saturation: as memory grows, many items become semantically close, making top-k retrieval unreliable for distinguishing truly relevant memories
  • Cross-document integration: relevant evidence is often scattered across multiple documents or sessions, requiring multi-hop reasoning that flat retrieval cannot perform in a single step
  • Memory evolution: real-world knowledge changes over time, so memory structures must support efficient insertion, update, and deletion without full reconstruction
  • Balancing structure and cost: building and maintaining rich graph structures (entity extraction, relation inference) is expensive and must justify its overhead over simpler approaches

🧪 Running Example

❓ A personal assistant is asked: 'Why did I cancel my trip to Tokyo last month?' The answer requires connecting three separate memories: (1) a conversation about booking the trip, (2) a later message about a family emergency, and (3) an email cancelling the flight.

Baseline: Flat RAG retrieves the trip-booking conversation (highest semantic similarity to 'trip to Tokyo') but misses the family emergency message because it uses different vocabulary. The assistant cannot explain the reason for cancellation.

Challenge: The three memories are semantically distinct (travel planning, personal crisis, administrative action) but causally linked. Connecting them requires traversing entity links (user → trip → cancellation) and temporal ordering, which pure embedding similarity cannot capture.

✅ Knowledge Graph Indexing with PageRank Retrieval: HippoRAG-style systems extract entities ('Tokyo trip', 'family emergency', 'flight cancellation') into a knowledge graph and run Personalized PageRank from the query, spreading activation to reach the causally connected emergency memory even though it is semantically distant.
✅ Spreading Activation on Memory Graphs: Synapse links episodic memories (the three conversations) to semantic concepts ('travel', 'emergency', 'cancellation') and propagates activation from the query through the graph, naturally surfacing the emergency memory through its structural connection to the cancellation.
✅ Multi-Graph Memory Architecture: MAGMA maintains parallel semantic, temporal, causal, and entity graphs. For a 'why' query, it prioritizes causal edges, directly traversing from the cancellation event backward to the emergency that caused it.
✅ Self-Evolving Structured Memory: A-Mem's Zettelkasten-style system would have already linked these three notes via shared tags ('Tokyo', 'travel-plans') and auto-generated contextual connections, making the causal chain immediately retrievable.
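
The Personalized PageRank retrieval described for the Tokyo example can be sketched with a plain power iteration over a small hand-built graph. The edge list below is a hypothetical stand-in for the output of an entity-extraction step, the `mem:` prefix is an ad hoc naming convention, and the damping factor and iteration count are illustrative defaults. This is not HippoRAG's actual pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency map linking memory nodes and entity nodes."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def personalized_pagerank(graph: Dict[str, Set[str]], seeds: Set[str],
                          damping: float = 0.85,
                          iterations: int = 100) -> Dict[str, float]:
    # Restart distribution: all teleport mass goes to the query's seed nodes.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in graph}
        for node, neighbors in graph.items():
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        rank = nxt
    return rank


# Hypothetical extraction result: memories linked to the entities they mention.
EDGES = [
    ("mem:booking", "Tokyo trip"),
    ("mem:booking", "flight"),
    ("mem:cancellation", "flight"),
    ("mem:cancellation", "family emergency"),
    ("mem:emergency", "family emergency"),
    ("mem:recipe", "pasta"),  # unrelated distractor
]

graph = build_graph(EDGES)
scores = personalized_pagerank(graph, seeds={"Tokyo trip"})
ranked = sorted((n for n in graph if n.startswith("mem:")),
                key=scores.get, reverse=True)
print(ranked)
```

Seeding only the query entity still gives the emergency memory a non-zero score, because probability mass flows through the shared `flight` and `family emergency` nodes. The unrelated recipe memory stays at zero since it is disconnected from the seed.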

📈 Overall Progress

The field evolved from flat vector retrieval to richly structured, multi-layered memory graphs with biologically inspired dynamics like spreading activation, energy minimization, and self-evolving organization.

📂 Sub-topics

Knowledge Graph Memory with Graph Traversal

8 papers

Systems that extract entities and relations from text into knowledge graphs and use graph algorithms (PageRank, beam search, RL traversal) to retrieve interconnected memories for multi-hop reasoning.

HippoRAG, MAGMA, Mem0-graph, AssoMem

Hierarchical and Tree-Structured Memory

5 papers

Systems that organize memories into hierarchical trees via clustering or dependency analysis, enabling top-down retrieval from abstract summaries to specific details.

Embodied-RAG Semantic Forest, CAM, Chow-Liu Tree Ordering, HyMEM

Associative Memory Theory and Models

7 papers

Theoretical work connecting transformers, diffusion models, and in-context learning to classical associative memory frameworks like Hopfield networks, along with novel architectures built on these principles.

Memory Mosaics, Hopfield-ICL Equivalence, Entropic Associative Memory, Energy-Based Routing

Cognitive-Inspired Multi-Component Memory Architectures

6 papers

Systems inspired by cognitive science that decompose memory into multiple specialized stores (episodic, semantic, procedural) connected via graph structures, with biologically motivated mechanisms like sleep consolidation and spreading activation.

Synapse, Memory Bear, MIRIX, A-Mem

💡 Key Insights

💡 Graph structure enables multi-hop retrieval in a single pass, replacing expensive iterative chain-of-thought retrieval pipelines.

💡 Disentangling memory into semantic, temporal, causal, and entity layers dramatically improves intent-aligned retrieval for different query types.

💡 Self-evolving memory that merges, prunes, and rewrites entries outperforms static append-only stores, especially for long-horizon agents.

💡 Spreading activation on memory graphs surfaces structurally relevant but semantically distant memories that vector similarity misses.

💡 Associative memory theory (Hopfield networks) provides principled foundations for understanding and improving in-context learning and continual adaptation.

💡 Cognitive science frameworks (constructivism, ACT-R, hippocampal indexing) consistently inspire the most effective memory architectures.


📅 Timeline

Early work established theoretical connections between attention and associative memory (2023-2024), which catalyzed practical graph-based retrieval systems like HippoRAG. By 2025, the field shifted toward cognitive architectures with multiple specialized memory components and self-evolution capabilities. The latest work (2026) focuses on disentangling relationship types into parallel graph layers and applying neuroscience-inspired dynamics for more principled retrieval.

2023-11 to 2024-05 Theoretical foundations and early graph-based memory systems
  • (ICL-Hopfield, 2023) established the theoretical equivalence between self-attention and Hopfield network associative memory, providing a principled framework for understanding in-context learning
  • (VideoAgent, 2024) demonstrated structured dual-memory (temporal events + object tracking) for video understanding, achieving 26% accuracy improvement on long-form reasoning
  • (HippoRAG, 2024) introduced neurobiologically inspired graph retrieval using a knowledge graph index with Personalized PageRank, achieving up to 20% improvement on multi-hop QA while being 10-20x cheaper than iterative methods
  • (Memory Mosaics, 2024) proposed networks of associative memories as a transparent alternative to transformers, matching perplexity while offering interpretable compositional capabilities
2024-09 to 2025-04 Editable and dynamic graph memory for personalized agents
  • (Embodied-RAG, 2024) introduced the Semantic Forest memory structure for robots, building hierarchical trees 7x faster than GraphRAG and handling kilometer-level environments
  • (EMG, 2024) pioneered editable memory graphs with RL-driven traversal for personalized smartphone assistants, supporting dynamic insertion, deletion, and replacement of memories
  • (EHAM, 2024) extended entropic associative memory to hetero-associative tasks, achieving perfect recall of 40,000 associations in a single memory instance
  • (A-Mem, 2025) introduced Zettelkasten-inspired self-evolving memory with LLM-generated inter-note links, improving multi-hop reasoning by 192% over MemGPT
  • (Mem0, 2025) proposed dynamic memory management with graph enhancements for multi-session dialogue, reducing latency by 91% while improving personalization by 26%
2025-05 to 2025-12 Cognitive architectures, scaled associative memory, and multi-component systems
  • (Memory Mosaics v2, 2025) scaled associative memory networks to 10B parameters, outperforming transformers by 12-15% on multi-document QA tasks while matching performance on standard benchmarks
  • (MIRIX, 2025) deployed a six-component multi-agent memory architecture achieving 35% higher accuracy than RAG baselines while reducing storage by 99.9%
  • (CAM, 2025) applied Piaget's constructivist theory to agent memory with assimilation/accommodation mechanisms, running 4x faster than offline clustering baselines
  • (AssoMem, 2025) fused graph-based importance, semantic relevance, and temporal alignment via adaptive mutual information weighting, outperforming baselines by 24.93%
  • (GSW, 2025) modeled neocortical-hippocampal memory loops for episodic reasoning, outperforming HippoRAG2 by 20% in recall while reducing context tokens by 51%
2026-01 to 2026-03 Multi-graph disentanglement, energy-based routing, and generative memory workspaces
  • (MAGMA, 2026) introduced four disentangled graph layers (semantic, temporal, causal, entity) with intent-aware traversal, outperforming MemoRAG and Hi-Mem on long-context benchmarks
  • (Synapse, 2026) unified episodic-semantic memory with spreading activation and lateral inhibition, reducing token consumption by 95% while achieving state-of-the-art on LoCoMo
  • (Panini, 2026) replaced chunk-based retrieval with Generative Semantic Workspaces of atomic QA pairs and beam-search reasoning chains, achieving 5-7% gains over GraphRAG and HippoRAG
  • (HyMEM, 2026) introduced self-evolving hybrid memory with a VLM Judge for GUI agents, enabling a 7B model to surpass proprietary systems like Gemini-2.5-Pro
  • (Routing without Forgetting, 2026) applied Hopfield Pooling for energy-based associative routing in online continual learning, achieving 74.09% accuracy on Split-ImageNet-R with only 2.1% additional parameters

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
| --- | --- | --- | --- |
| Knowledge Graph Indexing with PageRank Retrieval | Build a knowledge graph as a hippocampal index and use activation-spreading algorithms to retrieve structurally connected memories that pure semantic similarity would miss. | Standard single-step dense retrieval (RAG) and expensive iterative retrieval methods (IRCoT) | HippoRAG (2024), AssoMem (2025), Panini (2026) |
| Multi-Graph Memory Architecture | Disentangle memory relationships into specialized graph layers so that retrieval can be steered by query intent rather than relying on a single similarity metric. | Monolithic knowledge graphs and single-vector-store memory systems | MAGMA (2026), Mem0 (2025), SGMem (2025) |
| Hierarchical Tree Memory with Top-Down Retrieval | Cluster memories into a navigable tree hierarchy so retrieval can start broad and zoom into relevant details, mimicking how humans organize knowledge from general to specific. | Flat retrieval over large memory pools and brute-force similarity search | Embodied-RAG (2024), CAM (2025), Chow–Liu Ordering for Long-Context Reasoning... (2026) |
| Self-Evolving Structured Memory | Let the LLM actively curate and evolve the memory graph rather than passively appending new entries, so the structure improves as the agent gains experience. | Append-only memory stores and static knowledge graphs that require manual curation | Agentic Memory (2025), Hybrid Self-evolving Structured Memory for... (2026), Crafting Personalized Agents through Retrieval-Augmented... (2024) |
| Spreading Activation on Episodic-Semantic Graphs | Replace static similarity ranking with dynamic energy propagation through a memory graph, so relevance is determined by structural connectivity rather than just vector distance. | Pure embedding-based retrieval and static graph-based retrieval without activation dynamics | Synapse (2026), The Generative Semantic Workspace: A... (2025) |
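
The Multi-Graph Memory Architecture entry above steers retrieval by query intent. A minimal sketch of that routing idea follows, using the Tokyo example. The layer contents, the `INTENT_TO_LAYER` keyword table, and the functions `pick_layer` and `traverse` are all hypothetical. In particular, first-word keyword routing is a deliberately crude stand-in for a learned intent classifier and is not taken from MAGMA.

```python
from typing import Dict, List

# One adjacency map per relation type; an edge points from an event to a
# related event (e.g. causal: effect -> cause, temporal: later -> earlier).
LAYERS: Dict[str, Dict[str, List[str]]] = {
    "causal": {"cancellation": ["emergency"]},
    "temporal": {"cancellation": ["emergency"], "emergency": ["booking"]},
    "entity": {"cancellation": ["booking"], "booking": ["cancellation"]},
}

# Assumed mapping from a question word to the layer that should be traversed.
INTENT_TO_LAYER = {"why": "causal", "when": "temporal",
                   "who": "entity", "what": "entity"}


def pick_layer(query: str) -> str:
    words = query.lower().split()
    first_word = words[0] if words else ""
    return INTENT_TO_LAYER.get(first_word, "entity")


def traverse(layer: str, start: str, max_hops: int = 2) -> List[str]:
    """Breadth-first walk restricted to a single relation layer."""
    edges = LAYERS[layer]
    visited = [start]
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in visited:
                    visited.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited


query = "Why did I cancel my trip to Tokyo last month?"
layer = pick_layer(query)
print(layer, traverse(layer, "cancellation"))
```

For the 'why' question the walk follows only the causal layer and returns the cancellation together with the emergency, skipping the booking. Routing the same start node through the temporal layer would instead return all three events in reverse chronological order.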

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
| --- | --- | --- | --- |
| LoCoMo | Weighted F1 / Accuracy | 40.5 Weighted F1 | Synapse (2026) |
| Multi-hop QA (MuSiQue / 2WikiMultiHopQA) | Recall@5 / F1 | Up to 20% improvement in R@5 | HippoRAG (2024) |
| MeetingQA / Similarity-Dense QA | Accuracy | 57.3% Accuracy | AssoMem (2025) |

⚠️ Known Limitations (5)

  • Graph construction overhead: extracting entities and relations via LLM calls is expensive (both in latency and cost), making it impractical for real-time or resource-constrained applications. (affects: Knowledge Graph Indexing with PageRank Retrieval, Multi-Graph Memory Architecture, Self-Evolving Structured Memory)
    Potential fix: MAGMA's dual-stream approach (fast path for immediate ingestion, slow path for asynchronous LLM-based densification) partially addresses this by decoupling latency-critical operations from expensive graph enrichment.
  • Scalability of graph traversal: as memory graphs grow to hundreds of thousands of nodes, traversal algorithms like Personalized PageRank or beam search become computationally expensive and may return noisy results. (affects: Knowledge Graph Indexing with PageRank Retrieval, Spreading Activation on Episodic-Semantic Graphs)
    Potential fix: Synapse uses lateral inhibition to suppress hub nodes, and region-based pruning (as in MAGMA) can limit traversal scope. Hierarchical tree approaches naturally reduce search space via top-down navigation.
  • Evaluation fragmentation: there is no unified benchmark for tree/graph-based memory, making it difficult to compare methods fairly. Different papers report on different subsets of benchmarks with varying metrics. (affects: Knowledge Graph Indexing with PageRank Retrieval, Hierarchical Tree Memory with Top-Down Retrieval, Spreading Activation on Episodic-Semantic Graphs)
    Potential fix: AssoMem introduced MeetingQA for similarity-dense scenarios, and LoCoMo has emerged as a common benchmark. Standardization around multi-dimensional evaluation (single-hop, multi-hop, temporal reasoning) would help.
  • Error propagation in knowledge extraction: LLM-based entity and relation extraction is imperfect, and errors in the graph structure (missing edges, wrong relations) cascade into retrieval failures that are hard to diagnose. (affects: Knowledge Graph Indexing with PageRank Retrieval, Self-Evolving Structured Memory, Multi-Graph Memory Architecture)
    Potential fix: HyMEM's VLM Judge approach (deciding add/merge/replace based on information gain) and Memory Bear's sleep-based consolidation offer self-correction mechanisms, but robust error detection in memory graphs remains an open problem.
  • Domain specificity: most systems are validated on text-based QA or dialogue tasks. Transfer to multimodal domains (video, robotics, GUI interaction) requires substantial architectural adaptation. (affects: Hierarchical Tree Memory with Top-Down Retrieval, Associative Memory Networks (Hopfield-Inspired))
    Potential fix: Embodied-RAG and HyMEM demonstrate that hybrid spatial-semantic clustering and visual embedding integration can bridge this gap, but general-purpose multimodal memory architectures remain underexplored.

💡 Where graph-based approaches store knowledge externally and retrieve it on demand, Memory Internalization takes a fundamentally different path by encoding personal memories and domain facts directly into model parameters through LoRA adapters or embedding injection, trading retrieval flexibility for faster inference and implicit pattern capture.

Memory Internalization

What: Memory internalization refers to techniques that store knowledge—personal memories, domain facts, or experiential data—directly into a language model's parameters, rather than relying on external retrieval or extended context windows.

Why: Parametric memory enables faster inference without retrieval latency, preserves user privacy by keeping data within local model weights, and can capture nuanced behavioral patterns that retrieval-based methods miss when context is noisy or irrelevant.

Baseline: The conventional approach uses Retrieval-Augmented Generation (RAG), which stores information externally and injects relevant passages into the prompt at inference time, or relies on large context windows to process user history directly.

Challenges:
  • Catastrophic forgetting: injecting new memories into model parameters often overwrites previously stored knowledge
  • Scalability: maintaining separate adapters or memory modules for each user or memory unit becomes resource-intensive as the number of memories grows
  • Random access: language models can reproduce memorized information sequentially but struggle to access specific facts from arbitrary positions in stored memory
  • Evaluation: measuring whether memories are truly internalized versus superficially memorized, and whether general capabilities are preserved after internalization

🧪 Running Example

❓ A user asks their personal assistant: 'What was the name of that Italian restaurant I said I wanted to try after my trip to Rome last month?'

Baseline: A standard LLM has no record of past conversations and cannot answer. A basic RAG system must search through all stored conversation logs, potentially retrieving irrelevant restaurant mentions or failing when the user's phrasing differs from the stored text.

Challenge: The assistant must recall a specific personal detail mentioned once in a past conversation, distinguish it from other restaurant mentions, and associate it with the temporal context of 'last month' and 'Rome trip'—requiring both precise memory storage and flexible retrieval from parameters.

✅ Per-User LoRA Adapters (OPPU): A personal LoRA adapter trained on this user's conversation history encodes the restaurant preference directly into model weights, enabling the model to recall the detail without external search or context window limits.
✅ Latent-Space Memory Pool (MemoryLLM): The restaurant mention is compressed into memory tokens within the transformer's hidden states when the original conversation occurs, and retrieved via attention when the user asks about it later.
✅ Per-Memory LoRA with Gating (MEGa): The conversation containing the restaurant mention is stored in its own frozen LoRA module with a context key; the query activates this specific module through gated routing, avoiding interference from other stored memories.
✅ RAG-Tuned-LLM: The user's conversation logs are used to synthesize fine-tuning data with entity-relationship extraction, teaching the model to internalize the relationship 'user → wants to try → Trattoria da Enzo → after Rome trip' as parametric knowledge.
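
The gated-routing idea behind per-memory adapters can be sketched in a few lines. This is an illustrative sketch only, not the MEGa or SoLA code: the class `GatedMemoryBank`, the hard top-1 routing rule, the cosine threshold, and the use of opaque strings in place of real LoRA weights are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class GatedMemoryBank:
    """Stores one frozen adapter per memory, selected by a context key."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold   # minimum similarity to open the gate
        self.modules = {}            # memory_id -> (context_key, adapter)

    def add(self, memory_id, context_key, adapter):
        # Adding a memory never touches previously stored adapters.
        self.modules[memory_id] = (context_key, adapter)

    def delete(self, memory_id):
        # Removing the key makes the memory unreachable (a reversible edit).
        self.modules.pop(memory_id, None)

    def route(self, query_embedding):
        """Return the id of the best-matching memory, or None (base model)."""
        best_id, best_score = None, self.threshold
        for memory_id, (key, _adapter) in self.modules.items():
            score = cosine(query_embedding, key)
            if score >= best_score:
                best_id, best_score = memory_id, score
        return best_id

bank = GatedMemoryBank()
bank.add("restaurant", [1.0, 0.0, 0.0], "lora_restaurant")  # placeholder adapter
bank.add("dentist", [0.0, 1.0, 0.0], "lora_dentist")        # placeholder adapter
selected = bank.route([0.9, 0.1, 0.0])
```

Because each memory lives in its own module, adding or deleting one entry leaves the others untouched; the cost is that storage grows with the number of memories.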

📈 Overall Progress

The field evolved from static retrieval interpolation (kNN-LM, 2020) to dynamic, architecture-integrated memory systems that can continuously internalize, route, and even reverse knowledge updates without forgetting.

📂 Sub-topics

LoRA-Based Personal Memory

8 papers

Uses Low-Rank Adaptation (LoRA) adapters to store user-specific or memory-specific knowledge directly in model parameters, enabling personalized responses without modifying the base model.

Per-User LoRA (OPPU), Per-Memory LoRA with Gating (MEGa), Semantic Routing LoRA (SoLA), Memory-Injected LoRA (MiLP)

Latent-Space Memory Architectures

6 papers

Embeds memory directly into the transformer's latent space as trainable vectors or generated tokens, allowing the model to read and write memories through its own attention mechanism.

Latent Memory Pool (MemoryLLM), Scalable Long-Term Memory (M+), Generative Latent Memory (MemGen), Autoencoder Memory

Retrieval-to-Parameter Knowledge Transfer

5 papers

Converts external retrieval-based knowledge (datastores, document collections) into model parameters through distillation or fine-tuning, combining the precision of retrieval with the speed of parametric inference.

kNN-LM, Memory Decoder, RAG-Tuned-LLM, Memory3 Explicit Memory

Lifelong Model Editing and Continual Memory

5 papers

Focuses on sequentially updating model parameters with new knowledge while minimizing catastrophic forgetting, enabling models to accumulate memories over long lifetimes of operation.

Sparse Residual Memory (MEMOIR), Explicit Read-Write Memory (MemLLM), Local Classifier Alignment (LCA)

💡 Key Insights

💡 Per-memory isolation (one frozen LoRA per fact) dramatically reduces catastrophic forgetting compared to shared-parameter fine-tuning.

💡 Combining parametric memory (LoRA) with non-parametric retrieval (RAG) consistently outperforms either approach used alone.

💡 Latent-space memory pools can self-update through attention, but they hit capacity limits that require offloading evicted memory to CPU alongside a retriever.

💡 Language models access memorized information sequentially; random access to specific stored facts remains a fundamental bottleneck.

💡 Generative memory that reconstructs context on demand outperforms static retrieval by producing task-specific cognitive context.

💡 Distilling retrieval into small parametric decoders enables plug-and-play domain adaptation across model scales with minimal latency.

📅 Timeline

Research shifted from treating memory as external retrieval (2020) to embedding it within model parameters via LoRA and latent pools (2024), and most recently toward scalable, forgetting-resistant architectures with gated routing, generative memory, and sparse residual approaches (2025–2026).

2020-11 to 2023-12 Foundational work on retrieval-augmented memorization and early LoRA-based personalization
  • kNN-LM (2020) established that interpolating nearest-neighbor retrieval with model outputs achieves state-of-the-art perplexity, demonstrating the power of explicit memory access without parameter updates.
  • DPeM (2023) introduced dual-process memory with LoRA for medical assistant personalization, combining biologically-inspired memory tiers (working, short-term, long-term) with parameter-efficient fine-tuning.
  • Talk2Drive (2023) demonstrated the first end-to-end LLM-based personalization system in real-world autonomous driving, using a memory module that reduced driver takeover rates by 75.9%.
2024-01 to 2024-12 Emergence of per-user LoRA, latent-space memory pools, and fundamental understanding of parametric memory limitations
  • MemoryLLM (2024) pioneered embedding a 1B-parameter memory pool directly within transformer layers, enabling self-updating knowledge injection with +13.6% accuracy on model editing benchmarks.
  • OPPU (2024) formalized the one-PEFT-per-user paradigm, achieving state-of-the-art across all 7 LaMP personalization tasks by combining parametric and non-parametric knowledge.
  • Memory3 (2024) introduced a three-tier memory hierarchy (text → sparse KV pairs → parameters) that enabled a 2.4B model to outperform Baichuan2-7B while being 1.66x faster than RAG.
  • Random Access (2024) revealed a fundamental limitation: LMs can reproduce memorized content sequentially but fail at random access, identifying a critical bottleneck for parametric memory.
  • MemLLM (2024) pioneered training LLMs to generate explicit read/write API calls to structured memory, making memory operations interpretable.
2025-01 to 2025-12 Scalable memory architectures, generative memory, and overcoming forgetting in lifelong editing
  • M+ (2025) extended MemoryLLM's retention from 20k to 160k+ tokens by offloading evicted memory to CPU with a co-trained retriever, addressing the capacity bottleneck of latent-space memory.
  • RAG-Tuned-LLM (2025) demonstrated that GraphRAG-derived synthetic data can internalize document knowledge into a 7B model, achieving a 77.2% win rate over vanilla RAG on global queries.
  • MEGa (2025) introduced per-memory LoRA with gated activation, maintaining >90% recall after 50 sequential memory injections while standard baselines collapsed to <10%.
  • MemGen (2025) proposed generative latent memory with a metacognitive trigger, achieving +31.7% improvement on ALFWorld with strong cross-domain transfer from math to science and code.
  • MEMOIR (2025) introduced sparse residual memory with TopHash retrieval, sustaining reliable editing through 15,000 sequential updates where all prior methods degraded.
  • Memory Decoder (2025) distilled kNN retrieval into a plug-and-play 0.5B decoder that adapts LLMs up to 72B parameters with only 1.28x latency overhead.
2026-01 to 2026-03 Reversible editing, continual learning alignment, and architectural memory integration
  • SoLA (2026) introduced semantic routing over frozen LoRA modules, enabling fully reversible model edits by simply deleting routing keys without affecting other stored knowledge.
  • LCA (2026) addressed the classifier-backbone mismatch problem in continual learning by aligning classifiers to merged PEFT modules using synthetic Gaussian samples, leading on 7 benchmarks.
  • RfR (2026) formalized reflective MDPs where agents internalize experience through self-generated linguistic feedback and preference-based fine-tuning, outperforming both RL and prompt-based memory agents.
  • Autoencoder Memory (2026) showed that autoencoder-trained embeddings achieve >99% memory reconstruction accuracy, vastly outperforming causal model embeddings (20-60%) for information retention.

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Per-User LoRA Adapters | Give every user their own tiny set of trainable parameters that personalize a shared frozen model, making personalization modular and privacy-preserving. | Retrieval-augmented personalization (RAG), which fails when retrieved history is noisy or irrelevant, and profile-based prompting, which is limited by context window size. | Democratizing Large Language Models via... (2024), LLM-based (2023), On the Way to LLM... (2024), Parameterized Memory-injected LLM Personalization (2024)
Per-Memory LoRA with Gated Routing | Isolate each memory in its own frozen LoRA module and use query-driven routing to selectively activate relevant memories, largely avoiding catastrophic forgetting. | Standard LoRA fine-tuning, which overwrites previous knowledge when trained on new data sequentially, and shared adapter methods that suffer from semantic drift. | MEGa (2025), Reversible Lifelong Model Editing via... (2026)
Latent-Space Memory Pools | Add a bank of trainable vectors inside the transformer that the model can read and update through attention, creating a self-contained memory system within the architecture. | External retrieval systems (RAG) that require separate infrastructure, and context-window approaches that are limited by fixed input length. | MemoryLLM (2024), M+: Extending MemoryLLM with Scalable... (2025), Adaptive Loops and Memory in... (2026)
Generative Latent Memory | Generate memory tokens dynamically through a separate module that reconstructs relevant context only when the reasoning process needs it, mimicking human recall as active reconstruction. | Static retrieval-based memory (which returns fixed passages) and direct parameter updates (which cause forgetting), by dynamically synthesizing task-relevant memories. | MemGen (2025)
Retrieval-to-Parameter Distillation | Compress the knowledge in a retrieval datastore into model weights so the model can produce retrieval-quality answers without actually performing retrieval at inference time. | Standard RAG, which incurs latency from nearest-neighbor search and requires maintaining large external datastores at inference time. | Generalization Through Memorization (2020), Memory Decoder (2025), Tuning LLMs by RAG Principles:... (2025), Memory3 (2024)
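
For orientation, the retrieval interpolation that kNN-LM introduced, and that distillation methods aim to reproduce without a datastore, can be sketched as below. The mixture follows the standard kNN-LM form p(y) = λ·p_kNN(y) + (1 − λ)·p_LM(y); the function name, the dictionary-based inputs, and the default `lam` and `temperature` values are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def knn_lm_interpolate(p_lm, neighbours, lam=0.25, temperature=1.0):
    """Mix a language-model distribution with a nearest-neighbour one.

    p_lm:       dict token -> probability from the base model.
    neighbours: list of (distance, next_token) pairs from a datastore lookup.
    lam:        interpolation weight given to the kNN distribution.
    """
    if not neighbours:
        return dict(p_lm)  # nothing retrieved: fall back to the base model
    # Softmax over negative distances: closer neighbours get more weight.
    weights = [math.exp(-d / temperature) for d, _ in neighbours]
    total = sum(weights)
    p_knn = defaultdict(float)
    for w, (_, token) in zip(weights, neighbours):
        p_knn[token] += w / total
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn[t] + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

p_lm = {"Enzo": 0.2, "Mario": 0.8}
neighbours = [(0.0, "Enzo"), (0.0, "Enzo")]  # both retrieved contexts end in "Enzo"
mixed = knn_lm_interpolate(p_lm, neighbours, lam=0.5)
```

Retrieval-to-parameter distillation replaces the `neighbours` lookup with a trained component, so the latency of the nearest-neighbor search disappears at inference time.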

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
LaMP Personalization Benchmark | Various (MAE, Accuracy, ROUGE-1) | +17.38% relative MAE improvement on LaMP-3 | Democratizing Large Language Models via... (2024)
Sequential Knowledge Retention | Recall Cosine Similarity / QA Accuracy | >90% recall cosine similarity after 50 sequential tasks | MEGa (2025)
Lifelong Model Editing (zsRE / SCOTUS) | Edit Reliability Rate (ERR) / Edit Success | High reliability maintained after 15,000 sequential edits | MEMOIR (2025)

⚠️ Known Limitations (5)

  • Catastrophic forgetting persists in sequential updates: even with LoRA, injecting many memories sequentially causes older memories to degrade unless each memory is fully isolated, which increases storage costs linearly. (affects: Per-User LoRA Adapters, Latent-Space Memory Pools)
    Potential fix: Per-memory isolation (MEGa, SoLA) and sparse residual updates (MEMOIR) mitigate forgetting but at the cost of linear storage growth per memory unit.
  • Scalability of per-memory modules: approaches creating a separate LoRA or memory entry per fact become resource-intensive as memory counts grow to thousands or millions, limiting real-world deployment. (affects: Per-Memory LoRA with Gated Routing, Explicit Structured Memory with API Access)
    Potential fix: Memory compression, hierarchical routing, and shared adapter pools could reduce per-memory overhead while maintaining isolation benefits.
  • Random access bottleneck: models can reproduce memorized content from the beginning but struggle to access specific facts at arbitrary positions, limiting the utility of parametric memory for precise fact lookup. (affects: Per-User LoRA Adapters, Latent-Space Memory Pools, Retrieval-to-Parameter Distillation)
    Potential fix: Training with permuted sentence order and recitation-based inference partially address this, but a general solution for arbitrary random access remains open.
  • Evaluation gaps: most work evaluates on synthetic or narrow benchmarks (fictional characters, curated QA pairs), and it remains unclear how well methods generalize to realistic, open-ended personalization at scale. (affects: Per-User LoRA Adapters, Per-Memory LoRA with Gated Routing, Latent-Space Memory Pools)
    Potential fix: Development of comprehensive personalization benchmarks with realistic multi-turn conversations and long-term memory requirements over months of interaction.
  • General capability degradation: fine-tuning for memory internalization can reduce performance on general NLP tasks (MMLU, commonsense reasoning), creating a memorization-generalization trade-off. (affects: Per-User LoRA Adapters, Retrieval-to-Parameter Distillation, Self-Reflective Parameter Updates)
    Potential fix: Freezing base model weights and using modular adapters (OPPU, MEGa) or residual memory layers (MEMOIR) helps preserve general capabilities while adding new knowledge.

💡 Regardless of whether memories are stored in external structures or internalized into parameters, unbounded accumulation eventually degrades performance, which is why Memory Consolidation and Compression develops methods for summarizing, deduplicating, and distilling memories into compact representations that preserve essential information.

Memory Consolidation and Compression

What: Memory consolidation and compression encompasses methods for summarizing, compressing, deduplicating, and distilling memories so that LLM-based agents can maintain compact, relevant memory stores across long-horizon interactions without exceeding context window limits.

Why: As LLM agents engage in extended conversations, multi-step reasoning, and document analysis, raw interaction history grows unboundedly, degrading performance and increasing cost. Effective memory consolidation enables agents to retain critical information while discarding redundancy, making long-term personalized AI assistants practical.

Baseline: The conventional approach simply appends all past interaction turns to the prompt (full-context prompting), which causes linear memory growth, increased latency, and performance degradation when context length exceeds training limits. Alternatively, naive retrieval-augmented generation (RAG) treats memory as a flat vector store with no lifecycle management.

Challenges:
  • Deciding what to keep and what to discard: compression inevitably loses some information, and the system must learn which details are critical for future tasks
  • Maintaining coherence across compression steps: repeated summarization can cause semantic drift, hallucination, or loss of causal and temporal relationships
  • Balancing latency and quality: online compression must be fast enough for real-time interaction, while thorough reorganization requires expensive offline processing
  • Scaling to diverse memory types: systems must handle episodic conversations, factual knowledge, procedural skills, and multimodal data under a unified compression framework

🧪 Running Example

❓ What did my friend Sarah say about her new job when we talked two weeks ago?

Baseline: A standard LLM with a fixed context window only retains the most recent few sessions. The conversation from two weeks ago has been truncated, so the model responds 'I don't have information about that conversation.' Full-context prompting with all 50 past sessions would require ~100K tokens, exceeding limits and slowing inference.

Challenge: The relevant detail ('Sarah mentioned her new marketing role at Acme Corp') is buried in one of dozens of past sessions. Simple keyword retrieval may miss it if the user never said 'job' explicitly, and storing all raw transcripts is infeasible. The system must have compressed past sessions intelligently enough to retain this personal detail while discarding small talk.

✅ Virtual Context Management (MemGPT): Treats past sessions as 'disk storage' and pages in the relevant session when the agent detects a memory-dependent query, retrieving Sarah's job details on demand without keeping all history in the active context.
✅ Gist-Based Summarization (ReadAgent): Each past session was compressed into a short gist during reading. The gist 'Sarah discussed her new marketing role at Acme Corp and relocation plans' enables quick lookup and can be expanded to raw text if more detail is needed.
✅ RL-Optimized Consolidation (MEM1): After each session, the model updates a single consolidated internal state via reinforcement learning, retaining key personal facts like Sarah's job change while pruning irrelevant chitchat, keeping memory usage constant.
✅ Lightweight Cognitive Memory (LightMem): Sensory filtering removes low-value tokens during ingestion, and offline sleep-time consolidation organizes personal facts by entity (Sarah → job, location), enabling fast retrieval with 100x fewer tokens than raw storage.
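
The paging behaviour described for virtual context management can be sketched with a small two-tier store. This is a toy model under stated assumptions, not MemGPT's implementation: the class `VirtualContext`, the whitespace token count, least-recently-added eviction, and keyword matching in place of an LLM-issued memory call are all hypothetical simplifications.

```python
from collections import OrderedDict

class VirtualContext:
    """Two-tier memory: a bounded active context plus an unbounded archive."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.active = OrderedDict()  # session_id -> text, oldest first ("RAM")
        self.archive = {}            # session_id -> text ("disk")

    @staticmethod
    def _tokens(text):
        return len(text.split())     # crude proxy for a real tokenizer

    def _used(self):
        return sum(self._tokens(t) for t in self.active.values())

    def add(self, session_id, text):
        # Evict the oldest sessions to the archive until the new one fits.
        while self.active and self._used() + self._tokens(text) > self.max_tokens:
            old_id, old_text = self.active.popitem(last=False)
            self.archive[old_id] = old_text
        self.active[session_id] = text

    def page_in(self, keyword):
        """Move archived sessions mentioning `keyword` back into context."""
        hits = [sid for sid, text in self.archive.items()
                if keyword.lower() in text.lower()]
        for sid in hits:
            self.add(sid, self.archive.pop(sid))
        return hits

vc = VirtualContext(max_tokens=8)
vc.add("s1", "Sarah started a marketing role")
vc.add("s2", "We discussed weekend hiking plans")  # forces s1 out to the archive
paged = vc.page_in("sarah")                        # brings s1 back, evicting s2
```

Nothing is deleted in this scheme; sessions only move between tiers, which is what lets the agent answer about a conversation that no longer fits in the active window.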

📈 Overall Progress

Memory consolidation has shifted from static heuristic-based compression to learned, RL-optimized policies that jointly train task performance and memory management as a unified objective.

📂 Sub-topics

Gist and Summarization-Based Compression

5 papers

Methods that compress verbose conversation history or document content into compact natural-language summaries (gists), preserving key semantic content while dramatically reducing token count.

Gist Memory, Pyramidal Memory Distillation, Sleep-Time Consolidation, Summarization Policy Optimization

OS-Inspired Hierarchical Memory Management

4 papers

Architectures that borrow operating system concepts (virtual memory, paging, context switching, lifecycle management) to manage LLM memory across fast and slow storage tiers.

Virtual Context Management, Agent OS Kernel, Memory Operating System

RL-Trained Memory Consolidation

3 papers

Approaches that use reinforcement learning to teach models what information to retain, compress, or discard, optimizing memory management as a learned policy rather than a fixed heuristic.

Self-Memory Policy Optimization, 1-Step Consolidation, Summarization-Augmented Policy Optimization

Representation-Level Compression

3 papers

Methods that compress memory at the representation level, including dynamic KV-cache merging, matrix-form memory cells, and visual rendering of code to reduce token counts.

Dynamic Memory Compression, Extended LSTM, Visual Code Compression

Bio-Inspired and Cognitive Memory Architectures

3 papers

Systems that draw from neuroscience and cognitive science (hippocampal consolidation, episodic-semantic memory separation, Ebbinghaus forgetting curves) to design memory consolidation mechanisms.

Brain-Region Decomposition, Episodic-Semantic Consolidation, Elastic Memory Orchestration

💡 Key Insights

💡 RL-trained memory consolidation consistently outperforms fixed heuristics, with models learning task-specific compression strategies that balance recall and efficiency.

💡 Sleep-time offline consolidation (asynchronous reorganization between sessions) is a recurring pattern that dramatically reduces online latency.

💡 The OS memory hierarchy analogy (RAM/disk paging) has become a foundational paradigm, adopted by MemGPT, AIOS, MemOS, and IronEngine.

💡 Gist-based compression can paradoxically improve accuracy over full context by filtering distracting information that degrades attention.

💡 Multi-graph memory structures that disentangle temporal, causal, and semantic relationships enable intent-aware retrieval far superior to flat vector stores.

💡 Bio-inspired forgetting mechanisms (Ebbinghaus decay, active pruning) are essential for preventing unbounded memory growth in long-lived agents.
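
A decay-based forgetting rule of the kind mentioned above can be sketched as follows. The exponential form R = exp(−t/S) is the commonly cited approximation of the Ebbinghaus curve; the field names `last_access` and `strength`, the pruning threshold, and the day-based time unit are assumptions, and none of this is taken from a specific system's code.

```python
import math

def retention(elapsed_days, strength):
    """Estimated retention R = exp(-t / S) for a memory of strength S."""
    return math.exp(-elapsed_days / strength)

def prune(memories, now, threshold=0.3):
    """Keep only memories whose estimated retention is still above threshold.

    memories: list of dicts with hypothetical keys 'id', 'last_access'
              (day number) and 'strength' (larger values decay more slowly).
    """
    kept = []
    for memory in memories:
        r = retention(now - memory["last_access"], memory["strength"])
        if r >= threshold:
            kept.append(memory)
    return kept

memories = [
    {"id": "sarah_job", "last_access": 0, "strength": 10.0},   # rehearsed often
    {"id": "small_talk", "last_access": 0, "strength": 1.0},   # mentioned once
]
survivors = prune(memories, now=5)
```

A system would typically raise `strength` each time a memory is recalled, so frequently used facts survive while one-off chatter fades.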

📅 Timeline

The field evolved from MemGPT's foundational OS-inspired memory paging (2023) through production-grade cognitive architectures with sleep-time consolidation (2025), converging in 2026 toward RL-trained memory policies, structured multi-graph memory, and bio-inspired architectures that unify compression with reasoning.

2023-10 to 2024-05 Foundation of OS-inspired memory and early compression techniques
  • MemGPT (2023) pioneered virtual context management, treating the LLM context window as RAM and external storage as disk, achieving +60.4% accuracy on deep memory retrieval.
  • ReadAgent (2024) introduced human-inspired gist memory that extends effective context by up to 20x while outperforming retrieval baselines by 31.98% ROUGE-L on NarrativeQA.
  • DMC (2024) introduced dynamic KV-cache compression with per-head learned merge decisions, achieving 350-390% throughput gains on H100 GPUs.
  • AIOS (2024) extended the OS paradigm to multi-agent systems with syscall-based memory access and 2.1x throughput improvement.
  • xLSTM (2024) revived LSTMs with exponential gating and matrix memory, outperforming Transformers on 99.5% of text domains in PALOMA.
2025-05 to 2025-12 Emergence of RL-trained consolidation and production-grade memory systems
  • MemOS (2025) formalized memory as a first-class OS resource with MemCube containers and automatic format transitions.
  • MEM1 (2025) demonstrated that RL can train models to maintain a single evolving internal state, improving performance 3.5x while reducing memory 3.7x.
  • SUPO (2025) jointly optimized task-solving and summarization via RL, achieving +14.0% success rate on BrowseComp-Plus with test-time scaling to 23 summary steps.
  • LightMem (2025) introduced sensory filtering and sleep-time consolidation, reducing token usage by 38x while improving accuracy by 29.3% on LoCoMo.
  • Memory Bear (2025) implemented active forgetting via Ebbinghaus decay curves and offline sleep-based memory reorganization.
2026-01 to 2026-03 Convergence of structured memory, bio-inspired architectures, and multimodal consolidation
  • MAGMA (2026) introduced multi-graph memory with four parallel relationship graphs and intent-aware retrieval, outperforming MemoRAG and Hi-Mem on LoCoMo.
  • LongCodeOCR (2026) replaced textual code compression with visual rendering, improving CompScore by 36.85 points while reducing compression latency from hours to minutes.
  • MemPO (2026) achieved +25.98% F1 gain using dual-reward RL that measures memory quality by its impact on answer correctness.
  • MM-Mem (2026) achieved state-of-the-art 63.8% on EgoSchema with pyramidal multimodal memory using fuzzy-trace theory and entropy-driven retrieval.
  • AutoAgent (2026) unified evolving cognition with elastic memory, compressing history into episodic abstractions and reusable skills.

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Gist-Based Memory Summarization | Replace raw text with compact, human-readable summaries at multiple abstraction levels, enabling 3-20x effective context expansion while preserving decision-critical information. | Full-context prompting (appending all raw history) and naive truncation (dropping oldest turns) | A Human-Inspired Reading Agent with... (2024), From Verbatim to Gist: Distilling... (2026), LightMem (2025), Memory Bear (2025)
Virtual Context Management | Apply OS concepts like virtual memory, paging, and context switching to give LLMs the illusion of unlimited memory within a fixed context window. | Fixed context windows with no memory management and stateless session-based interactions | MemGPT (2023), AIOS (2024), MemOS (2025), IronEngine (2026)
RL-Optimized Memory Consolidation | Train models via reinforcement learning to proactively manage their own memory, treating 'what to remember' as a learnable decision optimized for task success. | External memory modules with fixed heuristics and prompt-based summarization without task-aligned optimization | MEM1 (2025), Scaling LLM Multi-turn RL with... (2025), MemPO (2026)
Dynamic KV-Cache Compression | Replace the fixed 'always append' KV-cache update with a learned decision to either append or merge new tokens into existing cache slots, achieving 4-8x compression with minimal quality loss. | Standard KV-cache that grows linearly with sequence length and grouped query attention (GQA) | Dynamic Memory Compression (2024), xLSTM: Extended Long Short-Term Memory (2024)
Multi-Graph Structured Memory | Disentangle memory relationships into separate typed graphs so that retrieval can prioritize the right relationship type (temporal, causal, or semantic) based on query intent. | Monolithic vector stores that rely solely on semantic similarity for retrieval | MAGMA (2026), AutoAgent (2026)
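
The append-or-merge update in the Dynamic KV-Cache Compression row can be sketched as below. This is a simplified illustration: in the described method the merge decision is learned per head, whereas here it is passed in as a boolean, and the function name, slot layout, and importance-weighted average are assumptions made for the example.

```python
def update_cache(cache, new_key, merge, importance=1.0):
    """Append a key vector to the cache, or merge it into the last slot.

    cache:      list of [key_vector, accumulated_weight] slots (mutated in place).
    new_key:    key vector for the incoming token.
    merge:      True to fold the token into the last slot, False to append.
    importance: weight of the incoming token in the running average.
    """
    if merge and cache:
        key, weight = cache[-1]
        total = weight + importance
        # Weighted running average keeps the slot count unchanged.
        merged = [(weight * k + importance * n) / total
                  for k, n in zip(key, new_key)]
        cache[-1] = [merged, total]
    else:
        cache.append([list(new_key), importance])
    return cache

cache = []
update_cache(cache, [2.0, 0.0], merge=False)  # first token opens a slot
update_cache(cache, [4.0, 2.0], merge=True)   # second token is averaged in
update_cache(cache, [1.0, 1.0], merge=False)  # third token opens a new slot
```

Three tokens end up in two slots here; the compression ratio of a real system depends on how often the learned policy chooses to merge.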

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
LoCoMo | Accuracy | 29.3% improvement over baselines | LightMem (2025)
NarrativeQA (Gutenberg) | ROUGE-L | 31.98% ROUGE-L improvement over retrieval baselines | A Human-Inspired Reading Agent with... (2024)
EgoSchema | Accuracy | 63.8% | From Verbatim to Gist: Distilling... (2026)

⚠️ Known Limitations (5)

  • Information loss during compression is difficult to predict: critical details may be discarded during summarization, and there is no reliable way to know what was lost until the information is needed later. (affects: Gist-Based Memory Summarization, RL-Optimized Memory Consolidation, Dynamic KV-Cache Compression)
    Potential fix: Hierarchical gist systems (like ReadAgent and MM-Mem) mitigate this by allowing drill-down from summaries to raw text, and RL-based approaches learn to retain task-relevant details.
  • Evaluation is fragmented across different benchmarks and domains, making it hard to compare methods fairly. Most papers evaluate on different tasks with different metrics, and no universal memory consolidation benchmark exists. (affects: Gist-Based Memory Summarization, Virtual Context Management, RL-Optimized Memory Consolidation)
    Potential fix: Standardized benchmarks like LoCoMo and LongMemEval are emerging but still limited in scope; the field needs unified evaluation protocols covering conversation, reasoning, and multimodal memory tasks.
  • Scalability to truly long-lived agents (months or years of interactions) remains unproven. Most evaluations cover hours to days of interaction, and it is unclear if compression strategies degrade gracefully over much longer horizons. (affects: Virtual Context Management, Gist-Based Memory Summarization, Multi-Graph Structured Memory)
    Potential fix: Active forgetting mechanisms (Memory Bear's Ebbinghaus decay) and automatic memory format transitions (MemOS) are early steps toward long-horizon memory lifecycle management.
  • Hallucination risk during consolidation: when models generate summaries or compress memories, they may introduce fabricated details or subtly alter facts, especially under aggressive compression ratios. (affects: Gist-Based Memory Summarization, RL-Optimized Memory Consolidation)
    Potential fix: MM-Mem's entropy-driven retrieval drills down to raw data when uncertainty is high; SUPO's joint optimization teaches the model to write faithful summaries by directly penalizing downstream task failures.
  • Computational overhead of memory management itself can be significant: maintaining multiple graphs, running offline consolidation, or performing RL training adds complexity that may negate throughput gains for smaller deployments. (affects: Multi-Graph Structured Memory, Bio-Inspired Memory Architecture, RL-Optimized Memory Consolidation)
    Potential fix: LightMem's sensory filtering reduces overhead by 100x at test time, and IronEngine's hash-based deduplication provides a lightweight alternative to full model-based consolidation.
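
Hash-based deduplication, mentioned above as a lightweight alternative to model-based consolidation, can be sketched as follows. The normalization rule (lowercasing and collapsing whitespace) and the function names are assumptions; the text does not describe how any particular system hashes its memories, and this version only catches exact duplicates after normalization, not paraphrases.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())

def deduplicate(memories):
    """Return memories with normalized exact duplicates removed, order kept."""
    seen = set()
    unique = []
    for text in memories:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

raw = ["Sarah got a new job", "sarah  got a new JOB", "Sarah moved to Austin"]
compact = deduplicate(raw)
```

The check costs one hash per memory and no model calls, which is why it suits smaller deployments where running an LLM over every stored item is too expensive.
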

💡 While Memory Organization tackles the fundamental question of how to structure stored knowledge for efficient access, the real test of any memory architecture is whether it enables accurate recall when users need it—which is precisely what Memory Recall research evaluates through increasingly sophisticated benchmarks spanning conversational QA, multi-modal retrieval, and temporal reasoning.

Memory Recall

What: This topic covers methods for retrieving, managing, and utilizing information from past interactions, stored contexts, and long-term memory in LLM-based systems to answer user recall questions and support personalized, context-aware generation.

Why: As LLM agents handle increasingly complex, multi-session tasks, effective memory recall is essential for maintaining coherence, personalizing responses, and enabling users to retrieve information from their interaction histories.

Baseline: Conventional approaches either stuff the entire conversation history into the context window (which is limited and expensive) or use simple vector-similarity retrieval over stored interactions, which often misses nuanced or structurally complex queries.

  • Scaling memory across long interaction histories without exceeding context window limits or losing critical details
  • Retrieving the right information when queries require reasoning over multiple past events rather than simple keyword matching
  • Balancing memory fidelity with privacy, as stored interactions may contain sensitive personal information
  • Evaluating memory capabilities reliably, since existing benchmarks often test only shallow retrieval rather than complex recall

🧪 Running Example

❓ A user asks their AI assistant: 'What was the name of that Thai restaurant I told you about after my trip to Portland last summer, and did I say I'd go back?'

Baseline: A standard LLM with vector-similarity retrieval searches for 'Thai restaurant Portland' and returns the most semantically similar stored interaction. It might retrieve a conversation about Thai food in general but miss the specific Portland trip context, or fail entirely if the conversation was months ago and the memory has been evicted.
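
The baseline above can be made concrete with a minimal top-k cosine-similarity retriever. The three-dimensional "embeddings" are hand-picked toy values (an assumption for illustration, not output from a real embedding model), chosen so that the generic Thai-food memory outranks the Portland-trip memory, which is the failure mode described.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=1):
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy axes (hypothetical): (thai food, portland, travel).
store = [
    ("Pad thai is my favourite Thai dish", [0.9, 0.0, 0.1]),
    ("Back from Portland, loved a Thai place there", [0.5, 0.6, 0.6]),
]
query = [1.0, 0.2, 0.0]  # "Thai restaurant Portland", dominated by the food axis
result = top_k(query, store, k=1)  # surfaces the generic Thai-food memory
```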

Challenge: This query requires multi-hop recall (linking a trip, a restaurant recommendation, and a sentiment), temporal reasoning ('last summer'), and the ability to search across a potentially large history of interactions spanning months.

✅ Causality Graph Retrieval (AMA-Agent): Stores interactions as a causality graph preserving state transitions, so it can trace the chain from 'Portland trip' → 'restaurant mention' → 'sentiment expressed' to retrieve the complete answer.
✅ Mood-Congruent Memory Retrieval (Emotional RAG): Considers the emotional context of the original conversation (enthusiasm about the restaurant) alongside semantic similarity, making it more likely to surface the right memory when sentiment is a key distinguishing factor.
✅ Predictive Hierarchical Caching (PerCache): Proactively caches anticipated recall queries during idle time, so when the user asks about past restaurant recommendations, the answer is already precomputed and returned with minimal latency.
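
The chain-tracing idea behind causality graph retrieval can be sketched as a path search over a directed graph of stored events. The node names, the adjacency-dictionary layout, and the use of breadth-first search are assumptions made for this example; the source does not specify how AMA-Agent represents or traverses its graph.

```python
from collections import deque

# Hypothetical memory graph: each node is a stored event, each edge a "led to" link.
edges = {
    "portland_trip": ["restaurant_mention"],
    "restaurant_mention": ["sentiment_expressed"],
    "sentiment_expressed": [],
    "lunch_chat": [],
}


def trace_chain(graph, start, goal):
    """Breadth-first search returning the first path from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


chain = trace_chain(edges, "portland_trip", "sentiment_expressed")
```

Unlike a flat similarity lookup, the returned path carries every intermediate event, so the answer can include both the restaurant and the stated intention to return.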

📈 Overall Progress

Memory recall has evolved from static test-time retrieval to structured, value-aware agent memory systems with dedicated evaluation frameworks.

📂 Sub-topics

Memory-Augmented Model Architectures

5 papers

Methods that integrate explicit memory mechanisms directly into transformer architectures to extend effective context and improve recall during generation.

Compress & Attend Transformer, Slow-Fast Inference, TRIME, Geometry-Aware Memory Scaling

Agent Memory Systems

4 papers

Long-term memory architectures for autonomous LLM agents that store, organize, and retrieve information from past agent-environment interactions.

AMA-Agent Causality Graph, Agentic Plan Caching, Q-Memory

Memory-Driven Personalization

4 papers

Methods that leverage stored user interaction histories and profiles to deliver personalized outputs, recommendations, and emotionally consistent responses.

Retrieval-Augmented Personalization, Multi-Agent Collaborative Memory, Mood-Congruent Retrieval

Context Compression & Efficient Caching

4 papers

Techniques for reducing the computational and memory costs of processing long contexts by compressing, caching, or selectively attending to stored information.

Selective Block Compression, Predictive Hierarchical Caching, Differential Subspace Steering, Entropy-Aware Parallel Encoding

Memory Evaluation & Benchmarks

4 papers

Frameworks, benchmarks, and simulators for systematically evaluating how well LLM systems can store, retrieve, and reason over information in memory.

Bayesian-Causal Data Synthesis, Programmable Atomic Memory Tests, AMA-Bench

Privacy & Safety in Memory

2 papers

Research on privacy risks arising from LLM memory systems and tools for auditing what personal information models can recall or infer.

Memory Extraction Attack (MEXTRA), LMP2 Privacy Probe

💡 Key Insights

💡 Flat vector-similarity retrieval underperforms on agent memory tasks; structured representations such as causality graphs are needed for complex recall.

💡 Even frontier models like GPT-4o struggle with composite memory tasks requiring state tracking and multi-hop reasoning.

💡 Attention patterns remain stable within semantic spans, enabling dramatic speedups through slow-fast decoding strategies.

💡 Emotional and contextual signals significantly improve memory retrieval beyond pure semantic similarity.

💡 Proactive cache population during idle time outperforms reactive caching for mobile and latency-sensitive applications.

💡 Joint training of memory representations with the language model substantially outperforms test-time-only memory injection.


📅 Timeline

Early work focused on integrating memory into model training (TRIME) and establishing personalization benchmarks (LaMP). The field then shifted toward context efficiency (compression, caching) and rigorous evaluation frameworks, before converging on structured agent memory systems that preserve causal relationships and adapt retrieval strategies based on task utility.

2022-11 to 2023-11 Foundations of memory-augmented training and personalization benchmarks
  • (TRIME, 2022) introduced joint training of language models with in-batch memory, reducing WikiText-103 perplexity from 18.70 to 15.37 and outperforming test-time-only approaches like kNN-LM
  • (LaMP, 2023) established the first comprehensive personalization benchmark for LLMs with 7 diverse tasks, showing retrieval augmentation improves output quality by +23.5% over non-personalized baselines
  • (ARM-RAG, 2023) proposed storing successful reasoning chains as retrievable memory with obfuscation-based retrieval, improving math problem solving by +4.2% on GSM8K
2024-01 to 2024-12 Context efficiency techniques and memory evaluation infrastructure
  • (Entropy-Aware, 2024) identified attention entropy as the root cause of parallel context encoding failures and proposed shared attention sinks to restore recall accuracy from ~0% to near 100%
  • (Selective Compression, 2024) preserved key information as raw tokens while compressing tool documentation at up to 16× ratio without performance loss
  • (MemSim, 2024) introduced Bayesian-causal data synthesis for reliable memory evaluation, achieving >99% ground truth correctness while revealing that GPT-4 still struggles with aggregative and multi-hop recall
  • (Emotional RAG, 2024) incorporated mood-congruent retrieval into role-playing agents, improving MBTI personality accuracy from 59.74% to 67.53%
2025-01 to 2025-12 Agent memory systems, predictive caching, and adaptive architectures
  • (PerCache, 2025) introduced predictive hierarchical caching for mobile RAG, reducing end-to-end latency by 34.4% through proactive query generation during idle time
  • (Memory Framework, 2025) decomposed memory into atomic capabilities, revealing that even GPT-4o drops to ~45% accuracy on composite recall tasks like Theory of Mind
  • (APC, 2025) shifted from query-level to task-level plan template caching, cutting agent costs by 50.31% and latency by 27.28% while preserving 96.6% performance
  • (CAT, 2025) matched dense transformer quality while being 1.4–3× faster and 2–9× more memory efficient via parallel chunk compression with test-time adaptivity
2026-01 to 2026-03 Structured agent memory, differential steering, and long-horizon benchmarks
  • (AMA-Bench, 2026) revealed that existing memory systems significantly underperform on agentic tasks, with its AMA-Agent outperforming the strongest baselines by 11.16% via causality graph retrieval
  • (EvoKernel, 2026) used Q-value-driven memory retrieval to boost NPU kernel correctness from 11% to 83%, demonstrating emergent cross-task memory transfer
  • (Prism-Δ, 2026) introduced dual-channel differential subspace steering for prompt highlighting, achieving +10.6% relative gain over the best prior baseline (SEKA)
  • (SFI, 2026) achieved up to 14.4× throughput improvement via slow-fast decoding that refreshes sparse caches only at semantic boundaries

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Memory-Augmented Language Modeling | Train language models to jointly optimize context representations with external memory lookups, rather than adding memory only at test time. | Standard transformer language models limited to fixed context windows, and test-time-only memory methods like kNN-LM that suffer from representation misalignment. | Training Language Models with Memory... (2022), Compress & Attend Transformer (2025), Addressing Hallucinations in LLMs with... (2024)
Sparse Attention & Slow-Fast Decoding | Decouple generation into frequent low-cost steps using a fixed sparse cache and rare dense steps that refresh memory at natural semantic boundaries. | Full-KV attention baselines that redundantly recompute attention over the entire growing history at every decoding step. | Slow-Fast Inference (2026), Attention Entropy is a Key... (2024)
Structured Agent Memory | Replace flat similarity-based memory retrieval in agents with structured representations (graphs, templates, value functions) that capture causal and logical dependencies. | Standard vector-similarity retrieval (RAG) and semantic caching, which lose causal structure and fail on machine-generated, symbol-heavy agent logs. | AMA-Bench (2026), Agentic Plan Caching (2025), Towards Cold-Start Drafting and Continual... (2026)
Retrieval-Augmented Personalization | Personalize LLM outputs by retrieving and contextualizing relevant memories from a user's history, using richer signals than simple text similarity. | One-size-fits-all LLM generation that ignores individual user histories, and basic RAG systems that use only semantic similarity for retrieval. | LaMP (2023), Emotional RAG (2024), ARAG (2025)
Context Compression & Selective Caching | Preserve key information (names, parameters, critical spans) in raw form while aggressively compressing descriptive or redundant content into compact representations. | Full-context baselines that waste compute on redundant information, and naive compression methods that lose critical details like parameter names. | Concise and Precise Context Compression... (2024), PerCache (2025), Prism-Δ: Differential Subspace Steering for... (2026)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
WikiText-103 (Language Modeling) | Perplexity (lower is better) | 15.37 | Training Language Models with Memory... (2022)
AMA-Bench (Agent Memory) | Average Accuracy | 57.22% | AMA-Bench (2026)
LaMP (Personalization) | Relative Average Improvement over non-personalized baselines | +23.5% | LaMP (2023)

⚠️ Known Limitations (5)

  • Most memory evaluation benchmarks focus on text-only recall, leaving multi-modal memory (images, audio, video from past interactions) largely untested, which limits our understanding of how well systems handle visual or auditory recall. (affects: Memory Evaluation Frameworks, Retrieval-Augmented Personalization)
    Potential fix: Extending evaluation frameworks to incorporate multi-modal interaction logs and cross-modal retrieval tasks, as hinted by Context-as-Memory's video-based approach.
  • Privacy risks scale with memory capability: more effective memory systems store and can leak more sensitive personal information, creating a fundamental tension between utility and privacy. (affects: Structured Agent Memory, Retrieval-Augmented Personalization)
    Potential fix: Differential privacy mechanisms for memory storage, user-controlled memory deletion, and periodic privacy audits using tools like LMP2.
  • Structured memory approaches (causality graphs, plan templates) require domain-specific schema design, making them difficult to generalize across diverse agent applications without significant engineering effort. (affects: Structured Agent Memory, Context Compression & Selective Caching)
    Potential fix: Automated schema induction from interaction logs, or hybrid approaches that combine structured and unstructured memory as in AMA-Agent's tool-augmented retrieval.
  • Context compression methods trade off fidelity for efficiency, and the optimal compression ratio varies significantly across tasks, requiring careful tuning that may not transfer between applications. (affects: Context Compression & Selective Caching, Sparse Attention & Slow-Fast Decoding)
    Potential fix: Adaptive compression that dynamically adjusts ratios based on task requirements, as demonstrated by CAT's test-time chunk size adaptivity.
  • Over-reliance on AI memory systems may cause cognitive atrophy in users, reducing their own ability to recall and reason about information they have offloaded to the AI. (affects: Retrieval-Augmented Personalization, Memory-Augmented Language Modeling)
    Potential fix: Designing memory interfaces that encourage active user engagement rather than passive consumption, such as prompting users to recall before revealing stored information.

💡 With the memory recall problem framed, we first examine Sparse Memory QA, where the core challenge is locating and aggregating relevant information that is thinly scattered across a large memory store, often requiring multi-hop reasoning over fragmentary evidence.

🔗

Sparse Memory QA

What: Sparse Memory QA addresses the challenge of answering questions when relevant information is distributed thinly across stored memories or knowledge representations, requiring selective retrieval and aggregation of scattered evidence.

Why: Real-world knowledge is inherently fragmented—personal memories, entity facts, and contextual details are spread across many sources, making it critical for systems to locate and combine sparse signals accurately.

Baseline: Standard language models encode knowledge in dense parameters and retrieve it implicitly during generation, often hallucinating when the needed fact is rare or absent; simple retrieval-augmented approaches concatenate retrieved passages but struggle when evidence must be assembled from multiple sparse sources.

  • Locating the few relevant memory entries among a large pool of stored information, especially when queries use vague temporal or spatial cues
  • Aggregating evidence across multiple sparse memory fragments to compose a coherent answer, rather than relying on a single retrieved passage
  • Scaling memory access efficiently so that adding more stored knowledge does not proportionally increase computation cost

🧪 Running Example

❓ What was the name of the Italian restaurant I visited near the conference venue last Tuesday?

Baseline: A standard retrieval-augmented LLM embeds the query and retrieves the top-k most semantically similar memory entries. It may return memories about Italian food or conferences in general but miss the specific visit because no single memory explicitly states all details together, and it cannot resolve 'last Tuesday' to a concrete date.

Challenge: The answer depends on combining at least two sparse memories—a photo taken at a restaurant (with an OCR-readable sign) and a calendar entry showing a conference on that date—while correctly interpreting the vague time reference 'last Tuesday.'

✅ End-to-End Memory Networks (MemN2N): Performs multiple 'hops' of soft attention over memory slots, progressively refining which memories are relevant. The first hop might locate conference-related memories; the second hop narrows to restaurant visits near that date.
✅ Entities as Experts (EAE): Maintains a dedicated learned embedding for each entity (e.g., 'Trattoria Roma,' 'NeurIPS 2024'), so when the query mentions a restaurant near a conference, the model can directly activate the relevant entity experts rather than scanning all parameters.
✅ Pensieve (Task-Oriented Memory Augmentation): Pre-augments each memory image with OCR text, captions, and timestamps offline, then uses a multi-signal retriever that explicitly scores date matching ('last Tuesday'), location proximity, and semantic similarity—directly addressing vague temporal cues.
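
The multi-hop reading described for MemN2N above can be sketched in a few lines: each hop scores every memory slot against the current query state, reads a softmax-weighted sum of the output memories, and adds it back to the state. This is a simplified illustration that omits the learned embedding matrices and the final answer projection of the actual model. The toy vectors are hand-picked assumptions so that attention moves from the conference memory on the first hop to the restaurant memory on the second.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def memory_hop(u, mem_in, mem_out):
    """One hop: attend over input memories, read a weighted sum of output memories."""
    p = softmax([dot(u, m) for m in mem_in])
    o = [sum(p_i * c[d] for p_i, c in zip(p, mem_out)) for d in range(len(u))]
    return [u_d + o_d for u_d, o_d in zip(u, o)], p


def multi_hop(u, mem_in, mem_out, hops=2):
    """Apply several hops, returning the final state and each hop's attention."""
    attentions = []
    for _ in range(hops):
        u, p = memory_hop(u, mem_in, mem_out)
        attentions.append(p)
    return u, attentions


# Toy axes (hypothetical): (conference, restaurant, other).
mem_in = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
mem_out = [[0.0, 3.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
state, attn = multi_hop([1.0, 0.0, 0.0], mem_in, mem_out, hops=2)
```

Keeping separate input and output memories is what lets the first read redirect the query: the conference slot is matched on the conference axis but contributes a vector pointing toward restaurant-related slots.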

📈 Overall Progress

Research evolved from supervised memory reading to fully differentiable sparse memory architectures that scale to millions of entities and personal multimodal memories.

💡 Key Insights

💡 Multi-hop attention over memory enables iterative reasoning that single-pass retrieval cannot achieve.

💡 Sparse entity-specific memory access can match or outperform models 10x larger in parameter count.

💡 Fixed vocabulary-based routing outperforms learned dynamic routing for knowledge-intensive tasks.

💡 Multimodal memory QA benefits greatly from offline metadata augmentation and multi-signal retrieval.

💡 Reducing supervision requirements (from labeled supporting facts to end-to-end training) dramatically broadens applicability.


📅 Timeline

Early work focused on making memory access differentiable (MemN2N); subsequent research embedded entity-specific knowledge into sparse model components (EAE, MoWE); the latest work extends sparse memory QA to multimodal personal settings with structured retrieval signals (Pensieve).

2015-03 to 2015-03 Foundational memory-augmented architectures with end-to-end training
  • End-to-End Memory Networks (MemN2N, 2015) introduced continuous multi-hop attention over external memory, eliminating the need for strong supervision and achieving 3.2% mean error on bAbI QA tasks
2020-11 to 2020-11 Sparse entity-level memory access within Transformers
  • (EAE, 2020) replaced dense parameter lookup with sparse entity-specific memory slots, achieving 43.2% EM on TriviaQA with 10x fewer parameters than T5-3B
2023-09 to 2024-07 Systematizing and scaling sparse memory approaches for LLMs
  • (Survey, 2023) provided a unified taxonomy of retrieval (sparse vs. dense) and generation (concatenation vs. fusion) strategies for memory-augmented models
  • (MoWE, 2024) introduced fixed vocabulary-based routing to turn MoE experts into semantic memory slots, outperforming T5-XL on TriviaQA at 8.6x fewer FLOPs
2025-11 to 2025-11 Multimodal personal memory QA with multi-signal retrieval
  • (Pensieve, 2025) combined offline memory augmentation with multi-signal retrieval (time, location, semantics), improving QA accuracy by up to 14% over standard multimodal RAG

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
End-to-End Memory Networks | Multiple soft-attention hops over memory allow iterative refinement of which sparse facts are relevant, without requiring supervision on which memories to read. | Original Memory Networks (Weston et al., 2015), which required explicit supervision labels for supporting facts at each layer | End-To-End (2015)
Entities as Experts | Replace dense parameter lookup with sparse, entity-specific memory slots that are activated only when the corresponding entity is mentioned. | Dense Transformer models (e.g., T5) that store all knowledge in shared parameters, requiring massive parameter counts to recall rare entity facts | Entities as Experts (2020)
Mixture-of-Word-Experts | Assign each word or entity to a dedicated FFN expert using a fixed vocabulary-based routing, turning MoE experts into semantic memory slots. | Standard MoE models with learned routing (e.g., GShard Top-2) that lack semantic specialization, and dense models (e.g., T5-XL/XXL) that require proportionally more FLOPs to scale | Memory Augmented Language Models through... (2024)
Pensieve | Pre-augment multimodal memories with structured metadata and retrieve using multiple explicit signals (time, location, semantics) rather than relying on a single embedding similarity. | Standard multimodal RAG pipelines that rely solely on semantic embedding similarity and cannot handle vague temporal or spatial references | Memory-QA (2025)
Taxonomy of Memory-Augmented LLMs | Organize the landscape of memory-augmented language models by their retrieval strategy (sparse vs. dense) and generation strategy (concatenation vs. fusion). | Ad-hoc descriptions of individual retrieval-augmented systems, which lacked a unifying categorization | Memory-Augmented (2023)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
TriviaQA (Open-Domain) | Exact Match (EM) | 44.8% | Memory Augmented Language Models through... (2024)
bAbI QA Tasks (10k training) | Mean Error Rate | 3.2% | End-To-End (2015)
MemoryQA | QA Accuracy | Up to 14% improvement over SOTA MM-RAG | Memory-QA (2025)

⚠️ Known Limitations (4)

  • Entity coverage dependency: Sparse entity memory methods (EAE, MoWE) require pre-defined entity vocabularies, so they cannot handle novel or unseen entities that emerge after training. (affects: Entities as Experts (EAE), Mixture-of-Word-Experts (MoWE))
    Potential fix: Dynamic entity discovery and embedding expansion during inference, or periodic vocabulary updates with incremental training.
  • Scalability of memory hops: Multi-hop memory reading (MemN2N) increases computational cost linearly with hop count, and determining the optimal number of hops for a given query remains an open problem. (affects: End-to-End Memory Networks (MemN2N))
    Potential fix: Adaptive hop termination mechanisms that dynamically decide when sufficient evidence has been gathered.
  • Offline augmentation bottleneck: Pensieve's approach requires pre-processing all memories with OCR, captioning, and metadata extraction, which may not scale to continuously growing personal memory stores in real time. (affects: Pensieve (Task-Oriented Memory Augmentation and Retrieval))
    Potential fix: Incremental and streaming augmentation pipelines that process new memories as they arrive rather than in batch.
  • Evaluation on narrow benchmarks: Most methods are evaluated on a small number of QA benchmarks (TriviaQA, bAbI, MemoryQA), making it unclear how well sparse memory approaches generalize to open-ended or conversational settings. (affects: Entities as Experts (EAE), End-to-End Memory Networks (MemN2N), Mixture-of-Word-Experts (MoWE), Pensieve (Task-Oriented Memory Augmentation and Retrieval))
    Potential fix: Development of diverse, multi-domain evaluation suites that test sparse memory QA across conversational, long-form, and multi-turn settings.

💡 When memory stores grow from sparse collections to dense, continuously captured streams of daily photos, calendar entries, and activity logs, the retrieval challenge inverts—Dense Memory QA must distinguish the correct memory from a sea of near-duplicates rather than hunting for scattered fragments.

⚙️

Dense Memory QA

What: Dense Memory QA addresses question answering over large, highly similar personal memory stores—such as daily photos, calendar entries, and activity logs—where memories overlap significantly and contain many near-duplicates.

Why: As personal devices continuously capture vast streams of multimodal data, users need accurate answers to recall questions (e.g., 'What did I eat last Tuesday?'), but the sheer density and similarity of stored memories makes retrieval and reasoning extremely challenging.

Baseline: Standard multimodal RAG systems retrieve memories by semantic similarity alone and feed them to a language model, but they fail to exploit temporal/spatial signals and cannot handle noise from highly similar irrelevant memories or perform aggregation across heterogeneous data types.

  • Retrieving the right memories when many entries are near-duplicates with only subtle temporal or spatial differences
  • Leveraging vague temporal and location anchors (e.g., 'last week', 'at the mall') that require specialized parsing beyond semantic similarity
  • Aggregating information across multiple heterogeneous memory sources (images, tables, text logs) that may exceed context limits
  • Filtering out retrieval noise—irrelevant but semantically similar memories—without discarding genuinely useful context

🧪 Running Example

❓ How many times did I order coffee from Starbucks last month?

Baseline: A standard RAG system retrieves the top-k memories by semantic similarity to 'coffee Starbucks'. Because the user visits Starbucks daily, many near-identical receipts and photos are returned, often exceeding context limits or including irrelevant tea orders. The model cannot reliably count distinct events or resolve 'last month' to exact dates, producing an inaccurate answer.

Challenge: The user's memory store contains hundreds of highly similar Starbucks entries (receipts, photos, location check-ins). Many are near-duplicates from different days. The system must resolve 'last month' to a precise date range, deduplicate across modalities, and perform an aggregation (counting) that pure retrieval-then-generate pipelines struggle with.

✅ Pensieve (Multi-Signal Memory Retrieval): Pensieve augments each memory with rich text metadata (OCR, captions) offline, then scores candidates using time recency, date matching, and location matching alongside semantic similarity—filtering to only Starbucks visits within the correct month. Its noise-injected training teaches the answer generator to ignore irrelevant retrieved entries.
✅ ReQAP (Recursive Question Decomposition): ReQAP recursively decomposes the question into sub-operations: first a RETRIEVE step to find all Starbucks-related records, then an EXTRACT step to parse order details from unstructured text, and finally a COUNT aggregation—handling the full analytical pipeline that flat retrieval cannot.
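
The RETRIEVE, EXTRACT, and COUNT steps described above can be sketched as a small pipeline. This is a hand-written illustration, not ReQAP's implementation: the record layout, the keyword matching, and the rule that same-day entries describe a single order are all assumptions made for the example.

```python
from datetime import date


def last_month_range(today: date) -> tuple[date, date]:
    """Resolve 'last month' to (first day, last day) of the previous calendar month."""
    first_this = today.replace(day=1)
    last_prev = date.fromordinal(first_this.toordinal() - 1)
    return last_prev.replace(day=1), last_prev


def retrieve(records, keyword, start, end):
    """RETRIEVE: keep records mentioning the keyword inside the date range."""
    return [r for r in records
            if keyword.lower() in r["text"].lower() and start <= r["day"] <= end]


def extract_coffee_days(records):
    """EXTRACT: collect the day of each coffee entry; the set merges same-day duplicates."""
    return {r["day"] for r in records if "coffee" in r["text"].lower()}


records = [
    {"day": date(2025, 5, 3), "text": "Starbucks receipt: latte coffee"},
    {"day": date(2025, 5, 3), "text": "Photo: Starbucks coffee cup"},   # same visit
    {"day": date(2025, 5, 9), "text": "Starbucks receipt: green tea"},  # not coffee
    {"day": date(2025, 5, 20), "text": "Starbucks receipt: iced coffee"},
    {"day": date(2025, 4, 28), "text": "Starbucks receipt: coffee"},    # wrong month
]

start, end = last_month_range(date(2025, 6, 10))
count = len(extract_coffee_days(retrieve(records, "starbucks", start, end)))  # COUNT
```

Resolving the date range first and deduplicating before counting is what keeps the receipt and the photo from the same visit from being counted twice, which a retrieve-then-generate pipeline has no explicit step for.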

📈 Overall Progress

Research has shifted from generic semantic retrieval to specialized pipelines that exploit temporal/spatial signals and recursive decomposition for dense personal memory QA.

💡 Key Insights

💡 Semantic similarity alone is insufficient for dense memory retrieval; temporal and spatial signals are essential.

💡 Offline metadata augmentation enables text-based reasoning that matches expensive vision-language model performance.

💡 Noise-injected training makes answer generators robust to irrelevant but semantically similar retrieved memories.

💡 Recursive question decomposition bridges the gap between structured SQL queries and unstructured text retrieval.

💡 Distillation to small on-device models enables private personal QA without sending data to cloud services.

💡 Dense personal data demands hybrid operators that combine retrieval, extraction, and aggregation in a unified pipeline.


📅 Timeline

In 2025, two complementary directions emerged: multi-signal retrieval with noise-robust generation for multimodal recall questions (Pensieve), and recursive analytical decomposition for complex queries over massive heterogeneous personal data (ReQAP). Both approaches recognize that standard RAG is insufficient for dense personal memory stores.

2025-05 to 2025-11 Emergence of specialized pipelines for dense personal memory QA, moving beyond generic RAG toward multi-signal retrieval and recursive analytical decomposition
  • (ReQAP, 2025) introduced recursive question decomposition with hybrid RETRIEVE and EXTRACT operators, enabling complex aggregation queries over 100K+ token heterogeneous personal data archives while supporting distillation to small on-device models
  • (Pensieve, 2025) proposed task-oriented memory augmentation and multi-signal retrieval combining temporal, spatial, and semantic scoring, achieving up to 14% accuracy improvement over standard MM-RAG on the MemoryQA benchmark

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Task-Oriented Memory Augmentation and Multi-Signal Retrieval | Enrich memories with structured text metadata offline and retrieve using multiple explicit signals (time, location, semantics) rather than semantic similarity alone. | Standard multimodal RAG that relies solely on embedding-based semantic retrieval and expensive VLMs for visual reasoning | Memory-QA (2025)
Recursive Question Decomposition over Heterogeneous Data | Recursively break complex questions into a tree of retrieval, extraction, and aggregation operators that jointly handle structured and unstructured personal data. | Standard Text-to-SQL (which cannot handle unstructured text) and standard RAG (which cannot perform aggregations or handle 100K+ token archives) | Recursive Question Understanding for Complex... (2025)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
MemoryQA | QA Accuracy | Up to 14% improvement over SOTA MM-RAG | Memory-QA (2025)
PerQA | Accuracy on complex aggregation tasks | Significant improvement over Text-to-SQL baselines | Recursive Question Understanding for Complex... (2025)

⚠️ Known Limitations (3)

  • Reliance on metadata quality: Pensieve's augmentation pipeline depends on OCR and LLM-generated captions being accurate; errors in metadata propagate to retrieval and answer generation, particularly for low-quality images or ambiguous visual content. (affects: Pensieve)
    Potential fix: Incorporating confidence scores for augmented metadata and falling back to visual reasoning for low-confidence entries.
  • Scalability to very large personal archives: While ReQAP uses cascade pruning, recursive decomposition over tens of thousands of events may still face latency and cost challenges on resource-constrained devices. (affects: ReQAP)
    Potential fix: Further pruning strategies, indexing, and caching intermediate decomposition results to reduce per-query computation.
  • Limited evaluation scope: Each method is tested on its own benchmark (MemoryQA, PerQA) with no cross-evaluation, making it hard to compare their relative strengths on the same tasks or data distributions. (affects: Pensieve, ReQAP)
    Potential fix: Developing unified benchmarks for dense personal memory QA that span multimodal recall and complex analytical queries.

💡 Effective recall from stored memory is necessary but not sufficient for autonomous agents—agents must not only retrieve relevant past experiences but actively use them to plan, coordinate with other agents, and continuously improve their behavior, which motivates the specialized memory architectures explored in Memory for Agentic Systems.

🤖

Memory for Agentic Systems

What: This topic covers memory systems designed for LLM-based agents that enable persistent state, experience accumulation, and adaptive behavior across interactions, going beyond the single context window.

Why: Without memory, LLM agents are stateless—they repeat mistakes, forget user preferences, and cannot learn from experience, severely limiting their utility for long-running, real-world tasks.

Baseline: The baseline approach treats the LLM's context window as the sole memory, stuffing all prior interactions and instructions into the prompt. This leads to context overflow, information loss, and attention cost that grows quadratically with prompt length.

  • Context windows are finite and expensive, yet agents must reason over arbitrarily long interaction histories
  • Memory must evolve over time without accumulating errors from hallucinations, drift, or adversarial poisoning
  • Selecting what to remember, forget, or consolidate requires balancing relevance, recency, and cost
  • Memory systems introduce new attack surfaces where adversaries can inject or corrupt stored knowledge

🧪 Running Example

❓ A user asks their personal AI assistant to plan a birthday dinner for their partner, having discussed dietary restrictions (vegan), preferred cuisines (Japanese), and budget constraints across multiple past conversations over several months.

Baseline: A stateless LLM has no memory of prior conversations. It asks the user to re-specify all preferences from scratch, ignores that the partner recently became vegan (mentioned 3 months ago), and cannot recall the budget discussed last week. The context window approach might try to stuff all past conversations into the prompt, but this exceeds token limits and becomes prohibitively expensive.

Challenge: The assistant must retrieve specific facts (vegan diet, Japanese cuisine preference, budget) from different past sessions, resolve conflicts (the partner was vegetarian before but switched to vegan recently), and ignore irrelevant memories (past discussions about lunch spots), all while keeping context costs manageable.

✅ Memoria (Hybrid Memory with Knowledge Graph): Stores user preferences as weighted knowledge graph triplets with recency decay, so 'partner → diet → vegan' (recent) overrides 'partner → diet → vegetarian' (older), retrieving only relevant facts without loading entire conversation histories.
✅ Pichay (Demand Paging for Context): Evicts stale conversation content to a backing store, keeping only retrieval handles in context. When the agent needs the budget figure, it 'pages in' just that conversation segment, reducing context usage by over 90%.
✅ AI Persona (Dynamic Profile Dictionaries): Maintains a structured, continuously updated user profile dictionary with fields like 'partner_dietary_preferences: vegan' and 'dining_budget: $150', so the agent accesses a compact, current summary instead of raw conversation logs.
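As a rough illustration of the recency-decay idea in the Memoria example above, the sketch below stores (subject, relation, object) triplets with a timestamp and resolves conflicting facts by summing exponentially decayed weights. The class name `TripletStore`, the half-life parameter, and the exponential decay form are assumptions made for illustration; the source does not specify how Memoria computes its weights.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triplet:
    subject: str
    relation: str
    obj: str
    day: float  # day on which the fact was observed (hypothetical time unit)


class TripletStore:
    """Toy triplet memory; the half-life decay is an illustrative assumption."""

    def __init__(self, half_life_days: float = 30.0):
        self.half_life_days = half_life_days
        self.triplets = []

    def add(self, subject: str, relation: str, obj: str, day: float) -> None:
        self.triplets.append(Triplet(subject, relation, obj, day))

    def weight(self, t: Triplet, now: float) -> float:
        # The weight halves every `half_life_days` after the observation.
        age = max(0.0, now - t.day)
        return 0.5 ** (age / self.half_life_days)

    def lookup(self, subject: str, relation: str, now: float) -> Optional[str]:
        # Sum decayed weights per candidate object and return the heaviest one.
        scores = {}
        for t in self.triplets:
            if t.subject == subject and t.relation == relation:
                scores[t.obj] = scores.get(t.obj, 0.0) + self.weight(t, now)
        return max(scores, key=scores.get) if scores else None


# Made-up observations: two old "vegetarian" mentions, one recent "vegan" mention.
store = TripletStore(half_life_days=30.0)
store.add("partner", "diet", "vegetarian", day=0)
store.add("partner", "diet", "vegetarian", day=10)
store.add("partner", "diet", "vegan", day=270)
current_diet = store.lookup("partner", "diet", now=360)
```

With these made-up observations, the single recent "vegan" mention outweighs the two older "vegetarian" mentions, so only the current fact needs to enter the prompt.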

📈 Overall Progress

Agent memory has evolved from simple context stuffing to OS-inspired hierarchical systems with formal governance, language-level safety guarantees, and learned eviction policies.

📂 Sub-topics

Memory Architecture & Frameworks

7 papers

Core architectural patterns for agent memory, including memory hierarchies, hybrid storage systems, and language-level primitives for persistent state management.

Demand Paging (Pichay) · Cognitive Type Safety (Turn) · Memoria · SSGM Framework

Context Engineering & Optimization

4 papers

Methods for efficiently managing, compressing, and adaptively curating the information environment in which agents operate, treating context as a scarce resource.

Agentic Context Engineering (ACE) · Context Engineering Pyramid · Adaptive Omission (Agent-Omit)

Memory Security & Safety

4 papers

Threats, vulnerabilities, and governance frameworks for agent memory systems, including injection attacks, intent legitimation, and safety-governed memory evolution.

MINJA · SSGM Framework · Intent Legitimation Detection · MAESTRO Framework

Experience Accumulation & Workflow Learning

4 papers

Methods enabling agents to extract reusable knowledge from past interactions, including workflow induction, recursive processing, and simulation-based memory.

Agent Workflow Memory (AWM) · Recursive Language Models (RLMs) · Generative Agent Architecture
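To make "workflow induction" concrete, here is a minimal sketch that abstracts successful trajectories into parameterized templates and keeps those that recur. The trajectory format, the function names, and the frequency-based induction rule are all assumptions chosen for simplicity; they are a stand-in and not the procedure used by Agent Workflow Memory.

```python
from collections import Counter


def abstract(steps, variables):
    """Replace concrete argument values with named placeholders."""
    value_to_name = {v: k for k, v in variables.items()}
    return tuple(
        (action, "{" + value_to_name[arg] + "}" if arg in value_to_name else arg)
        for action, arg in steps
    )


def induce_workflows(trajectories, min_support=2):
    """Keep abstracted step sequences seen in at least `min_support` successes."""
    counts = Counter(
        abstract(t["steps"], t["variables"]) for t in trajectories if t["success"]
    )
    return [list(wf) for wf, n in counts.items() if n >= min_support]


def instantiate(workflow, variables):
    """Fill a workflow template with the variables of a new task."""
    return [(action, arg.format(**variables)) for action, arg in workflow]


# Made-up trajectories for a hypothetical web search task.
trajectories = [
    {"success": True, "variables": {"query": "red shoes"},
     "steps": [("click", "search_box"), ("type", "red shoes"), ("click", "submit")]},
    {"success": True, "variables": {"query": "blue hat"},
     "steps": [("click", "search_box"), ("type", "blue hat"), ("click", "submit")]},
    {"success": False, "variables": {"query": "green bag"},
     "steps": [("type", "green bag")]},
]
workflows = induce_workflows(trajectories)
plan = instantiate(workflows[0], {"query": "black coat"})
```

The two successful trajectories collapse into one template with a `{query}` slot, which can then be instantiated for a new task instead of being rediscovered step by step.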

💡 Key Insights

💡 Context windows are cache, not memory—treating them as infinite storage wastes over 20% of tokens on structural overhead.

💡 Memory injection attacks succeed at 98% rates through normal queries alone, requiring no privileged access to the memory store.

💡 Reusable workflow extraction from past trajectories yields 50%+ success rate improvements over solving tasks from scratch.

💡 Personalization memory creates safety vulnerabilities: benign retrieved memories can increase attack success rates by up to 243%.

💡 Recursive self-invocation allows LLMs to process inputs two orders of magnitude beyond their native context window limits.

💡 Formal memory governance with ground-truth anchoring is essential to prevent compounding errors from hallucination drift.

📅 Timeline

Research progressed from benchmarking memory limitations (2024) through scaling and security analysis (2025) to production-grade systems treating context as a managed OS resource with formal safety guarantees (2026).

2024-02 to 2024-12 Foundations: benchmarking agent memory limitations and learning reusable workflows from experience
  • (LoCoMo, 2024) established the first very long-term conversational memory benchmark (300+ turns), revealing that even frontier LLMs lag 56–73% behind humans on memory tasks
  • (AWM, 2024) introduced workflow induction from agent trajectories, achieving a +51.1% relative improvement in success rate on WebArena through reusable parameterized action templates
  • (AI Persona, 2024) redefined user profiles as dynamic learnable dictionaries continuously updated by an LLM-based optimizer rather than static RAG stores
2025-01 to 2025-12 Scaling memory to population-level simulations, exposing security vulnerabilities, and structured context management
  • (RLMs, 2025) enabled processing inputs two orders of magnitude beyond context limits through recursive self-invocation, outperforming GPT-5 by 28.4%
  • (Generative Agents, 2025) scaled memory-equipped agents to 1,000 individuals, achieving 0.85 correlation with human survey responses
  • (MINJA, 2025) demonstrated query-only memory injection attacks with 98.2% success rate, exposing critical vulnerabilities in agent memory stores
  • (ACE, 2025) introduced structured bullet-based context management with role decomposition, achieving +10.6% on agent benchmarks with 86.9% less latency
  • (Memoria, 2025) combined SQL-based short-term logs with a recency-weighted knowledge graph for scalable personalized conversational memory
2026-01 to 2026-03 Maturation: OS-level memory management, formal governance, language-level safety, and comprehensive field surveys
  • (Pichay, 2026) applied OS virtual memory principles to LLM context, reducing context consumption by 93% in production with only 0.025% page fault rate
  • (Turn, 2026) introduced a compiled language with memory isolation and typed inference as first-class primitives, enabling a multi-agent system in 89 lines of code
  • (SSGM, 2026) formalized memory governance by decoupling memory evolution from verification, introducing ground-truth anchoring against semantic drift
  • (Agent-Omit, 2026) trained agents via RL to adaptively omit redundant thoughts and observations, matching frontier model accuracy at 8B parameter scale
  • (PS-Bench, 2026) revealed that personalization increases attack success rates by up to 243.7%, demonstrating that the safety of memory-enhanced agents requires dedicated benchmarks

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Demand Paging for Context | Apply operating-system virtual memory principles (demand paging, eviction policies, and fault-driven pinning) to manage LLM context as a scarce cache resource. | Naive context stuffing, where all tool definitions, system prompts, and conversation history permanently occupy context regardless of usage | The Missing Memory Hierarchy: Demand... (2026)
Agentic Context Engineering | Decompose context management into granular bullets with role-separated adaptation and deterministic grow-and-refine merging to prevent information loss during updates. | Full-rewrite context adaptation methods like GEPA and Dynamic Cheatsheet that suffer from brevity bias and context collapse | Agentic Context Engineering (2025), Context Engineering (2026)
Agent Workflow Memory | Induce parameterized workflow templates from successful trajectories so agents can reuse proven strategies instead of solving every task from scratch. | Agents that solve each task independently without learning from prior experience, such as baseline ReAct-style approaches | Agent Workflow Memory (2024)
Recursive Language Models | Let the LLM programmatically decompose and recursively process long inputs via a code environment, treating itself as a callable function. | Vanilla long-context models that suffer from 'context rot' and compaction methods that lose critical details | Recursive Language Models (2025)
Cognitive Type Safety | Make memory isolation and context management compiler-enforced language invariants rather than fragile library conventions. | Framework-based approaches in Python/Rust where memory isolation, context bounds, and schema validation are application-level conventions prone to silent failures | Turn (2026)
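The demand-paging row above can be sketched as a small cache that keeps hot conversation segments resident and leaves only handles for evicted ones. This is a minimal sketch under stated assumptions: the class `PagedContext`, the segment handles, and the least-recently-used eviction policy are illustrative choices, and fault-driven pinning is not modeled. It does not reproduce Pichay's actual implementation.

```python
from collections import OrderedDict


class PagedContext:
    """Toy context manager: hot segments stay resident, cold ones are paged out."""

    def __init__(self, max_resident: int):
        self.max_resident = max_resident
        self.resident = OrderedDict()  # handle -> text, kept in LRU order
        self.backing_store = {}        # handle -> evicted text
        self.page_faults = 0

    def add(self, handle: str, text: str) -> None:
        self.resident[handle] = text
        self.resident.move_to_end(handle)
        self._evict_if_needed()

    def read(self, handle: str) -> str:
        if handle in self.resident:
            self.resident.move_to_end(handle)  # mark as recently used
            return self.resident[handle]
        # Page fault: bring the segment back from the backing store.
        self.page_faults += 1
        text = self.backing_store.pop(handle)
        self.add(handle, text)
        return text

    def _evict_if_needed(self) -> None:
        while len(self.resident) > self.max_resident:
            handle, text = self.resident.popitem(last=False)  # least recently used
            self.backing_store[handle] = text

    def render(self) -> str:
        # Resident segments appear in full; evicted ones only as handles.
        lines = [f"[{h}] {t}" for h, t in self.resident.items()]
        lines += [f"[{h}] <paged out>" for h in self.backing_store]
        return "\n".join(lines)


# Made-up conversation segments from the birthday-dinner example.
ctx = PagedContext(max_resident=2)
ctx.add("s1", "budget is $150")
ctx.add("s2", "lunch spots")
ctx.add("s3", "partner is vegan")
budget = ctx.read("s1")  # s1 was evicted, so this read is a page fault
```

Reading the budget segment pages it back in and pushes the least recently used segment out, so the prompt built by `render()` holds at most two full segments plus short handles.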

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
WebArena | Success Rate | +51.1% relative improvement | Agent Workflow Memory (2024)
LoCoMo (Very Long-Term Conversational Memory) | Accuracy (QA tasks) | 44% of human performance on memory QA | Evaluating Very Long-Term Conversational Memory... (2024)
OOLONG / OOLONG-Pairs (Long-Context Processing) | F1 Score | 58.0% F1 on OOLONG-Pairs | Recursive Language Models (2025)

⚠️ Known Limitations (5)

  • Memory systems lack standardized evaluation benchmarks, making it difficult to compare approaches or measure progress toward human-level memory capabilities (existing benchmarks show 56–73% gaps vs. humans). (affects: LoCoMo, Agent Workflow Memory, Generative Agent Architecture)
    Potential fix: The MemoryArena and LoCoMo benchmarks represent early steps; the field needs unified benchmarks spanning episodic, semantic, and procedural memory.
  • Persistent memory introduces novel attack surfaces where adversaries can inject malicious memories through normal interactions, and current defenses are insufficient against progressive injection strategies. (affects: MINJA, Memoria, AI Persona)
    Potential fix: The SSGM framework proposes decoupling memory evolution from governance with verification protocols and ground-truth anchoring against an immutable observation ledger.
  • Memory consolidation and summarization can cause 'semantic drift'—repeated compression cycles gradually distort or lose critical details, leading to knowledge corruption over time. (affects: Agentic Context Engineering (ACE), SSGM Framework, AI Persona)
    Potential fix: ACE uses deterministic grow-and-refine merging instead of full LLM rewrites; SSGM uses immutable observation ledgers for periodic reconciliation.
  • Most memory architectures are evaluated on specific task domains (web navigation, conversation) and lack evidence of generalization across diverse agent applications and deployment environments. (affects: Agent Workflow Memory, Adaptive Omission (Agent-Omit), Demand Paging (Pichay))
    Potential fix: Cross-domain evaluations such as AWM's Mind2Web cross-domain tests and Agent-Omit's 5-benchmark evaluation are initial efforts toward demonstrating generalization.
  • Enterprise governance for memory-equipped agents remains immature—75% of enterprises plan agent deployment within two years but only 21% have mature governance models for managing persistent agent state. (affects: Context Engineering Pyramid, SSGM Framework)
    Potential fix: The Pyramid of Agent Engineering and MAESTRO framework provide maturity models and layered threat analysis, but operational tooling remains sparse.

💡 To ground the discussion of agent memory in concrete design principles, we begin with Agentic Memory Architecture, which establishes the structural patterns for integrating episodic, semantic, and procedural memory modules into LLM-based agent systems.

📐

Agentic Memory Architecture

What: Agentic Memory Architecture covers design patterns and frameworks for integrating memory modules—such as episodic recall, semantic knowledge stores, and experience banks—into LLM-based agent systems, enabling them to retain, retrieve, and reason over past interactions.

Why: Without structured memory, LLM agents treat every task in isolation, repeating past errors and failing to personalize or adapt over time. Memory architectures are essential for building agents that learn continuously and behave reliably across long-running sessions.

Baseline: The conventional approach is to rely on fixed-length context windows or simple retrieval-augmented generation (RAG), where raw interaction logs are stored in a vector database and retrieved verbatim at inference time, without any structured distillation or cognitive organization.

  • Memory starvation and context flooding: agents either lose critical information as context windows overflow, or are overwhelmed by irrelevant retrieved content
  • Shallow personalization: most systems mimic surface-level style rather than capturing latent user beliefs, preferences, and reasoning patterns
  • Silent cognitive degradation: internal failures such as planner recursion and memory drift accumulate over time without triggering any explicit error signals
  • Experience distillation: extracting generalizable reasoning strategies from raw interaction trajectories rather than simply storing logs

🧪 Running Example

❓ An AI assistant is helping a user who frequently debates policy topics. After 50 prior conversations, the user asks: 'What are the strongest arguments against universal basic income?' The agent should tailor its response to the user's known libertarian-leaning values and analytical communication style.

Baseline: A standard RAG-based agent retrieves the conversation snippets most similar to the query by embedding similarity. It returns generic counter-arguments to UBI without reflecting the user's established ideological lens or preferred argumentation depth, producing a response that feels impersonal and shallow.

Challenge: The agent must distinguish between episodic details (specific past debates the user had) and semantic traits (the user's core beliefs and reasoning style), while avoiding memory drift where hallucinated content from earlier sessions contaminates future responses.

✅ Cognitive Dual-Memory (PRIME): Splits memory into Episodic Memory (recalling specific past debates the user engaged in) and Semantic Memory (internalizing the user's libertarian values and analytical style), then uses 'Personalized Thinking' to generate reasoning traces aligned with the user's belief system before composing the response.
✅ ReasoningBank with MaTTS: Distills structured reasoning strategies from the agent's prior successful and failed attempts at persuasive argumentation, retrieves relevant strategies at test time, and uses Memory-aware Test-Time Scaling to explore multiple argument framings before selecting the one best aligned with the user's preferences.
✅ QSAF Resilience Controls: Monitors for cognitive degradation signals such as memory drift (where hallucinated content from past sessions might distort the user's profile) and context flooding (where too many retrieved memories overwhelm the planner), triggering fallback logic to maintain response quality.
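The episodic/semantic split described for PRIME can be pictured with a small data structure. The sketch below is only an illustration: `DualMemory`, its method names, and the topic-matching rule are assumptions, and the 'Personalized Thinking' step is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    # Episodic memory: specific past events, stored as (topic, summary) pairs.
    episodic: list = field(default_factory=list)
    # Semantic memory: abstracted, slowly changing user traits.
    semantic: dict = field(default_factory=dict)

    def record_episode(self, topic: str, summary: str) -> None:
        self.episodic.append((topic, summary))

    def set_trait(self, name: str, value: str) -> None:
        self.semantic[name] = value

    def build_context(self, topic: str, max_episodes: int = 2) -> str:
        # Traits are always included; episodes only when they match the topic.
        traits = [f"- {k}: {v}" for k, v in sorted(self.semantic.items())]
        recent_first = reversed(self.episodic)
        episodes = [f"- {s}" for t, s in recent_first if t == topic][:max_episodes]
        return "\n".join(["User traits:", *traits, "Relevant past debates:", *episodes])


# Made-up profile and episodes for the UBI running example.
memory = DualMemory()
memory.set_trait("values", "libertarian-leaning")
memory.set_trait("style", "analytical")
memory.record_episode("ubi", "Argued that UBI distorts labor markets.")
memory.record_episode("zoning", "Favored deregulating housing supply.")
context = memory.build_context("ubi")
```

The stable traits reach the prompt on every query, while episodes are filtered by topic, which keeps unrelated past debates out of the context.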

📈 Overall Progress

Agent memory has evolved from flat retrieval buffers to cognitively inspired architectures that separate memory types, distill reasoning strategies, and monitor for degradation.

💡 Key Insights

💡 Semantic memory (abstracted beliefs) outperforms episodic memory (raw recall) for robust user personalization.

💡 Agents suffer silent cognitive degradation from internal failures, not just external adversarial attacks.

💡 Distilling structured reasoning from past trajectories yields compounding performance gains over time.

💡 Memory-aware test-time scaling enables diverse exploration that improves both accuracy and efficiency.

💡 Cross-session memory poisoning is a real threat where hallucinated content persists across agent interactions.

💡 Cognitively inspired memory separation mirrors human dual-memory systems and improves agent adaptability.

📅 Timeline

Research in mid-2025 converged on three complementary fronts: cognitive memory organization for personalization, resilience frameworks against internal memory failures, and experience distillation for continuous agent learning—collectively moving the field beyond simple RAG toward structured, self-improving memory systems.

2025-07 to 2025-09 Emergence of structured memory architectures for LLM agents, spanning personalization, resilience, and experience-driven learning
  • (PRIME, 2025) introduced a cognitive dual-memory framework that separates episodic and semantic memory for LLM personalization, demonstrating that semantic memory instantiations are more robust than episodic approaches for capturing user traits
  • (QSAF, 2025) formalized Cognitive Degradation as a vulnerability class in agentic AI, identifying critical failures like planner entrapment and cross-session memory poisoning across LLaMA3, Mixtral, Claude, and ChatGPT
  • (ReasoningBank, 2025) proposed memory-driven experience scaling with MaTTS, achieving +8.3% success rate on WebArena and +34.2% relative improvement on WebArena-Shopping through structured reasoning distillation

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Cognitive Dual-Memory Framework | Split personalization memory into episodic recall and semantic belief modeling, then use self-distilled reasoning traces to align outputs with internalized user traits. | Fragmented personalization approaches (retrieval-only or fine-tuning-only) that capture surface-level style rather than latent user beliefs | PRIME (2025)
Memory-Driven Experience Scaling | Extract structured reasoning strategies from both successes and failures, then use them at test time to guide diverse solution exploration with contrastive feedback. | Memory-free agents that treat every task in isolation and standard experience replay that stores raw logs without distillation | ReasoningBank (2025)
Cognitive Degradation Resilience | Model internal agent failures as a formal cognitive degradation lifecycle and deploy runtime behavioral controls that detect and mitigate silent drift before it causes system collapse. | Traditional external threat defenses (prompt injection filters) that ignore internally originating agent failures | QSAF (2025)
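The "Memory-Driven Experience Scaling" row can be illustrated with a small bank of distilled lessons drawn from both successful and failed attempts. This is a sketch under assumptions: `StrategyBank`, the `Strategy` fields, and the word-overlap (Jaccard) retrieval are hypothetical simplifications, and neither the distillation step nor MaTTS is modeled.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    task: str        # description of the task the lesson came from
    lesson: str      # distilled, reusable advice
    succeeded: bool  # whether the source attempt succeeded


def _tokens(text: str) -> set:
    return set(text.lower().split())


def _jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class StrategyBank:
    def __init__(self):
        self.items = []

    def add(self, task: str, lesson: str, succeeded: bool) -> None:
        self.items.append(Strategy(task, lesson, succeeded))

    def retrieve(self, task: str, k: int = 2) -> list:
        # Rank stored strategies by word overlap with the new task description.
        query = _tokens(task)
        ranked = sorted(
            self.items, key=lambda s: _jaccard(query, _tokens(s.task)), reverse=True
        )
        return ranked[:k]


# Made-up lessons; failures are stored alongside successes.
bank = StrategyBank()
bank.add("find cheapest laptop on shopping site",
         "Sort results by price before opening product pages.", True)
bank.add("post a comment on forum thread",
         "Log in first; anonymous posting fails silently.", False)
bank.add("find cheapest phone on shopping site",
         "Do not trust the first page; check pagination.", False)
hits = bank.retrieve("find cheapest tablet on shopping site", k=2)
```

Both shopping lessons are retrieved for the new shopping task, one learned from a success and one from a failure, while the unrelated forum lesson is left out.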

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
WebArena | Success Rate | +8.3% over memory-free baseline | ReasoningBank (2025)
WebArena-Shopping (with MaTTS parallel scaling k=5) | Success Rate | +34.2% relative improvement | ReasoningBank (2025)
Change My View (CMV) | Personalization Accuracy | Best among all Semantic Memory instantiations | PRIME (2025)

⚠️ Known Limitations (4)

  • Scalability of memory distillation: extracting and indexing structured reasoning strategies from large trajectory histories becomes computationally expensive as the number of interactions grows, potentially limiting deployment in high-throughput settings. (affects: ReasoningBank + MaTTS)
    Potential fix: Hierarchical memory compression and periodic consolidation of older strategies into more abstract summaries, similar to how human memory consolidates during sleep.
  • Memory poisoning and drift: hallucinated or incorrect content stored in vector memory can propagate across sessions, corrupting the agent's knowledge base without any explicit error signal, making it extremely difficult to detect and correct. (affects: Cognitive Degradation Resilience (QSAF), Cognitive Dual-Memory Framework (PRIME))
    Potential fix: QSAF proposes runtime behavioral controls that monitor entropy drift and trigger fallback logic; provenance tracking and periodic memory auditing could further mitigate this.
  • Benchmark coverage for personalization: current benchmarks like CMV focus on debate-style persuasion, but real-world personalization spans diverse domains (e.g., shopping, coding assistance, healthcare), and it remains unclear how well dual-memory approaches generalize. (affects: Cognitive Dual-Memory Framework (PRIME))
    Potential fix: Developing multi-domain personalization benchmarks that test latent belief modeling across diverse task types rather than a single genre.
  • Lack of standardized evaluation for resilience: there is no widely accepted benchmark for measuring cognitive degradation or silent drift in agents, making it difficult to compare resilience frameworks objectively. (affects: Cognitive Degradation Resilience (QSAF))
    Potential fix: Community-developed stress-test suites that systematically induce memory starvation, context flooding, and planner recursion under controlled conditions.

💡 Once an agent's memory architecture is established, the next question is how agents can actively learn from what they store, which is the focus of Experience Replay and Reflection—mechanisms for revisiting past execution traces, distilling lessons from successes and failures, and continuously improving future performance.

🎯

Experience Replay and Reflection

What: Experience replay and reflection encompasses mechanisms that enable AI agents to learn from past interactions by storing, retrieving, and reasoning over prior execution traces, successes, and failures.

Why: Without the ability to learn from accumulated experience, agents repeatedly make the same mistakes, waste computation rediscovering known solutions, and fail to improve over time—mirroring a worker who never keeps notes.

Baseline: Conventional agents use static pipelines with no persistent memory: each new task is approached from scratch, relying solely on the model's pretrained knowledge and in-context examples without any history of past trials.

  • Catastrophic forgetting: replaying old experiences can interfere with learning new tasks, requiring careful scheduling of what and when to replay
  • Memory scalability: storing full execution traces grows prohibitively expensive; agents must selectively curate which experiences to retain
  • Cross-system knowledge transfer: experiences captured in one agent framework are typically incompatible with another, preventing collective learning
  • Reflection quality: self-reflection can be shallow or hallucinatory unless grounded in verifiable evidence from actual execution outcomes

🧪 Running Example

❓ An AI agent is asked to solve a series of Kaggle-style machine learning competitions sequentially. After completing five competitions, it encounters a sixth that shares characteristics with the second competition.

Baseline: A baseline agent without experience replay starts from scratch on each competition, retrying hyperparameter combinations and data-processing strategies that already failed in earlier tasks. It cannot recall that a similar feature-engineering approach worked well in competition #2.

Challenge: The agent must balance retaining useful insights from all five prior competitions (stability) while adapting to the new competition's unique requirements (plasticity), without its memory growing unboundedly.

✅ Persistent Memory Evolution: EvoScientist's Evolution Manager distills insights from every past run into Ideation Memory and Experimentation Memory, so the agent retrieves proven strategies from competition #2 and avoids known failure patterns.
✅ Adaptive Replay Scheduling: MSSR models each past data sample's 'memory strength' using a forgetting curve, scheduling frequent replay for rapidly fading knowledge and sparse replay for well-consolidated skills, preventing forgetting of competition #2 insights.
✅ Cross-Framework Experience Transfer: AGENTKB's universal memory layer would allow the agent to retrieve successful workflows from a completely different agent framework that solved a similar competition, injecting fixes via a Reason-Retrieve-Refine loop.
✅ Reflection-Driven Control: The Reflection-Driven Control module interrupts the agent's generation loop when it detects risky or previously failed code patterns, grounding corrections in a dual-layer Reflective Memory of past repairs and known best practices.
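The forgetting-curve scheduling mentioned for MSSR above can be sketched with the textbook exponential form R(t) = exp(-t / S), where t is the time since the last replay and S is a per-sample memory strength. The threshold, the multiplicative strength update, and all names below are assumptions for illustration; the source does not give MSSR's actual formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplayItem:
    sample_id: str
    strength: float   # larger strength means slower forgetting
    last_seen: float  # time of the most recent (re)play


def retention(item: ReplayItem, now: float) -> float:
    # Exponential forgetting curve: R = exp(-elapsed / strength).
    return math.exp(-(now - item.last_seen) / item.strength)


def due_for_replay(items, now: float, threshold: float = 0.5):
    """Return items whose estimated retention fell below the threshold, weakest first."""
    due = [it for it in items if retention(it, now) < threshold]
    return sorted(due, key=lambda it: retention(it, now))


def replay(item: ReplayItem, now: float, gain: float = 2.0) -> None:
    # Assumed update rule: each replay multiplies strength, spacing out later replays.
    item.strength *= gain
    item.last_seen = now


items = [ReplayItem("fading", 1.0, 0.0), ReplayItem("consolidated", 10.0, 0.0)]
first_due = [it.sample_id for it in due_for_replay(items, now=2.0)]
replay(items[0], now=2.0)
second_due = [it.sample_id for it in due_for_replay(items, now=3.0)]
```

At time 2 only the weak sample is due; after one replay its strength doubles and it is no longer due at time 3, which is the spaced-repetition behavior the text describes.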

📈 Overall Progress

Experience replay has evolved from isolated per-task memory to persistent, cross-framework knowledge systems that enable agents to continuously self-improve across runs and architectures.

💡 Key Insights

💡 Agents that persist and retrieve past experiences dramatically outperform stateless systems that start each task from scratch.

💡 Cognitive science principles like spaced repetition and forgetting curves transfer effectively to LLM continual learning.

💡 Cross-framework memory sharing unlocks collective intelligence that no single agent architecture can achieve alone.

💡 Structured reflection integrated into the reasoning loop is far more effective than post-hoc self-correction patches.

💡 Separating memory into distinct stores (ideation vs. experimentation, short-term vs. long-term) improves both plasticity and stability.

💡 Selective memory retrieval—feeding only relevant experiences—prevents context overflow and reasoning disruption.

📅 Timeline

Early work (mid-2025) focused on structuring how agents store and retrieve past experiences—via tree-structured exploration, cross-framework schemas, and reflection loops. By early 2026, the field shifted toward principled replay scheduling inspired by cognitive science and multi-agent evolution with dual memory systems.

2025-06 to 2025-12 Foundational frameworks for experience-driven agent improvement
  • (ML-Master, 2025) reformulated AI development as Monte Carlo Tree Search with adaptive memory, achieving 29.3% medal rate on MLE-Bench—surpassing the prior best of 22.4%
  • (AGENTKB, 2025) introduced a universal cross-framework memory layer enabling +18.7pp improvement on GAIA and +17.0pp on SWE-bench Lite through collective experience sharing
  • (Reflection-Driven, 2025) elevated self-reflection to a first-class internal control circuit with Plan–Reflect–Verify, grounding corrections in verified past repairs
2026-01 to 2026-03 Scaling replay mechanisms for continual learning and multi-agent evolution
  • (EvoScientist, 2026) introduced dual persistent memories with an Evolution Manager for scientific discovery, achieving 100% paper acceptance at ICAIS 2025 including Best Paper Award
  • (MSSR, 2026) modeled per-sample memory strength via Ebbinghaus forgetting curves for continual LLM fine-tuning, outperforming baselines across 3 backbone models on 11-task sequences
  • (ARROW, 2026) combined short-term and long-term replay buffers with reservoir sampling for continual RL, achieving 4x less forgetting on Atari benchmarks
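Reservoir sampling, mentioned in the ARROW entry above, keeps a fixed-size uniform sample of an unbounded stream. The sketch below pairs a classic reservoir with a short FIFO buffer; assigning the reservoir to the long-term buffer and the FIFO to the short-term one is an assumption, as are the class names and the sampling interface.

```python
import random
from collections import deque


class ReservoirBuffer:
    """Fixed-size uniform sample of a stream (classic reservoir sampling)."""

    def __init__(self, capacity: int, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Keep the new item with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


class DualReplay:
    def __init__(self, short_size: int, long_size: int, seed=None):
        self.short_term = deque(maxlen=short_size)         # most recent transitions
        self.long_term = ReservoirBuffer(long_size, seed)  # sample of all history
        self.rng = random.Random(seed)

    def add(self, transition) -> None:
        self.short_term.append(transition)
        self.long_term.add(transition)

    def sample(self, n_short: int, n_long: int) -> list:
        recent = list(self.short_term)
        old = self.long_term.items
        return (self.rng.sample(recent, min(n_short, len(recent)))
                + self.rng.sample(old, min(n_long, len(old))))


buffer = DualReplay(short_size=5, long_size=10, seed=0)
for step in range(1000):
    buffer.add(step)
batch = buffer.sample(n_short=3, n_long=3)
```

Memory stays bounded at 15 stored transitions regardless of stream length, and each batch mixes recent transitions with a sample drawn from the whole history.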

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Persistent Memory Evolution | Agents maintain evolving long-term memories that are continuously updated with distilled insights from every interaction, so future runs start from an increasingly informed baseline. | Static agent pipelines that treat each run independently with no cross-run learning | EvoScientist (2026), ML-Master (2025)
Adaptive Replay Scheduling | Model each sample's forgetting risk and prioritize replay for rapidly fading knowledge, mirroring how human memory benefits from spaced repetition. | Fixed-interval and random replay strategies that waste compute on already-consolidated knowledge or react too late to forgetting | MSSR (2026), ARROW (2026)
Cross-Framework Experience Transfer | Unify execution traces from incompatible agent frameworks into a shared memory layer so agents can learn from collective experience across systems. | Framework-specific memory systems where knowledge is trapped within individual agent architectures | AGENTKB (2025)
Reflection-Driven Control | Make reflection an explicit, structured step in the agent's generation loop rather than an afterthought, using verified past repairs to ground self-correction. | Post-hoc safety patches and unstructured self-reflection that lack integration into the agent's internal reasoning process | Reflection-Driven (2025)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
MLE-Bench | Average Medal Rate | 29.3% | ML-Master (2025)
GAIA | Pass@3 | 73.9% | AGENTKB (2025)
SWE-bench Lite | Pass@100 | 45.7% | AGENTKB (2025)

⚠️ Known Limitations (4)

  • Memory curation overhead: deciding what to store, when to forget, and how to index experiences adds significant engineering complexity and computational cost, especially as interaction histories grow. (affects: Persistent Memory Evolution, Adaptive Replay Scheduling, Cross-Framework Experience Transfer)
    Potential fix: Reservoir sampling and memory-strength modeling can automate curation, but optimal forgetting policies remain an open problem.
  • Reflection hallucination: when self-reflection is not grounded in verifiable execution outcomes, agents may generate plausible but incorrect diagnoses of their failures, compounding errors. (affects: Reflection-Driven Control, Persistent Memory Evolution)
    Potential fix: Grounding reflection in dual-layer memory (dynamic past repairs + static standards) and routing only risky outputs through reflection, as proposed by Reflection-Driven Control.
  • Cross-framework schema fragility: abstracting execution traces into a universal schema risks losing framework-specific details that are critical for reproducing successful workflows in their original context. (affects: Cross-Framework Experience Transfer)
    Potential fix: The disagreement gate in AGENTKB partially addresses this by filtering out retrieved knowledge that conflicts with the agent's current reasoning, but richer schema representations may be needed.
  • Evaluation generalizability: most methods are evaluated on specific benchmark suites (Atari, MLE-Bench, GAIA), and their effectiveness in open-ended, real-world deployment scenarios remains largely unvalidated. (affects: Adaptive Replay Scheduling, Persistent Memory Evolution, Cross-Framework Experience Transfer)
    Potential fix: Broader evaluation across diverse task distributions and longer time horizons would strengthen confidence in these methods' practical utility.

💡 The refined strategies and procedural knowledge distilled through experience replay naturally feed into Memory-Augmented Planning, where agents retrieve and leverage their accumulated experiences to make better decisions and execute multi-step plans more efficiently.

🔄

Memory-Augmented Planning

What: Memory-augmented planning studies how AI agents can store, retrieve, and leverage past experiences—such as solved tasks, interaction histories, and procedural skills—to improve future planning and decision-making, rather than reasoning from scratch each time.

Why: Without persistent memory, agents waste computation rediscovering solutions to previously encountered problems and cannot transfer hard-won knowledge across tasks, domains, or frameworks. Memory-augmented planning is essential for building agents that improve over time and operate efficiently in real-world settings.

Baseline: Baseline agents treat each task independently, relying solely on the current prompt context and tool documentation. They have no mechanism to recall prior successes or failures, leading to repeated mistakes and inefficient exploration.

  • Deciding what to remember: filtering useful experience from noise without exceeding storage or retrieval budgets
  • Transferring knowledge across heterogeneous agent frameworks and domains without introducing conflicting or stale information
  • Balancing short-term task-specific context with long-term generalizable knowledge to avoid overfitting to past solutions
  • Retrieving relevant memories at the right time during multi-step planning under latency and compute constraints

🧪 Running Example

❓ An AI coding agent is asked to fix a failing CI pipeline caused by a dependency conflict between two packages, a problem pattern that has been solved in other projects and frameworks before.

Baseline: A standard agent reads the error log and attempts to resolve the conflict from scratch. It may try several incorrect approaches (e.g., pinning the wrong version, removing a needed dependency) because it has no memory of how similar conflicts were resolved previously, wasting multiple iterations.

Challenge: The solution exists in execution traces from a different agent framework (e.g., OpenHands solved a similar conflict last month), but that knowledge is siloed. Even within the same framework, the agent's previous successful fix was lost when the context window cleared.

✅ Cross-Framework Experience Replay (AGENTKB): AGENTKB retrieves a matching execution trace from its universal memory layer—even if the original fix was performed by a different agent framework—and injects the successful resolution strategy into the current agent's planning loop.
✅ Simulated Trial and Error (STE): STE's long-term memory contains distilled success patterns from past tool-use episodes; it retrieves the relevant pattern for dependency resolution, enabling the agent to skip failed approaches and converge on a correct fix quickly.
✅ Reusable Agentic Skills: A pre-formalized 'resolve-dependency-conflict' skill (with applicability conditions, execution policy, and termination criteria) is retrieved from the skill library, providing a tested procedure instead of ad-hoc reasoning.
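
The skill-retrieval step in the last item can be sketched as follows. This is a minimal illustration, not the formalization used in the Agentic Skills work: the `Skill` dataclass, its field names, and the keyword-based applicability check are assumptions made for the example, and only the three components named above are modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Skill:
    """Hypothetical skill record; field names are illustrative assumptions."""
    name: str
    applies: Callable[[str], bool]   # applicability condition over the task text
    steps: list[str]                 # execution policy as an ordered procedure
    done: Callable[[dict], bool]     # termination criterion over agent state


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: list[Skill] = []

    def add(self, skill: Skill) -> None:
        self._skills.append(skill)

    def retrieve(self, task: str) -> Optional[Skill]:
        """Return the first stored skill whose applicability condition holds."""
        for skill in self._skills:
            if skill.applies(task):
                return skill
        return None


library = SkillLibrary()
library.add(Skill(
    name="resolve-dependency-conflict",
    applies=lambda task: "dependency conflict" in task.lower(),
    steps=["read the resolver error", "find a compatible version range",
           "pin both packages", "re-run CI"],
    done=lambda state: state.get("ci_status") == "green",
))

skill = library.retrieve("Fix failing CI: dependency conflict between two packages")
```

A production library would likely replace the substring check with embedding or structured matching; the point here is only that retrieval returns a tested procedure plus an explicit stop condition.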

📈 Overall Progress

Memory-augmented planning has evolved from biologically inspired dual-memory learning to universal cross-framework experience sharing and deployment-ready predictive memory systems.

📂 Sub-topics

Cross-Framework Experience Transfer

3 papers

Methods for abstracting, storing, and reusing agent execution traces and procedural knowledge across different agent architectures and task domains.

Cross-Framework Experience Replay · Reusable Agentic Skill Formalization

Dual-Memory Learning Architectures

2 papers

Approaches that separate agent memory into short-term (within-episode) and long-term (cross-episode) stores, inspired by biological memory systems, to balance exploration depth with experience breadth.

Simulated Trial and Error · Dual Short/Long-Term Agent Memory
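
The short-term/long-term split described above can be illustrated with a small sketch. It is not the STE implementation: the `DualMemory` class, the rule that only successful episodes are consolidated, and the keyword-overlap retriever are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Hypothetical dual store: within-episode buffer plus cross-episode archive."""
    short_term: list[str] = field(default_factory=list)
    long_term: list[dict] = field(default_factory=list)

    def observe(self, event: str) -> None:
        # Short-term memory accumulates every step of the current episode.
        self.short_term.append(event)

    def end_episode(self, task: str, success: bool) -> None:
        # Assumption: only successful episodes are consolidated into long-term memory.
        if success:
            self.long_term.append({"task": task, "trace": list(self.short_term)})
        self.short_term.clear()

    def recall(self, task: str) -> list[dict]:
        # Naive keyword overlap stands in for a real retriever.
        words = set(task.lower().split())
        return [m for m in self.long_term if words & set(m["task"].lower().split())]


memory = DualMemory()
memory.observe("call weather_api(city='Paris')")
memory.observe("parse JSON response")
memory.end_episode("get weather forecast", success=True)

memory.observe("call wrong_tool()")
memory.end_episode("book flight", success=False)

matches = memory.recall("weather in Berlin")
```

The short-term buffer is cleared at every episode boundary, so only the consolidated record survives to inform later episodes.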

Predictive and Domain-Structured Memory

4 papers

Systems that organize memory around domain-specific structures (clinical records, social interaction histories, hardware profiles) or use predictive pre-fetching to reduce retrieval latency during planning.

Predictive Memory Pre-fetching · Domain-Structured Memory

💡 Key Insights

💡 Cross-framework memory sharing yields double-digit accuracy gains (e.g., +18.7 percentage points on GAIA pass@3 for AGENTKB) by breaking knowledge silos between agent architectures.

💡 Biologically inspired dual memory (short-term + long-term) enabled a 7B-parameter model to outperform GPT-4 on tool-use accuracy (76.8% vs 60.8%).

💡 Self-generated skills often degrade performance; curated skill libraries are significantly more reliable.

💡 Predictive memory pre-fetching can speed up retrieval by over 300x on cache hits (316x reported for VoiceAgentRAG) in real-time voice applications.

💡 Domain-structured memory (e.g., clinical document trees) outperforms generic vector stores in safety-critical settings.

💡 Standardized evaluation protocols are essential—unstandardized agent comparisons produce high-variance, unreproducible results.


📅 Timeline

Early work (2024) focused on giving individual agents short-term and long-term memory inspired by human cognition. By mid-2025, the field shifted toward breaking down memory silos across agent frameworks with universal experience layers. The latest work (2026) emphasizes formalization of reusable procedural skills, domain-structured memory for safety-critical deployments, and latency-optimized memory access for real-time applications.

2024-03 to 2025-05 Biologically inspired dual-memory architectures for agent learning
  • (STE, 2024) introduced biologically inspired simulated trial-and-error with short-term and long-term memory, enabling a 7B parameter model to surpass GPT-4 on tool-use accuracy (76.8% vs 60.8%)
  • (MAGS, 2025) extended dual-memory to multi-agent feature engineering, using a Router agent with short-term trajectory refinement and long-term demonstration retrieval
2025-06 to 2025-12 Modular frameworks, cross-framework memory, and social grounding
  • (OAgents, 2025) demonstrated that modular adaptive memory with periodic plan revision achieves state-of-the-art on GAIA among open-source agent frameworks
  • (AGENTKB, 2025) introduced a universal cross-framework memory layer, improving GAIA pass@3 by 18.7 percentage points and SWE-bench Lite pass@100 by 17.0 percentage points
  • (Social-RAG, 2025) treated group interaction history as a social knowledge base, successfully deploying in 18 Slack channels with 500+ researchers
2026-01 to 2026-03 Formalization of skill memory, deployment-ready systems, and latency optimization
  • (Agentic Skills, 2026) formalized skills as 4-tuple persistent memory modules with a 7-stage lifecycle, showing curated skills improve pass rates by 16.2 percentage points while self-generated skills can degrade performance
  • (VoiceAgentRAG, 2026) introduced predictive memory pre-fetching with a dual-agent architecture, achieving 316x retrieval speedup on cache hits for voice AI
  • (AOSH, 2026) replaced vector embeddings with Page-Indexed Memory for secure clinical agent deployment with least-privilege execution
  • (HeRo, 2026) optimized agentic RAG memory access patterns on mobile SoCs, reducing end-to-end latency by up to 10.94x

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Cross-Framework Experience Replay | Unify agent experiences from heterogeneous frameworks into a single, framework-agnostic memory that any agent can query during planning. | Framework-specific memory systems that trap knowledge within individual agent architectures | AGENTKB (2025)
Simulated Trial and Error | Let agents learn tool use through simulated practice with biologically inspired short-term and long-term memory, rather than from static documentation alone. | Documentation-based tool learning and standard supervised fine-tuning on tool-use examples | LLMs (2024), Agentic Feature Augmentation (2025)
Reusable Agentic Skill Formalization | Formalize agent procedures as reusable, self-contained skill modules with explicit conditions for when and how to apply them. | Ad-hoc planning where agents re-derive execution strategies from scratch for every recurring task | SoK (2026), OAgents (2025)
Predictive Memory Pre-fetching | Use idle time during the current conversational turn to speculatively retrieve and cache documents the agent will likely need for future turns. | Standard synchronous RAG retrieval that blocks response generation with 50–300ms lookup latency | VoiceAgentRAG (2026)
Domain-Structured Memory Systems | Structure agent memory to mirror domain-specific information organization rather than relying on generic embedding-based retrieval. | Generic vector-embedding memory that lacks domain-aware organization and auditability | When OpenClaw Meets Hospital: Toward... (2026), Social-RAG (2025), HeRo (2026)
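
The Predictive Memory Pre-fetching entry above can be made concrete with a small sketch. It is a single-threaded toy, whereas a real voice system would run the speculative fetch concurrently during idle time; the `PrefetchingRetriever` class, the `predict_topics` callback, and the in-memory corpus are all hypothetical names introduced for illustration and are not taken from VoiceAgentRAG.

```python
from typing import Callable


class PrefetchingRetriever:
    """Hypothetical sketch: serve from a speculative cache, fall back to slow retrieval."""

    def __init__(self, slow_retrieve: Callable[[str], list[str]],
                 predict_topics: Callable[[str], list[str]]) -> None:
        self.slow_retrieve = slow_retrieve
        self.predict_topics = predict_topics
        self.cache: dict[str, list[str]] = {}
        self.hits = 0
        self.misses = 0

    def prefetch(self, current_turn: str) -> None:
        # Called during idle time: fetch documents for topics predicted to come next.
        for topic in self.predict_topics(current_turn):
            if topic not in self.cache:
                self.cache[topic] = self.slow_retrieve(topic)

    def retrieve(self, topic: str) -> list[str]:
        if topic in self.cache:
            self.hits += 1
            return self.cache[topic]
        # Cache miss: pay the full synchronous retrieval latency.
        self.misses += 1
        docs = self.slow_retrieve(topic)
        self.cache[topic] = docs
        return docs


corpus = {"pricing": ["plan table"], "refunds": ["refund policy"],
          "shipping": ["carrier list"]}
retriever = PrefetchingRetriever(
    slow_retrieve=lambda topic: corpus.get(topic, []),
    predict_topics=lambda turn: ["refunds"] if "pricing" in turn else [],
)

retriever.prefetch("user asks about pricing")
cached_docs = retriever.retrieve("refunds")     # predicted topic: cache hit
fallback_docs = retriever.retrieve("shipping")  # unpredicted topic: cache miss
```

The hit and miss counters make the trade-off visible: the speedup applies only to turns whose topic was predicted correctly, and every misprediction costs one wasted fetch.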

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
GAIA | Pass@3 accuracy | 73.9% | AGENTKB (2025)
Tool-Use Correctness (ToolBench) | Correctness percentage | 76.8% | LLMs (2024)
SkillsBench | Pass rate improvement (percentage points) | +16.2pp pass rate | SoK (2026)

⚠️ Known Limitations (5)

  • Self-generated skills and memories can encode incorrect heuristics, degrading performance rather than improving it. This matters because autonomous memory accumulation without quality control can compound errors over time. (affects: Reusable Agentic Skill Formalization, Cross-Framework Experience Replay)
    Potential fix: Human curation of skill libraries, automated validation of stored experiences against ground truth, and disagreement gates that filter conflicting retrieved knowledge.
  • Memory retrieval can inject stale or conflicting information into an agent's planning loop, especially when experiences come from different domains or framework versions. This can cause the agent to pursue outdated strategies. (affects: Cross-Framework Experience Replay, Simulated Trial and Error (STE))
    Potential fix: AGENTKB's disagreement gate filters conflicting knowledge, but general solutions for memory staleness detection and expiration remain underexplored.
  • Security and auditability concerns arise when agents have broad memory access in sensitive domains like healthcare. Unrestricted memory retrieval could leak private information or lead to unauthorized actions. (affects: Domain-Structured Memory Systems, Cross-Framework Experience Replay)
    Potential fix: AOSH enforces least-privilege execution with restricted Linux namespaces and audit trails via document-mutation coordination, but this approach is domain-specific and not yet generalized.
  • Predictive pre-fetching relies on accurate topic prediction; cache misses fall back to full retrieval latency, and prediction errors waste compute on irrelevant documents. (affects: Predictive Memory Pre-fetching)
    Potential fix: Improving prediction models with richer conversational context and maintaining hybrid retrieval strategies that gracefully degrade on cache misses.
  • Most memory-augmented planning systems are evaluated on specific benchmarks (GAIA, ToolBench) and lack evidence of generalization to truly open-ended, long-horizon real-world tasks. (affects: Cross-Framework Experience Replay, Simulated Trial and Error (STE), Reusable Agentic Skill Formalization)
    Potential fix: Developing more diverse, long-horizon evaluation benchmarks and testing memory systems in production deployments over extended time periods.
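
Two of the mitigations listed above, expiring stale memories and gating conflicting ones, can be sketched together. This is an illustration under stated assumptions, not AGENTKB's disagreement gate: the `MemoryItem` fields, the 90-day threshold, and the rule of discarding every item for a key whose surviving advice disagrees are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    key: str        # problem signature, e.g. "dep-conflict"
    advice: str     # recommended action recorded with the experience
    age_days: int   # time since the experience was recorded


def filter_memories(items: list[MemoryItem], max_age_days: int = 90) -> list[MemoryItem]:
    """Drop expired items, then drop every key whose surviving items disagree."""
    fresh = [m for m in items if m.age_days <= max_age_days]
    advice_by_key: dict[str, set[str]] = {}
    for m in fresh:
        advice_by_key.setdefault(m.key, set()).add(m.advice)
    return [m for m in fresh if len(advice_by_key[m.key]) == 1]


items = [
    MemoryItem("dep-conflict", "pin package A to 1.x", 10),
    MemoryItem("dep-conflict", "pin package A to 1.x", 400),   # stale, expired
    MemoryItem("flaky-test", "retry the job", 5),
    MemoryItem("flaky-test", "quarantine the test", 7),        # conflicts with the above
]
kept = filter_memories(items)
```

Discarding all conflicting items is deliberately conservative; a real gate might instead rank them by recency or source reliability.
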

💡 As individual agents become more capable planners through memory, the frontier extends to Multi-Agent Shared Memory—architectures and protocols that enable teams of agents to pool their knowledge, coordinate through a common memory layer, and collectively outperform what any single agent can achieve.

🔍

Multi-Agent Shared Memory

What: Multi-agent shared memory encompasses the architectures, protocols, and systems that allow multiple LLM-based agents to store, retrieve, and coordinate through a common knowledge layer, enabling collective intelligence beyond what any single agent can achieve.

Why: As LLM agents move from solo tools to collaborative teams, they need structured ways to share context, avoid conflicting actions, and accumulate experience—without such memory systems, multi-agent collaboration degrades into redundant, inconsistent work.

Baseline: The conventional approach gives each agent its own isolated context window or retrieval store, requiring explicit message-passing for every piece of shared information and offering no persistent cross-agent memory.

  • Memory consistency: ensuring all agents see up-to-date, non-contradictory information when reading and writing concurrently
  • Cross-framework interoperability: transferring experience and knowledge between agents built on different architectures and frameworks
  • Access control and security: enforcing fine-grained permissions so agents only read or modify data they are authorized to access
  • Scalable coordination: maintaining low latency and high throughput as the number of collaborating agents grows

🧪 Running Example

❓ A hospital deploys three specialized agents—a triage agent, a treatment-planning agent, and a discharge-summary agent—that must collaboratively manage a patient's evolving medical record over a multi-day stay.

Baseline: With isolated memory, the treatment-planning agent cannot see the triage agent's latest notes without an explicit handoff message; if the triage agent updates an allergy list, the treatment agent may propose a contraindicated drug because its context is stale.

Challenge: The patient's record is updated by all three agents concurrently: the triage agent logs new vitals, the treatment agent adds medication orders, and the discharge agent drafts summaries—all must remain consistent, auditable, and access-controlled.

✅ Three-Layer Memory Hierarchy: Organizes agent memory into I/O, cache, and persistent layers so the treatment agent's working memory is automatically refreshed from the shared persistent store, preventing stale reads.
✅ Document-Mutation Coordination (AOSH): Agents communicate solely by writing structured updates to clinical records, creating an automatic audit trail and ensuring every agent reads the latest version of the patient chart.
✅ Model Context Protocol (MCP): Provides a standardized interface so all three agents—potentially built on different LLM backends—can connect to the same patient data sources and tools without custom integration.
✅ Cross-Framework Knowledge Transfer (AgentKB): Allows past treatment-planning experiences from one hospital's agent system to be reused by a different hospital's framework, avoiding repeated diagnostic mistakes.
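
The document-mutation idea in the list above can be sketched as a shared record that agents change only through checked writes. This is a single-process toy, not the AOSH design: the `SharedRecord` class, the per-agent write permissions, and the whole-document version check (a form of optimistic concurrency) are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SharedRecord:
    """Hypothetical shared document with versioning, write permissions, and an audit log."""
    sections: dict[str, str] = field(default_factory=dict)
    version: int = 0
    write_acl: dict[str, set[str]] = field(default_factory=dict)   # agent -> writable sections
    audit: list[tuple[int, str, str]] = field(default_factory=list)

    def read(self) -> tuple[int, dict[str, str]]:
        # Readers always receive the latest committed version and a copy of the content.
        return self.version, dict(self.sections)

    def mutate(self, agent: str, section: str, text: str, seen_version: int) -> bool:
        if section not in self.write_acl.get(agent, set()):
            raise PermissionError(f"{agent} may not write {section}")
        if seen_version != self.version:
            return False  # the caller acted on a stale read and must re-read first
        self.sections[section] = text
        self.version += 1
        self.audit.append((self.version, agent, section))
        return True


record = SharedRecord(write_acl={"triage": {"allergies", "vitals"},
                                 "treatment": {"orders"}})
v0, _ = record.read()
triage_ok = record.mutate("triage", "allergies", "penicillin", seen_version=v0)
# The treatment agent still holds version v0, so its write is rejected as stale.
stale_write = record.mutate("treatment", "orders", "amoxicillin", seen_version=v0)
v1, chart = record.read()
```

Rejecting the stale write forces the treatment agent to re-read the chart, at which point the new allergy entry is visible before any order is placed.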

📈 Overall Progress

Multi-agent memory has evolved from ad-hoc isolated stores to formally structured, protocol-driven shared memory systems with consistency guarantees.

📂 Sub-topics

Memory Architecture and Consistency

2 papers

Frameworks that define how multi-agent memory is structured, layered, and kept consistent, drawing on principles from computer architecture such as cache hierarchies and coherence protocols.

Three-Layer Memory Hierarchy · Document-Mutation Coordination

Cross-Agent Context Protocols and Knowledge Transfer

2 papers

Standardized protocols and memory layers that enable diverse agents—potentially built on different frameworks—to share context, transfer experience, and build collective intelligence.

Model Context Protocol · Universal Cross-Framework Memory Layer

💡 Key Insights

💡 Memory consistency across agents is the most critical unsolved challenge, analogous to cache coherence in multiprocessor hardware.

💡 Cross-framework knowledge transfer yields large gains (+18.7pp on GAIA), proving collective memory outperforms isolated experience.

💡 Standardized context protocols eliminate brittle point-to-point integrations, enabling plug-and-play multi-agent collaboration.

💡 Safety-critical domains demand page-indexed, audit-trailed memory with least-privilege access rather than flat vector stores.

💡 Hardware memory hierarchy concepts (I/O, cache, persistent store) transfer effectively to organizing LLM agent memory.


📅 Timeline

Early work (2025) focused on standardizing how agents share context through universal protocols and cross-framework knowledge bases. By 2026, the field shifted toward formalizing memory architectures with hardware-inspired hierarchies and deploying shared memory in safety-critical domains with strict access control.

2025-04 to 2025-07 Establishing standardized protocols and cross-framework memory for multi-agent collaboration
  • (MCP, 2025) introduced a standardized context-sharing protocol acting as 'USB-C for context,' decoupling memory management from individual agent logic
  • (AgentKB, 2025) created a universal cross-framework memory layer, achieving +18.7pp improvement on GAIA (55.2% → 73.9%) over framework-isolated baselines
2026-03 to 2026-03 Formalizing memory architectures and deploying shared memory in safety-critical domains
  • (CA-Memory, 2026) formalized multi-agent memory as a three-layer hierarchy (I/O, Cache, Memory) and identified consistency as the most pressing open challenge
  • (AOSH, 2026) deployed page-indexed memory with document-mutation coordination in a hospital agentic OS, enforcing least-privilege execution for clinical safety

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Three-Layer Memory Hierarchy | Apply hardware-inspired cache hierarchies and coherence protocols to give multi-agent systems structured, consistent shared memory. | Ad-hoc shared context stores that lack formal consistency guarantees and treat all memory accesses uniformly regardless of latency requirements | Multi-Agent (2026)
Document-Mutation Coordination | Agents share state by mutating structured documents rather than passing messages, combining navigable page-indexed memory with strict access-control isolation. | General-purpose agent frameworks that use flat vector stores and lack the security, auditability, and longitudinal memory required for safety-critical domains like healthcare | When OpenClaw Meets Hospital: Toward... (2026)
Model Context Protocol | A universal plug-and-play protocol that standardizes how agents access shared context, eliminating custom integrations between heterogeneous agent architectures. | Bespoke point-to-point integrations where each agent-to-agent or agent-to-tool connection requires custom code, making systems fragile and hard to scale | Advancing Multi-Agent Systems Through Model... (2025)
Universal Cross-Framework Memory Layer | Unify experience from multiple incompatible agent frameworks into a shared, framework-agnostic memory so agents never rediscover known solutions or repeat known mistakes. | Framework-specific memory systems that trap knowledge within individual agent architectures, preventing cross-system learning and collective intelligence | AGENTKB (2025)
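
As a rough illustration of the Three-Layer Memory Hierarchy entry above, the sketch below models an I/O buffer, a per-agent cache, and a shared persistent store with a version-based refresh. The `LayeredMemory` class and the refresh rule are assumptions made for the example; the source describes the layering but not this mechanism, and a real coherence protocol would more likely use invalidation messages than a version check on every read.

```python
class LayeredMemory:
    """Hypothetical hierarchy: I/O buffer -> per-agent cache -> shared persistent store."""

    def __init__(self, persistent: dict[str, tuple[int, str]]) -> None:
        self.persistent = persistent                  # shared: key -> (version, value)
        self.cache: dict[str, tuple[int, str]] = {}   # private to one agent
        self.io_buffer: list[str] = []                # values handed to the model this turn
        self.refreshes = 0

    def load(self, key: str) -> str:
        # Assumption: comparing version numbers is cheap; copying the value is the costly part.
        version, value = self.persistent[key]
        cached = self.cache.get(key)
        if cached is None or cached[0] < version:
            self.cache[key] = (version, value)
            self.refreshes += 1
        result = self.cache[key][1]
        self.io_buffer.append(result)
        return result


store = {"allergies": (1, "none")}
treatment_agent = LayeredMemory(store)

first = treatment_agent.load("allergies")
repeat = treatment_agent.load("allergies")   # served from cache, no refresh
store["allergies"] = (2, "penicillin")       # another agent updates the shared store
after_update = treatment_agent.load("allergies")
```

The refresh counter shows that the cache is repopulated only when the shared store holds a newer version, which is the behavior that prevents the stale read in the running example.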

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
GAIA | pass@3 | 73.9% | AGENTKB (2025)
SWE-bench Lite | pass@100 | 45.7% | AGENTKB (2025)
Humanity's Last Exam (Bio/Chem) | pass@3 | 14.1% | AGENTKB (2025)

⚠️ Known Limitations (4)

  • Memory consistency remains unsolved: no existing system guarantees that concurrent agent reads and writes produce conflict-free, up-to-date results, which can cause agents to act on stale or contradictory information. (affects: Three-Layer Memory Hierarchy, Document-Mutation Coordination (AOSH))
    Potential fix: Adapting formal cache coherence protocols (e.g., MESI) from hardware to agent memory, with conflict detection and resolution mechanisms.
  • Evaluation is largely qualitative or domain-specific: most papers either provide no quantitative evaluation or test only in narrow domains, making it difficult to compare approaches or assess generalization. (affects: Three-Layer Memory Hierarchy, Model Context Protocol, Document-Mutation Coordination (AOSH))
    Potential fix: Developing standardized multi-agent memory benchmarks that test consistency, latency, and coordination quality across diverse scenarios.
  • Scalability to large agent populations is untested: current systems demonstrate coordination among small numbers of agents, leaving open questions about performance degradation as agent counts grow to dozens or hundreds. (affects: Model Context Protocol, Universal Cross-Framework Memory Layer (AgentKB), Document-Mutation Coordination (AOSH))
    Potential fix: Borrowing distributed-systems techniques such as sharding, replication, and eventual consistency to handle larger agent populations.
  • Security and access control add latency and complexity: enforcing least-privilege execution and audit trails in shared memory systems introduces overhead that may conflict with the low-latency requirements of real-time agent collaboration. (affects: Document-Mutation Coordination (AOSH), Three-Layer Memory Hierarchy)
    Potential fix: Lightweight capability-based access control models that enforce permissions with minimal runtime overhead.

💡 With memory systems spanning from individual agent architectures to multi-agent shared knowledge, Agent Memory Evaluation develops the benchmarks and metrics needed to assess whether agents can truly leverage accumulated memory to guide decisions across multi-session, interdependent tasks.

📋

Agent Memory Evaluation

What: Agent Memory Evaluation covers benchmarks, evaluation frameworks, and metrics designed to assess how effectively AI agents acquire, retain, and use memory to guide future decisions across multi-session interactions.

Why: Existing benchmarks test memorization and action in isolation, failing to capture whether agents can actively leverage accumulated experience to solve progressively complex tasks—a critical capability for real-world deployment.

Baseline: Conventional evaluation either measures static recall accuracy (e.g., QA over past conversations) without requiring action, or tests single-session agent performance where long-term memory is unnecessary.

  • Coupling memorization with action: evaluating whether recalled information actually improves downstream task completion, not just retrieval accuracy
  • Designing interdependent multi-session tasks where later sessions are underspecified without memory from earlier sessions
  • Scaling evaluation to long horizons (50+ action steps, 40k+ token traces) that stress both memory capacity and reasoning

🧪 Running Example

❓ An agent is asked to book a hotel in a new city, but the preferred hotel chain, loyalty number, and room preferences were only mentioned across three prior shopping and travel sessions weeks ago.

Baseline: A baseline agent without structured memory evaluation treats each session independently, failing to recall the user's loyalty program or room preferences, and either asks redundant questions or books a suboptimal hotel.

Challenge: The booking task is deliberately underspecified—critical constraints (loyalty chain, bed type, floor preference) were established in earlier sessions and must be distilled from accumulated experience rather than explicitly restated.

✅ MemoryArena Evaluation Framework: MemoryArena structures evaluation as interdependent subtasks across sessions, measuring whether the agent correctly recalls and applies the loyalty number, chain preference, and room type from prior sessions to complete the booking, using task completion rate rather than recall accuracy as the metric.
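
The difference between recall accuracy and task completion rate can be shown with a small scoring sketch. It is not MemoryArena's scoring code: the `Episode` fields and the rule that a task counts as complete only when every required constraint is applied are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical log of one multi-session task."""
    required_facts: set[str]   # constraints stated only in earlier sessions
    recalled_facts: set[str]   # what the agent retrieved when probed
    applied_facts: set[str]    # constraints the final action actually satisfied


def recall_accuracy(episodes: list[Episode]) -> float:
    total = sum(len(e.required_facts) for e in episodes)
    hit = sum(len(e.required_facts & e.recalled_facts) for e in episodes)
    return hit / total


def task_completion_rate(episodes: list[Episode]) -> float:
    # A task is complete only if every required constraint was applied in the action.
    done = sum(e.required_facts <= e.applied_facts for e in episodes)
    return done / len(episodes)


facts = {"chain", "loyalty_number", "bed_type"}
episodes = [
    Episode(facts, facts, facts),                   # recalled and applied everything
    Episode(facts, facts, {"chain", "bed_type"}),   # recalled all, forgot to apply one
]
recall = recall_accuracy(episodes)
completion = task_completion_rate(episodes)
```

In this toy log the agent recalls every fact in both episodes yet completes only one booking, which is the kind of gap an action-coupled metric is meant to expose.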

📈 Overall Progress

Evaluation of agent memory is shifting from passive recall accuracy to action-coupled task completion across interdependent multi-session settings.

💡 Key Insights

💡 Agents with near-perfect static memory recall perform poorly when memory must drive multi-session action.

💡 Evaluation must couple memorization with downstream task completion to reveal true memory capability gaps.

💡 Long-horizon tasks (averaging 57 action steps and 40k+ token traces) expose failures in maintaining latent task states across sessions.

💡 Four diverse domains (shopping, travel, search, reasoning) are needed to stress-test different memory usage patterns.


📅 Timeline

Early work in this area reveals a critical gap: agents that excel at static memory benchmarks struggle when memory must actively guide decisions in progressive, multi-session tasks, motivating a new generation of evaluation frameworks.

2026-02 to 2026-02 Emergence of action-coupled memory evaluation benchmarks
  • (MemoryArena, 2026) introduced a benchmark evaluating agent memory through interdependent multi-session tasks across four domains, revealing that agents with near-saturated static memory scores perform poorly when memory must guide action

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Memory-Agent-Environment Loop Evaluation | Evaluate memory through progressive, interdependent tasks where correct recall is a prerequisite for successful action, not a standalone metric. | Static memory benchmarks that test recall in isolation (e.g., QA over past conversations) and single-session agent benchmarks that do not require long-term memory | Benchmarking Agent Memory in Interdependent... (2026)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
MemoryArena | Task Completion Rate | Low task completion rates | Benchmarking Agent Memory in Interdependent... (2026)

⚠️ Known Limitations (3)

  • Limited coverage of memory evaluation approaches: with only one benchmark paper, the landscape of evaluation methodologies remains underexplored, making it difficult to establish consensus on best practices for memory assessment. (affects: Memory-Agent-Environment Loop Evaluation)
    Potential fix: Development of complementary benchmarks targeting different memory types (episodic, semantic, procedural) and interaction patterns.
  • Scalability of multi-session evaluation: tasks averaging 57 action steps and 40k+ token traces are computationally expensive to run, potentially limiting widespread adoption and iteration speed. (affects: Memory-Agent-Environment Loop Evaluation)
    Potential fix: Developing tiered evaluation suites with lightweight proxy tasks for rapid iteration alongside full-scale benchmarks for comprehensive assessment.
  • Domain coverage: current evaluation spans four domains (shopping, travel, search, reasoning), but real-world agents operate in many more settings including coding, scientific research, and social interaction. (affects: Memory-Agent-Environment Loop Evaluation)
    Potential fix: Extending the benchmark framework to additional domains and allowing community contributions of new task environments.

💡 Beyond the core categories of organization, recall, and agentic memory, a rich collection of Other Topics addresses cross-cutting concerns—from LLM personalization and memory security to hardware in-memory computing and theoretical foundations—that collectively shape how memory is implemented and optimized across the full AI stack.

📦

Other Topics

What: This topic encompasses papers that do not fit into the main memory categories but contribute to the broader memory landscape, spanning LLM personalization, efficient inference and training, continual learning, spatial memory for vision, hardware in-memory computing, memory security, and theoretical foundations of memory in neural systems.

Why: These diverse contributions collectively shape how memory is understood, implemented, and optimized across the AI stack—from silicon hardware to high-level agent cognition—filling critical gaps that no single core memory category addresses.

Baseline: Baseline approaches typically treat memory as monolithic: LLMs process full contexts indiscriminately, training uses standard backpropagation with full optimizer states, and hardware relies on the Von Neumann architecture with separate memory and compute units.

  • Scaling memory efficiently: KV caches grow linearly with sequence length, optimizer states consume 2-3x model size, and hardware bandwidth lags behind compute scaling
  • Balancing personalization with factual reliability: incorporating user history risks entangling preferences with facts, causing hallucinations aligned with user biases rather than truth
  • Preventing catastrophic forgetting in sequential learning while maintaining plasticity for new tasks within bounded compute and memory resources
  • Securing persistent memory against adversarial manipulation, where poisoned memories can trigger unauthorized actions in future sessions
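
The linear growth of the KV cache mentioned in the first item can be checked with a short calculation. The formula below is the standard one for a decoder that stores one key and one value vector per layer, per KV head, per token; the model shape used (32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption for a 7B-class model with full multi-head attention, not a figure from the source.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch: int = 1) -> int:
    """Size of the KV cache in bytes for one decoder-only Transformer."""
    # The factor 2 covers the separate key and value tensors stored in each layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


# Assumed, illustrative model shape (not taken from the source).
shape = dict(layers=32, kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(1, **shape)    # 524,288 bytes, i.e. 0.5 MiB per token
at_4k = kv_cache_bytes(4096, **shape)     # 2 GiB at a 4,096-token context
at_32k = kv_cache_bytes(32768, **shape)   # 8x the 4k figure: growth is linear

print(per_token, at_4k / 2**30, at_32k / 2**30)
```

Under these assumptions a single 32k-token conversation needs 16 GiB for the cache alone, which is why eviction and compression methods target this structure.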

🧪 Running Example

❓ A user asks their personalized AI assistant: 'Plan a birthday dinner for my partner.' The assistant has months of conversation history including dietary preferences, restaurant visits, and budget discussions.

Baseline: A standard LLM either ignores the history entirely (generating a generic restaurant list) or naively retrieves all past mentions of food, flooding the context with irrelevant details and potentially exceeding the context window. If the user once mentioned disliking sushi in a joke, the system may incorrectly exclude all Japanese restaurants.

Challenge: The assistant must selectively retrieve relevant preferences (partner's dietary restrictions, budget, location), distinguish genuine preferences from casual mentions, handle evolved preferences (the partner recently became vegetarian), and do all this within a bounded KV cache without degrading response quality.

✅ Rational Personalization (RPEval/RP-Reasoner): Uses Bayesian pragmatic reasoning to determine which memories are actually relevant to the current query, filtering out irrelevant preferences and reducing hallucination risk by 35%.
✅ LookaheadKV Cache Eviction: Compresses the KV cache by predicting which tokens the future response will actually attend to, reducing memory by 80% while maintaining quality for long conversation histories.
✅ GaLore (Memory-Efficient Training): Enables fine-tuning the assistant on user-specific data on a single consumer GPU by projecting gradients into low-rank subspaces, reducing optimizer memory by up to 65.5%.

📈 Overall Progress

The field has evolved from treating memory as a passive storage layer to actively engineering it as a first-class system component—with theoretical guarantees, hardware co-design, and security considerations.

📂 Sub-topics

LLM Personalization

20 papers

Methods for tailoring LLM outputs to individual users via retrieval, embedding injection, reinforcement learning from interaction, and causal preference modeling.

Retrieval-Augmented Personalization · Embedding-Based Persona Injection · RL for Personalized Alignment · Difference-Aware User Modeling

Efficient LLM Inference

22 papers

Techniques for reducing inference cost including KV cache compression/eviction, structured pruning, speculative decoding, and sparse attention optimization.

KV Cache Eviction · Structured Pruning · Speculative Decoding · Sparse Attention

Memory-Efficient Training

5 papers

Methods to reduce GPU memory consumption during LLM pre-training and fine-tuning, including gradient projection, layerwise sampling, and zeroth-order optimization.

Gradient Low-Rank Projection · Layerwise Importance Sampling · Zeroth-Order Optimization

Continual Learning & Forgetting Prevention

8 papers

Approaches to enable models to learn from sequential data streams without catastrophically forgetting previous knowledge, including information-theoretic frameworks, gated adaptation, and causal feature expansion.

Context Channel Capacity · Gated Modulation on Frozen Backbones · Representation Fine-tuning

Spatial Memory for Vision & 3D

7 papers

External memory architectures for maintaining 3D spatial consistency in video generation, world simulation, and robotic manipulation, inspired by biological working and episodic memory.

Geometry-Indexed Memory · Explicit Spatial Pointer Memory · Perceptual-Cognitive Memory Bank

Hardware Memory & In-Memory Computing

10 papers

Physical memory technologies (memristors, phase-change memory, Processing-in-Memory) and analysis of the memory wall bottleneck for AI workloads.

Processing-In-Memory · Phase-Change Memory Logic · 2D Material Memristors

Memory Evaluation & Benchmarks

7 papers

Benchmarks and evaluation frameworks for assessing memory capabilities of LLM agents, including factual recall, cognitive memory, preference tracking, and multi-turn consistency.

Incremental Multi-Turn Evaluation · Constraint-Consistency Evaluation · Multi-Level Memory Assessment

Memory Security & Adversarial Attacks

3 papers

Vulnerabilities in memory-augmented systems including poisoning attacks on RAG knowledge bases, hidden state corruption in SSMs, and context manipulation in agent memory.

Embedding Space Manipulation · Hidden State Poisoning Defense · Context Manipulation via Memory Injection

Theoretical Foundations of Memory

8 papers

Fundamental theories connecting memory to attention mechanisms, position encoding, biological neural circuits, and information-theoretic principles.

Feed-Forward as Key-Value Memory · Geometric Position Bias Theory · Biological Key-Value Memory

💡 Key Insights

💡 Memory bandwidth, not compute, is the primary bottleneck for modern AI: bandwidth scales at 1.6x every two years versus 3.0x for FLOPS.

💡 Frontier models achieve only ~50% on personalization tasks requiring evolving user tracking, barely above chance.

💡 Catastrophic forgetting has a provable information-theoretic bound: context channel capacity must exceed task entropy.

💡 Persistent agent memory is a critical attack surface—poisoned memories achieve >80% attack success on frontier models.

💡 Gradient low-rank projection enables 7B model pre-training on a single 24GB consumer GPU without sacrificing quality.

💡 The 'Lost in the Middle' attention bias is a geometric property at initialization, not a learned artifact of training.
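
The memory-wall insight above implies a compounding gap, which a few lines of arithmetic make explicit. The two scaling rates come from the text; the ten-year horizon and the assumption that both trends continue unchanged are illustrative only.

```python
def growth(rate_per_two_years: float, years: float) -> float:
    """Cumulative growth factor after `years` at a fixed rate per two-year period."""
    return rate_per_two_years ** (years / 2)


def compute_to_bandwidth_gap(years: float, flops_rate: float = 3.0,
                             bandwidth_rate: float = 1.6) -> float:
    """How much faster peak compute has grown than memory bandwidth after `years`."""
    return growth(flops_rate, years) / growth(bandwidth_rate, years)


gap_2y = compute_to_bandwidth_gap(2)     # 3.0 / 1.6 = 1.875x per two-year period
gap_10y = compute_to_bandwidth_gap(10)   # 1.875 ** 5, roughly 23x over a decade

print(f"{gap_2y:.3f}x after 2 years, {gap_10y:.1f}x after 10 years")
```

Under this simple extrapolation, compute outpaces bandwidth by roughly 23x per decade, so workloads limited by data movement benefit less and less from added FLOPS.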


📅 Timeline

Research has progressed from foundational theory (FFN-as-memory, memory wall analysis) through practical memory-efficient training breakthroughs (GaLore, LISA) to sophisticated memory-aware systems for personalization, spatial reasoning, and agent security, with 2026 bringing formal information-theoretic frameworks that unify previously disparate empirical findings.

2021-11 to 2023-11 Foundational theories connecting memory to neural architectures and early personalization
  • (FFN-as-KV-Memory, 2021) reinterpreted Transformer feed-forward layers as key-value memories, revealing that lower layers detect shallow patterns while upper layers encode semantic concepts
  • (RowHammer, 2023) documented a decade of DRAM vulnerability research showing >80% of commodity DRAM modules are susceptible to read-disturbance bitflips
  • (BB-LDPC, 2023) achieved quantum error correction protecting 12 logical qubits with only 288 physical qubits, a >10x overhead reduction over surface codes
  • (DBE, 2023) proposed decoupling global and client-specific representations in federated learning, improving accuracy by up to 32.3%
2024-01 to 2024-12 Memory-efficient training breakthroughs and the emergence of memory-augmented agents
  • (GaLore, 2024) enabled pre-training LLaMA 7B on a single 24GB consumer GPU by projecting gradients into low-rank subspaces, reducing optimizer memory by 65.5%
  • (LISA, 2024) outperformed LoRA by 11-38% on MT-Bench by randomly unfreezing layer subsets, achieving full-parameter quality at LoRA-level memory cost
  • (AgentPoison, 2024) demonstrated that RAG-based agent memory can be poisoned via embedding space manipulation, achieving >80% attack success with <0.1% poison rate
  • (MemoryWall, 2024) quantified the widening gap between compute scaling (3.0x/2yrs) and memory bandwidth scaling (1.6x/2yrs), establishing memory as the primary AI bottleneck
2025-01 to 2025-12 Personalization matures, spatial memory emerges, and memory evaluation frameworks are established
  • (RLPA, 2025) formulated personalization as a multi-turn MDP with simulated users, outperforming SFT by 29 points and surpassing GPT-4o on generalization benchmarks
  • Point3R (Point3R, 2025) introduced explicit spatial pointer memory with 3D-extended RoPE for streaming reconstruction, generalizing across 14 diverse datasets
  • (MemoryVLA, 2025) added perceptual-cognitive memory to robotic VLA models with biological consolidation, achieving 26% improvement over CogACT on long-horizon tasks
  • (MemSurvey, 2025) proposed a unified four-type memory taxonomy and layered evaluation framework, identifying systemic biases in automated memory evaluation
  • (FakeMemories, 2025) demonstrated that memory injection attacks achieve >80% success rates on GPT-4o and Claude in Web3 agent scenarios
2026-01 to 2026-03 Theoretical unification, fused-kernel efficiency, and rational personalization
  • (CCC, 2026) proved an Impossibility Triangle for continual learning and showed HyperNetworks achieve near-zero forgetting via high context capacity
  • (LostMiddle, 2026) proved the U-shaped attention bias exists at initialization (before training), caused by iterated Cesàro matrix geometry rather than positional encodings
  • (LongFlow, 2026) fused KV cache eviction directly into FlashAttention kernels, achieving 11.8x throughput with 80% cache reduction for reasoning models
  • (RP-Reasoner, 2026) used Bayesian pragmatic reasoning to filter irrelevant memories, improving accuracy by 35% and resolving 80% of bad cases in production
  • El Agente Gráfico (ElAgente, 2026) embedded LLM decision-making in type-safe execution graphs with Knowledge Graph persistence, reducing scientific agent costs by 96%

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
KV Cache Compression & Eviction | Predict which cached tokens the model will actually need for future generation, and evict the rest before or during decoding. | Full KV cache retention, which grows linearly with sequence length and dominates GPU memory during inference | LookaheadKV (2026), LongFlow (2026), InfLLM (2024)
Memory-Efficient Training via Gradient Projection | Gradients naturally become low-rank during training; projecting them into this subspace before the optimizer step reduces state memory without restricting the model's learning capacity. | Standard AdamW optimizer states (which consume 2-3x model size) and LoRA (which restricts parameters to a low-rank subspace) | GaLore (2024), LISA (2024)
Personalization via Retrieval-Augmented User Modeling | Treat user history as a queryable knowledge base, but compress and filter it intelligently so only preference-relevant context reaches the model. | Generic LLM outputs that ignore user preferences, and naive full-history concatenation that exceeds context limits | Persona-DB (2024), How Does Personalized Memory Shape... (2026), Integrating Summarization and Retrieval for... (2023)
Speculative Decoding Optimization | Replace sequential token-by-token generation with parallel draft-then-verify cycles, using the model's own structure as the drafter to avoid auxiliary model overhead. | Standard autoregressive decoding (one token per forward pass) and traditional speculative decoding (requires a separate trained draft model) | Speculative Streaming (2025), DynaSpec (2025), PLD+: Accelerating LLM inference by... (2024)
Spatial Memory for Consistent World Generation | Store an explicit 3D point cloud or geometry-indexed memory bank that serves as a persistent spatial reference, retrievable by current camera pose rather than appearance similarity. | Autoregressive video models with limited temporal context windows that forget previously generated scenes upon revisiting | Point3R (2025), WorldMem (2025), Memory Forcing (2025)
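
The gradient-projection idea listed among the methods above can be illustrated with a small NumPy sketch. It is not GaLore's implementation: the gradient is synthetic, the function name and sizes are hypothetical, and the state accounting is a rough assumption. The actual method also runs the optimizer in the projected space and refreshes the projector periodically, both omitted here.

```python
import numpy as np


def low_rank_projector(grad: np.ndarray, rank: int) -> np.ndarray:
    """Return the top-`rank` left singular vectors of the gradient."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]


rng = np.random.default_rng(0)
m, n, rank = 64, 256, 4

# Synthetic gradient that is approximately rank-4 plus small noise.
grad = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))
grad += 0.01 * rng.normal(size=(m, n))

projector = low_rank_projector(grad, rank)   # shape (m, rank)
compact = projector.T @ grad                 # shape (rank, n): what the optimizer would track
restored = projector @ compact               # shape (m, n): projected back for the weight update

rel_error = np.linalg.norm(grad - restored) / np.linalg.norm(grad)

# Rough state accounting (an assumption): two Adam moments per tracked entry,
# plus the projector itself in the projected case.
full_state = 2 * m * n
projected_state = 2 * rank * n + m * rank

print(f"relative reconstruction error: {rel_error:.4f}")
print(f"optimizer state entries: {full_state} full vs {projected_state} projected")
```

Because the full-size weights are still updated, the model is not confined to a low-rank parameterization the way LoRA adapters are; only the optimizer's bookkeeping shrinks.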

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
PersonaMem (Dynamic User Profiling) | Multiple-Choice Accuracy | ~50% | Know Me, Respond to Me (2025)
MT-Bench (LLM Quality after Fine-tuning) | MT-Bench Score | 11-38% improvement over LoRA | LISA (2024)
∞-Bench (Long-Context Capability) | Average Score | 22.82% | InfLLM (2024)

⚠️ Known Limitations (5)

  • Personalization-factuality tension: incorporating user history often causes models to validate user misconceptions rather than stating objective truths, degrading factual reliability (affects: Retrieval-Augmented Personalization, Embedding-Based Persona Injection)
    Potential fix: Factuality-Preserving Personalized Steering (FPPS) uses lightweight probes to detect entanglement and applies adaptive hidden-state steering to restore factuality when needed
  • Memory security vulnerability: persistent memory modules in agents are unprotected attack surfaces where adversaries can plant dormant instructions that trigger unauthorized actions in future sessions (affects: Retrieval-Augmented Personalization, Memory-Augmented Agents)
    Potential fix: Fine-tuning-based defenses significantly reduce attack success (from ~85% to <10%) while preserving utility; activation fingerprinting (Clasp) can detect poisoned tokens via internal activation patterns
  • Evaluation fragmentation: memory benchmarks conflate retrieval quality with generation faithfulness, and automated judges suffer from position/order/self-preference biases that produce spurious significance (affects: Memory Evaluation Frameworks)
    Potential fix: The unified memory quadruple taxonomy and three-setting parallel evaluation protocol help decouple internal capability from external information availability; constraint-consistency metrics avoid length bias
  • Hardware memory wall: the widening gap between compute scaling and memory bandwidth scaling means inference optimizations hit fundamental physical limits, particularly for decoder-only architectures (affects: KV Cache Compression & Eviction, Speculative Decoding Optimization)
    Potential fix: Processing-in-Memory (PIM) architectures such as UPMEM, and in-memory computing with memristive devices, perform computation where the data resides and could therefore ease the data-transfer bottleneck
  • Cognitive memory collapse: models recall explicit facts reasonably well (60-70% accuracy) but drop to 30-50% when they must apply implicit constraints or reason about evolved user states (affects: Memory Evaluation Frameworks, Retrieval-Augmented Personalization)
    Potential fix: Multi-turn RL-based alignment (RLPA) and causal preference modeling (NextQuill) show promise by training models to explicitly track and update user state representations

💡 Shifting from category-based analysis to cross-cutting themes, we begin with Long-context Memory Management, which tackles the infrastructure-level challenge of KV cache compression, context window extension, and efficient attention mechanisms that underpin all higher-level memory architectures.

🧩

Long-context Memory Management

What: This topic covers techniques for managing memory in large language models when processing long contexts, including KV cache compression and eviction, context window extension, efficient attention mechanisms, position encoding strategies, and agent memory systems.

Why: As LLMs are deployed in agentic systems, multi-turn conversations, and document-intensive tasks, the quadratic cost of attention and linear growth of the KV cache create severe computational and memory bottlenecks that limit throughput, context length, and deployment on resource-constrained devices.

Baseline: The standard approach stores all key-value pairs from every token in GPU memory and performs full self-attention over the entire history at each generation step, with contiguous memory allocation and fixed positional encodings (e.g., RoPE).

  • KV cache memory grows linearly with sequence length and batch size, quickly exhausting GPU memory for long contexts
  • Identifying which tokens are important for future generation is fundamentally difficult since the model cannot foresee upcoming queries
  • Compressing or evicting context risks losing critical information needed for downstream reasoning, especially in multi-hop tasks
  • Position encodings trained on short sequences fail to generalize to longer contexts, causing out-of-distribution attention patterns

🧪 Running Example

❓ A user has a 681-turn coding session with an AI assistant, accumulating tool definitions, code outputs, and debugging history. They ask: 'Can you refactor the authentication module we discussed 200 turns ago, using the pattern from the config file we reviewed 50 turns back?'

Baseline: A standard LLM would either exceed its context window and lose the early discussion entirely, or store all 681 turns in the KV cache, consuming massive GPU memory. In practice, 21.8% of stored tokens (tool schemas, stale outputs) are structural waste that degrades attention quality and increases latency quadratically.

Challenge: The model must selectively remember specific technical details from turn 481 and turn 631, while forgetting thousands of intermediate tool calls, error messages, and irrelevant code snippets—mimicking human working memory rather than a tape recorder.

✅ Demand Paging (Pichay): Treats the context window as an L1 cache, evicting cold content (stale tool outputs, unused schemas) to a backing store and leaving small retrieval handles in its place. Reduces context consumption by 93% while maintaining operation through fault-driven pinning.
✅ Self-Context Engineering (StateLM): Equips the model with a deleteContext tool, allowing it to actively distill information into notes and remove raw sources, maintaining a 'sawtooth' context profile that stays within budget while retaining key facts.
✅ KV Cache Eviction (LookaheadKV): Uses learnable lookahead tokens to predict which cached key-value pairs will be needed by future queries, evicting unimportant entries before decoding begins, reducing cache size with minimal accuracy loss.
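
The demand-paging approach in the list above can be sketched as a small data structure. This is an illustrative toy, not Pichay's implementation: the class and method names are hypothetical, eviction is plain least-recently-used, context size is measured in characters, and "fault-driven pinning" is interpreted here (as an assumption) as pinning any page that had to be faulted back in.

```python
from dataclasses import dataclass, field


@dataclass
class PagedContext:
    budget_chars: int
    resident: dict = field(default_factory=dict)       # page_id -> text kept in context
    backing_store: dict = field(default_factory=dict)  # page_id -> evicted text
    last_used: dict = field(default_factory=dict)
    pinned: set = field(default_factory=set)
    clock: int = 0

    def _touch(self, page_id):
        self.clock += 1
        self.last_used[page_id] = self.clock

    def _evict_if_needed(self):
        # Evict least-recently-used unpinned pages until the budget is met.
        while sum(len(t) for t in self.resident.values()) > self.budget_chars:
            candidates = [p for p in self.resident if p not in self.pinned]
            if not candidates:
                break
            victim = min(candidates, key=lambda p: self.last_used[p])
            self.backing_store[victim] = self.resident.pop(victim)

    def add(self, page_id, text):
        self.resident[page_id] = text
        self._touch(page_id)
        self._evict_if_needed()

    def read(self, page_id):
        if page_id not in self.resident:
            # Page fault: bring the page back and pin it (assumed policy).
            self.resident[page_id] = self.backing_store.pop(page_id)
            self.pinned.add(page_id)
            self._touch(page_id)
            self._evict_if_needed()
        self._touch(page_id)
        return self.resident[page_id]

    def render(self):
        # Evicted pages leave a short handle instead of their full text.
        handles = [f"[paged out: {p}]" for p in self.backing_store]
        return "\n".join(handles + list(self.resident.values()))


ctx = PagedContext(budget_chars=45)
ctx.add("auth", "auth: use JWT")
ctx.add("tool", "tool output: 200 OK")
ctx.add("config", "config: factory pattern")  # over budget, so "auth" is paged out
ctx.read("auth")                               # fault: "auth" returns, "tool" is paged out
print(ctx.render())
```

The handle left behind is what lets the model ask for a page again later; without it, evicted content would simply be forgotten.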

📈 Overall Progress

The field shifted from passive full-cache retention to intelligent, learned memory management where models actively decide what to remember, compress, or forget.

📂 Sub-topics

KV Cache Compression & Eviction

8 papers

Methods that reduce the size of the key-value cache by scoring token importance and selectively evicting or merging less important entries, enabling long-context inference within fixed memory budgets.

Importance-based eviction, Structure-aware chunking, Lookahead prediction, Dynamic memory compression
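
As a minimal sketch of importance-based eviction, the function below keeps a recent window of tokens plus the older tokens with the highest accumulated attention scores. It is a generic heavy-hitter-plus-recency heuristic, not the scoring rule of any specific paper in this sub-topic; the function name and the example scores are hypothetical.

```python
def select_tokens_to_keep(attn_scores, budget, recent_window):
    """Return sorted indices of cache entries to keep.

    attn_scores[i] is the accumulated attention mass received by token i.
    """
    if recent_window > budget:
        raise ValueError("recent_window must fit inside the budget")
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    recent_start = max(0, n - recent_window)
    recent = set(range(recent_start, n))
    # Rank the older tokens by score and fill the remaining budget with the best ones.
    older = sorted(range(recent_start), key=lambda i: attn_scores[i], reverse=True)
    heavy = older[: budget - len(recent)]
    return sorted(recent | set(heavy))


scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]
kept = select_tokens_to_keep(scores, budget=4, recent_window=2)
print(kept)
```

Scores of this kind come from past attention, so they only approximate what future decoding steps will need.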

Serving & Memory Management Infrastructure

6 papers

Systems-level approaches that apply operating system concepts (virtual memory, paging, demand loading) to manage KV cache allocation across distributed GPU and CPU resources.

PagedAttention, Virtual memory mapping, Demand paging, Elastic memory pools
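
The paging idea can be sketched as a block table that maps each sequence to fixed-size physical blocks allocated on demand. This is a simplified illustration under assumed names (`BlockAllocator`, `append_token`), not the vLLM API, and it leaves out block sharing across requests and swapping to CPU memory.

```python
class BlockAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one token; returns (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # Last block is full (or none exists yet): allocate a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def wasted_slots(self, seq_id):
        return len(self.block_tables[seq_id]) * self.block_size - self.seq_lens[seq_id]

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("request-a")
print(alloc.block_tables["request-a"], alloc.wasted_slots("request-a"))
```

In this scheme the unused space per sequence is at most `block_size - 1` slots, in contrast to reserving one contiguous region sized for the maximum possible length.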

Agent Memory & Context Management

8 papers

Approaches that equip LLM agents with active memory management capabilities, using reinforcement learning or learned policies to decide what to store, compress, or discard during long-horizon tasks.

RL-based memory overwrite, Self-context engineering, Summarization-augmented policy optimization, Adaptive omission
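
The control loop behind memory-overwrite agents can be sketched independently of how the policy is trained. In the toy below, the hypothetical `update` callable stands in for a learned policy (here a hand-written keyword filter, which is an assumption for illustration only); the point is that each step sees only the bounded memory and one chunk, so total cost grows linearly with the number of chunks.

```python
from typing import Callable, Iterable


def stream_with_overwrite(chunks: Iterable[str],
                          update: Callable[[str, str], str],
                          memory_limit: int) -> str:
    """Read chunks one at a time, rewriting a bounded memory after each."""
    memory = ""
    for chunk in chunks:
        memory = update(memory, chunk)
        if len(memory) > memory_limit:
            # Hard cap: keep only the most recent part of the memory.
            memory = memory[-memory_limit:]
    return memory


def keep_lines_mentioning(keyword: str) -> Callable[[str, str], str]:
    """Toy stand-in for a learned policy: retain lines containing `keyword`."""
    def update(memory: str, chunk: str) -> str:
        hits = [line for line in chunk.splitlines() if keyword in line]
        return "\n".join(part for part in [memory, *hits] if part)
    return update


chunks = ["noise\nauth uses JWT", "more noise", "auth tokens expire in 1h"]
final_memory = stream_with_overwrite(chunks, keep_lines_mentioning("auth"), memory_limit=200)
print(final_memory)
```

A trained policy would replace the keyword filter with a model call that decides what to keep, compress, or discard.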

Memory-Augmented Architectures

10 papers

Novel neural architectures that extend Transformers with explicit memory modules, including latent-space memory banks, hierarchical attention, associative memories, and external memory retrieval mechanisms.

Latent-space memory, Hierarchical autoregressive modeling, Associative memory networks, Block-level context memory

Position Encoding & Attention Optimization

5 papers

Techniques that improve how LLMs encode token positions and allocate attention, enabling better generalization to longer contexts and reducing distraction from irrelevant information.

Contextualized positional encoding, Token re-positioning, Retrieval head optimization, Entropy-aware parallel encoding

Personalization & Long-term User Memory

5 papers

Methods and benchmarks for tracking and leveraging evolving user preferences, traits, and personas across long conversation histories to deliver personalized responses.

Cognitive dual-memory, Implicit preference inference, Agentic memory with RL, Dynamic user profiling

Context Compression & Summarization

7 papers

Approaches that compress long contexts through summarization, soft token compression, visual rendering, or task-aware KV cache distillation to fit more information within limited context windows.

Task-aware KV compression, Selective soft compression, Visual code compression, Gist memory

Memory Frameworks, Taxonomies & Evaluation

5 papers

Surveys, taxonomies, and evaluation frameworks that formalize LLM memory types, define atomic operations, and provide benchmarks for measuring memory utilization capabilities.

Operational taxonomy, Memory quadruple framework, Programmable test generation

💡 Key Insights

💡 KV cache eviction is most effective when aligned with future decoding patterns, not just past attention scores.

💡 Models trained with RL to manage their own memory can extrapolate to context lengths 10-400x beyond training.

💡 OS concepts (paging, virtual memory, demand loading) translate remarkably well to LLM memory management.

💡 Position encoding matters more than semantic content for KV cache importance scoring during prefill.

💡 Frontier models achieve only ~50% accuracy on tracking evolving user preferences, revealing a major capability gap.

💡 Compressing context often improves performance by removing distracting information, not just saving memory.

📅 Timeline

Research evolved from systems-level innovations (PagedAttention, vAttention) through architectural memory augmentation (LM2, InfLLM) to the current frontier where RL-trained agents actively curate their own working memory, converging systems engineering with learned intelligence.

2023-05 to 2023-09 Foundational systems: OS-inspired KV cache management and early memory-augmented LLMs
  • PagedAttention (vLLM, 2023) introduced virtual memory paging for KV cache, reducing waste from 60-80% to under 4% and improving serving throughput 2-4x
  • (LongMem, 2023) pioneered decoupled memory architecture with a frozen backbone and trainable SideNet, achieving state-of-the-art on the ChapterBreak benchmark
  • (CMANPs, 2023) proposed constant-memory attention blocks using reformulated cross-attention as a rolling average operation
2024-01 to 2024-10 KV cache compression matures and memory-augmented architectures diversify
  • (DMC, 2024) taught models to dynamically compress their own KV cache via retrofitting, achieving 350-700% throughput gains on H100 GPUs
  • (InfLLM, 2024) demonstrated training-free extrapolation to 1M tokens by offloading KV blocks to CPU with representative token retrieval
  • (ReadAgent, 2024) introduced human-inspired gist memory that extended effective context by 3.5x-20x while surpassing full-context baselines
  • vAttention (vAttention, 2024) replaced PagedAttention with CUDA VMM-based demand allocation, improving throughput by up to 1.99x without kernel rewrites
  • (NAMMs, 2024) evolved neural memory managers that outperformed full-context Llama-3-8B by 11% on LongBench while reducing cache size
2025-01 to 2025-05 Architectural innovation in position encoding, memory models, and personalization benchmarks
  • (TAPE, 2025) introduced contextualized equivariant positional encoding that updates positions layer-by-layer, achieving state-of-the-art perplexity of 7.063 on PG-19 at 8K length
  • LM2 (LM2, 2025) added dual-stream gated memory to Transformers, outperforming RMT by 37.1% and improving MMLU by 5.0% over vanilla Llama-3.2
  • (RLMs, 2025) enabled symbolic recursion over prompts via a REPL environment, outperforming GPT-5 by 28.4% on long-context tasks
  • (RePo, 2025) learned content-aware token positions, improving RULER scores by +11.04 points by reducing extraneous cognitive load
  • (PersonaMem, 2025) revealed frontier models achieve only ~50% accuracy on evolving persona tracking across 1M-token histories
2025-06 to 2025-12 RL-driven agents learn to manage their own memory; comprehensive frameworks and benchmarks emerge
  • (MemAgent, 2025) used RL to train memory overwrite, achieving >95% accuracy at 512K tokens and extrapolating to 3.5M tokens with linear complexity
  • (CAT, 2025) matched dense transformer quality while being 1.4-3x faster and 2-9x more memory efficient via parallel chunk compression
  • Memory Mosaics v2 (Memory Mosaics, 2025) scaled associative memory networks to 10B parameters, outperforming Transformers by 12-15% on multi-document QA
  • PersonaMem-v2 (PersonaMem-v2, 2025) demonstrated RL-trained agentic memory outperforming GPT-5 on implicit personalization while using 16x fewer tokens
  • (Rethinking Memory, 2025) formalized six atomic memory operations and the Relative Citation Index for trend analysis
2026-01 to 2026-03 Convergence of systems and intelligence: structure-aware eviction, demand paging, and self-context engineering
  • (StateLM, 2026) introduced the Pensieve paradigm where models actively delete their own context, achieving 52% accuracy on BrowseComp-Plus vs 5% for standard LLMs
  • (Pichay, 2026) built a complete demand-paging system for LLM context, reducing consumption by 93% with a 0.025% page fault rate across 1.4M simulated evictions
  • (LycheeCluster, 2026) combined structure-aware chunking with hierarchical KV indexing for 3.6x end-to-end inference speedup over full attention
  • (LookaheadKV, 2026) reduced eviction cost by 14.5x using learnable tokens that predict future attention patterns with negligible overhead
  • (LongFlow, 2026) achieved 11.8x throughput for reasoning models by fusing KV eviction directly into FlashAttention kernels
  • (MemPO, 2026) optimized memory as an intrinsic RL action with dual rewards, gaining +25.98% F1 while cutting tokens by 67%

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
KV Cache Eviction with Importance Scoring | Predict which cached tokens the model will actually need during generation, and discard the rest to fit within a fixed memory budget. | Full KV cache retention (which grows linearly with sequence length) and simple heuristics like keeping only recent tokens (sliding window) | LycheeCluster (2026), LookaheadKV (2026), Where Matters More Than What:... (2026), Dynamic Memory Compression (2024)
Virtual Memory and Paging for KV Cache | Treat the KV cache like OS virtual memory: allocate on demand, page to disk when cold, and share across requests via reference counting. | Static contiguous memory allocation that wastes 60-80% of GPU memory due to fragmentation and over-provisioning | Efficient Memory Management for Large... (2023), The Missing Memory Hierarchy: Demand... (2026), vAttention: Dynamic Memory Management for... (2024), MemServe (2024)
RL-based Active Memory Management | Let the model learn through trial-and-error rewards what information to keep, compress, or discard from its working memory. | Fixed external memory modules and rule-based context truncation that cannot adapt to task-specific information needs | MemAgent (2025), StateLM (2026), MemPO (2026), Mem-α: Training LLMs to Manage... (2025)
Hierarchical and Compressed Attention | Compress past context into compact hierarchical representations so that each new token attends to summaries rather than the full history. | Full self-attention (quadratic cost) and simple sliding window attention (which loses distant context entirely) | Compress & Attend Transformer (2025), PHOTON (2025), Memory Mosaics at scale (2025), Slow-Fast Inference (2026)
Latent-Space Memory Augmentation | Maintain a persistent memory bank of compressed past states that the model can query via learned retrieval, decoupling memory capacity from context window size. | Context window limits that force the model to either truncate history or process prohibitively long sequences | LM2 (2025), M+: Extending MemoryLLM with Scalable... (2025), InfLLM (2024), Language Models Augmented with Long-Term... (2023)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
RULER (Long-Context Recall) | Accuracy | >95% | MemAgent (2025)
Needle-in-a-Haystack (NIAH) | Accuracy | 100% | InfLLM (2024)
BABILong (Multi-step Reasoning over Long Context) | Average Accuracy | +37.1% over RMT | LM2 (2025)

⚠️ Known Limitations (5)

  • Most KV cache eviction methods rely on prompt-phase attention patterns that poorly predict actual decoding-time importance, causing loss of critical information for complex reasoning tasks. (affects: KV Cache Eviction with Importance Scoring, Hierarchical and Compressed Attention)
    Potential fix: DapQ and LookaheadKV address this by simulating future query positions or using learnable lookahead tokens to better predict decoding-time importance.
  • RL-based memory methods require extensive training and careful reward design; the dense memory-quality reward often needs ground-truth answers, limiting applicability to tasks without clear correctness signals. (affects: RL-based Active Memory Management)
    Potential fix: SUPO's joint optimization of summarization and task performance within the MDP framework and Agent-Omit's dual-sampling strategy offer paths toward more generalizable RL-based approaches.
  • Memory-augmented architectures require modifications to the base model or additional training, making them harder to apply to existing deployed models compared to training-free methods. (affects: Latent-Space Memory Augmentation, Hierarchical and Compressed Attention)
    Potential fix: Training-free approaches like InfLLM and Slow-Fast Inference demonstrate effective long-context handling without architectural changes, though they may not match the performance ceiling of trained approaches.
  • Personalization benchmarks reveal that models struggle with implicit preference tracking and dynamic persona updates, with accuracy dropping to 30-50% on tasks requiring integration of new information with historical context. (affects: Personalization via Long-Context Memory)
    Potential fix: PersonaMem-v2's RL-trained agentic memory approach shows promise, achieving 55% accuracy while using 16x fewer tokens by maintaining compact, dynamically updated user profiles.
  • Evaluation frameworks remain fragmented: most benchmarks test retrieval (needle-in-a-haystack) but not complex operations like state tracking, editing, or forgetting, making it difficult to compare methods holistically. (affects: KV Cache Eviction with Importance Scoring, Latent-Space Memory Augmentation, RL-based Active Memory Management)
    Potential fix: The programmable test framework from paper 1232 and the layered evaluation protocol from paper 8490 offer more comprehensive approaches that decompose memory into atomic capabilities.

💡 The infrastructure for handling extended sequences directly enables Conversational and Dialogue Memory, where the challenge shifts from raw token management to maintaining persistent user preferences, persona traits, and contextual facts across multi-session dialogues spanning weeks or months.

🔬

Conversational and Dialogue Memory

What: Research on enabling AI systems to maintain persistent, coherent memory across extended multi-turn and multi-session conversations, including retaining user preferences, persona traits, and contextual facts over time.

Why: As LLM-based assistants become daily-use tools, users expect them to remember past interactions, adapt to personal preferences, and maintain consistency—capabilities that fixed context windows fundamentally cannot support.

Baseline: The conventional approach feeds recent conversation history directly into the LLM's context window or uses simple top-k embedding similarity retrieval over stored dialogue, which fails as conversations grow beyond the context limit and retrieval becomes imprecise.

  • Context window limitations prevent LLMs from accessing full conversation histories spanning weeks or months of interaction
  • Retrieving the right memory at the right time requires understanding query intent, not just surface-level keyword or embedding similarity
  • User preferences are often expressed implicitly through behavior rather than explicit statements, making them difficult to detect and store
  • Memory must be dynamically updated—adding, merging, and deleting information—as user facts and preferences evolve over time

🧪 Running Example

❓ A user asks their AI assistant 'Can you suggest a restaurant for tonight?' three weeks after mentioning, in an earlier conversation, that they recently became vegetarian and prefer quiet places.

Baseline: A standard LLM with a fixed context window has no access to the three-week-old conversation. Even a basic RAG system may fail because the query 'suggest a restaurant' has low semantic similarity to the earlier discussion about dietary changes, retrieving irrelevant past exchanges instead.

Challenge: The user's vegetarian preference was mentioned implicitly during a health discussion, not as a direct 'I am vegetarian' statement. Retrieving this requires understanding that dietary preferences are relevant to restaurant recommendations—a semantic leap that surface-level retrieval misses.
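
The retrieval failure described above can be reproduced with a toy example. The sketch below uses bag-of-words cosine similarity as a crude stand-in for embedding similarity (an assumption: real dense embeddings capture more meaning, though they can still miss an implicit link like this one). The stored memories are invented for illustration.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, memories: list, k: int = 1) -> list:
    q = bow(query)
    return sorted(memories, key=lambda m: cosine(q, bow(m)), reverse=True)[:k]


memories = [
    "My doctor suggested I cut out meat, so I stopped eating it last month.",
    "Can you suggest a name for my new puppy?",
    "The restaurant near my office closed down for renovation.",
]
query = "Can you suggest a restaurant for tonight?"

best = top_k(query, memories, k=1)[0]
diet_score = cosine(bow(query), bow(memories[0]))
print(best)
print(f"similarity to the dietary memory: {diet_score:.2f}")
```

The memory that matters for the recommendation shares no surface vocabulary with the query, so similarity ranking alone surfaces an irrelevant exchange first.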

✅ MemGPT (Virtual Context Management): Treats the context window as 'RAM' and external storage as 'disk,' allowing the agent to autonomously page in the user's dietary preferences from long-term storage when the restaurant query triggers a memory search.
✅ Reflective Memory Management (RMM): During the original health conversation, RMM decomposes the session into atomic topics including 'user became vegetarian,' storing it as a discrete memory unit. Its trained reranker learns that dietary facts are relevant to food queries, surfacing this memory reliably.
✅ Mem0 (Dynamic Memory Management): Extracts the vegetarian preference as a salient fact during the original conversation and stores it as a persistent memory entry. When the restaurant query arrives, semantic retrieval over the structured memory store surfaces this fact directly.
✅ PersonaMem-v2 (Implicit Persona Learning): Uses reinforcement learning to train a compact memory summary that captures implicitly revealed preferences like dietary changes, enabling personalized recommendations without re-reading the full conversation history.
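
Several of the approaches above share a common pattern: extract discrete facts, update them as they change, and recall them by relevance to the task rather than by wording. The sketch below is a hypothetical illustration of that pattern, not the API of Mem0, RMM, or any other system named here; in particular, the hand-written `RELEVANT_ATTRIBUTES` table stands in for what a real system would learn or infer.

```python
# Hand-written relevance map (an assumption); a real system would learn this.
RELEVANT_ATTRIBUTES = {"restaurant": ("diet", "ambience")}


class UserMemory:
    def __init__(self):
        self.facts = {}  # attribute -> current value

    def upsert(self, attribute: str, value: str) -> str:
        """Store a fact and report which operation was applied."""
        old = self.facts.get(attribute)
        if old == value:
            return "NOOP"
        self.facts[attribute] = value
        return "ADD" if old is None else "UPDATE"

    def delete(self, attribute: str) -> bool:
        return self.facts.pop(attribute, None) is not None

    def recall(self, task: str) -> dict:
        """Return only the facts judged relevant to the task."""
        wanted = RELEVANT_ATTRIBUTES.get(task, ())
        return {a: self.facts[a] for a in wanted if a in self.facts}


memory = UserMemory()
ops = [
    memory.upsert("diet", "omnivore"),
    memory.upsert("diet", "vegetarian"),  # the user's preference changed
    memory.upsert("diet", "vegetarian"),  # repeated mention, nothing to do
    memory.upsert("ambience", "quiet"),
    memory.upsert("pet", "dog"),
]
print(ops)
print(memory.recall("restaurant"))
```

Keying recall on the task lets the vegetarian preference surface for a restaurant query even though the two share no wording.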

📈 Overall Progress

The field evolved from fixed context windows to autonomous, self-improving memory systems that organize, retrieve, and evolve conversational knowledge using OS and graph paradigms.

📂 Sub-topics

Memory Architecture and Management

10 papers

Systems that structure and manage conversational memory using hierarchical, graph-based, or OS-inspired architectures to enable persistent, organized storage and efficient retrieval across long time horizons.

Virtual Context Management, Graph-Based Memory, Hierarchical Memory Tiers, Sentence Graph Memory
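
To illustrate why a graph layout helps with connected facts, the sketch below stores memories as (subject, relation, object) triples and collects everything reachable within a hop limit using breadth-first search. The class name, relations, and example facts are hypothetical and are not drawn from any of the systems in this sub-topic.

```python
from collections import defaultdict, deque


class MemoryGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # subject -> [(relation, object), ...]

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject].append((relation, obj))

    def multi_hop(self, start: str, max_hops: int) -> list:
        """Breadth-first traversal returning triples within `max_hops` of `start`."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for relation, obj in self.edges.get(node, []):
                found.append((node, relation, obj))
                if obj not in seen:
                    seen.add(obj)
                    queue.append((obj, depth + 1))
        return found


graph = MemoryGraph()
graph.add("user", "sibling", "Ana")
graph.add("Ana", "lives_in", "Lisbon")
graph.add("Lisbon", "known_for", "seafood")

one_hop = graph.multi_hop("user", max_hops=1)
two_hops = graph.multi_hop("user", max_hops=2)
print(one_hop)
print(two_hops)
```

A flat store would have to match "Lisbon" from a query that only mentions the user's sibling; following edges makes that connection explicit.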

Personalization and Preference Learning

9 papers

Methods for learning, storing, and applying user-specific preferences and communication styles, including parametric fine-tuning, causal modeling, and embedding-based approaches.

Generation-Calibrated Retrieval, Causal Preference Modeling, Difference-aware Embeddings, Implicit Persona Learning

Retrieval-Augmented Dialogue Memory

3 papers

Approaches that enhance multi-turn dialogue by dynamically retrieving relevant context from conversation history, social interactions, or structured memory stores using tool-augmented or history-aware retrieval strategies.

Tool-Augmented Memory Retrieval, Dynamic Historical Context RAG, Social-RAG

Evaluation Benchmarks and Datasets

4 papers

Benchmarks and evaluation frameworks that measure LLM performance on long-term conversational memory, preference adherence, multi-turn instruction following, and cognitive reasoning over dialogue history.

LoCoMo Pipeline, Constraint-Consistency Evaluation, PrefEval Protocol, MultiTurnInstruct

💡 Key Insights

💡 OS-inspired memory hierarchies with self-directed paging enable unbounded conversation length without losing critical context.

💡 Graph-based memory structures outperform flat vector stores for multi-hop reasoning across interconnected user facts.

💡 Implicit user preferences are far harder to capture than explicit statements, with most models scoring below 10% zero-shot.

💡 Reinforcement learning can train compact memory summaries that outperform frontier models using 16x fewer tokens.

💡 Cognitive memory evaluation reveals that factual recall scores drastically overestimate true conversational understanding.

💡 Reflective memory that learns from its own retrieval successes adapts to individual users without human annotation.

📅 Timeline

Research progressed from pioneering OS-inspired memory architectures (MemGPT, 2023) through graph-based and reflective memory systems (2024-2025) to implicit personalization via reinforcement learning and cognitive evaluation frameworks that expose fundamental gaps between factual recall and true preference understanding (2025-2026).

2023-09 to 2024-02 Foundational architectures for persistent conversational memory and early personalization
  • (MemGPT, 2023) pioneered OS-inspired virtual context management, achieving +60% accuracy improvement on deep memory retrieval tasks
  • (Pearl, 2023) introduced generation-calibrated retrieval for personalized writing, training retrievers whose scores correlate with downstream generation quality
  • (DPeM, 2023) applied dual-process memory with working, short-term, and long-term tiers to medical assistant personalization
  • (LoCoMo, 2024) established the first very long-term dialogue benchmark (300+ turns), revealing that LLMs lag behind humans by 56-73% on memory tasks
2024-02 to 2025-03 Benchmarking conversational memory and emergence of graph-based and preference-aware systems
  • (EMG, 2024) introduced editable memory graphs with RL-driven traversal, outperforming baselines by ~10.6% on QA after weeks of continuous edits
  • (PrefEval, 2025) revealed that preference following accuracy falls below 10% for most models in zero-shot settings
  • (RMM, 2025) introduced reflective memory with prospective topic decomposition and retrospective RL-trained reranking, gaining +10% on LongMemEval
  • (A-Mem, 2025) applied Zettelkasten-inspired atomic notes with self-evolving links, improving F1 by 35% over LoCoMo baselines while reducing token usage by 85-93%
2025-04 to 2026-03 Scalable memory systems, implicit personalization, and cognitive evaluation
  • Mem0 (Mem0, 2025) introduced dynamic dual-phase memory extraction with a graph variant, achieving 26% improvement over baselines with 91% latency reduction
  • PersonaMem-v2 (PersonaMem-v2, 2025) demonstrated that RL-trained agentic memory outperforms GPT-5 on implicit personalization using 16x fewer tokens
  • (SGMem, 2025) combined sentence-level graphs with joint indexing of raw dialogue and generated summaries, outperforming LightRAG on LongMemEval
  • (MemoryOS, 2025) extended OS-inspired memory with segmented paging and heat-based eviction, achieving +49% F1 on LoCoMo
  • (LoCoMo-Plus, 2026) exposed that cognitive memory performance collapses compared to factual recall, and that task disclosure artificially inflates scores
  • (TA-Mem, 2026) transformed retrieval into an agentic tool-selection task, gaining +7 F1 over Mem0 on temporal QA while using 4x fewer tokens than full-context methods
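
The OS-inspired entries in the timeline above (virtual context management, heat-based eviction) share one paging idea: keep a small set of entries in the prompt and move the rest to external storage. The following is a minimal sketch of that idea, not the implementation of MemGPT or MemoryOS. The class name `VirtualContext`, the use of an access count as the "heat" score, and a capacity measured in entries rather than tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualContext:
    """Toy two-tier memory: a small in-context tier plus an unbounded archive."""
    capacity: int                                 # max entries in context (assumed unit: entries, not tokens)
    context: dict = field(default_factory=dict)   # key -> text currently "in the prompt"
    archive: dict = field(default_factory=dict)   # key -> text paged out to external storage
    heat: dict = field(default_factory=dict)      # key -> access count (stand-in heat score)

    def __post_init__(self) -> None:
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def write(self, key: str, text: str) -> None:
        """Add or update an entry in context, evicting the coldest entry if full."""
        self.archive.pop(key, None)
        self.context[key] = text
        self.heat[key] = self.heat.get(key, 0) + 1
        self._evict_if_needed(protect=key)

    def read(self, key: str):
        """Return an entry, paging it back into context on an archive hit."""
        if key in self.context:
            self.heat[key] += 1
            return self.context[key]
        if key in self.archive:
            text = self.archive.pop(key)
            self.context[key] = text
            self.heat[key] += 1
            self._evict_if_needed(protect=key)
            return text
        return None

    def _evict_if_needed(self, protect: str) -> None:
        """Page out the least-accessed entries, never the one just touched."""
        while len(self.context) > self.capacity:
            victims = [k for k in self.context if k != protect]
            coldest = min(victims, key=lambda k: self.heat[k])
            self.archive[coldest] = self.context.pop(coldest)
```

Evicted entries are archived rather than deleted, so a later `read` can page them back in. A real system would also have to decide how heat decays over time, which this sketch leaves out.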

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Virtual Context Management | Let the LLM manage its own memory like an operating system manages virtual memory, swapping data between limited fast context and unlimited external storage. | Fixed context window approaches that truncate or summarize old conversation history | MemGPT (2023), MemoryOS (2025), LLM-based (2023) |
| Graph-Based Memory Organization | Structure conversational memories as interconnected graphs to enable relational reasoning and precise multi-hop retrieval that flat vector stores cannot support. | Flat vector-based memory stores that retrieve isolated facts without relational context | Crafting Personalized Agents through Retrieval-Augmented... (2024), Mem0 (2025), SGMem (2025), Agentic Memory (2025) |
| Reflective and Adaptive Memory Management | Use the LLM's own memory usage patterns as feedback to train a reranker that adapts retrieval to specific user interaction styles without human labels. | Static retrieval methods that use fixed similarity thresholds regardless of user or query type | In Prospect and Retrospect: Reflective... (2025) |
| Tool-Augmented and Dynamic Retrieval | Expose multiple memory indices as callable tools, letting the LLM agent decide the retrieval strategy rather than forcing all queries through a single embedding-similarity pipeline. | Single-index top-k similarity retrieval that treats all query types identically | TA-Mem (2026), DH-RAG (2025), Social-RAG (2025) |
| Parametric Personalization via Fine-tuning | Encode user conversation history and preferences directly into model parameters using efficient fine-tuning, eliminating the need for runtime retrieval. | Retrieval-augmented generation which requires external storage management and adds retrieval latency | On the Way to LLM... (2024), Enabling On-Device Large Language Model... (2023), Latent Inter-User Difference Modeling for... (2025) |
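
The "Tool-Augmented and Dynamic Retrieval" row describes exposing several memory indices as callable tools and letting the agent pick one per query. A rough sketch of that control flow follows. The memory log, both tool functions, and the rule-based `choose_tool` are hypothetical; in particular, `choose_tool` is a regular-expression stand-in for a decision that the methods above delegate to the LLM.

```python
import re
from datetime import date

# Hypothetical memory log of (date, text) pairs, invented for illustration.
MEMORIES = [
    (date(2025, 1, 10), "User adopted a dog named Pixel"),
    (date(2025, 3, 2), "User moved to Lisbon"),
    (date(2025, 3, 20), "User started a pottery class"),
]


def search_by_keyword(query: str) -> list[str]:
    """Tool 1: return memories that share at least one word with the query."""
    words = set(query.lower().split())
    return [text for _, text in MEMORIES if words & set(text.lower().split())]


def search_by_month(year: int, month: int) -> list[str]:
    """Tool 2: return memories recorded in the given calendar month."""
    return [text for d, text in MEMORIES if (d.year, d.month) == (year, month)]


TOOLS = {"keyword": search_by_keyword, "month": search_by_month}


def choose_tool(question: str) -> tuple[str, tuple]:
    """Stand-in for the agent's tool choice; a real system would ask the LLM."""
    match = re.search(r"(\d{4})-(\d{2})", question)
    if match:
        return "month", (int(match.group(1)), int(match.group(2)))
    return "keyword", (question,)


def recall(question: str) -> list[str]:
    """Dispatch the question to whichever memory tool was selected."""
    name, args = choose_tool(question)
    return TOOLS[name](*args)
```

The point of the design is that a temporal question never passes through word or embedding similarity at all, which is what a single top-k pipeline would force.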

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| LoCoMo | F1 / BLEU-1 | +49.11% F1 over baselines | MemoryOS (2025) |
| LongMemEval | Accuracy | 70.4% | In Prospect and Retrospect: Reflective... (2025) |
| PrefEval | Preference Following Accuracy | Significant improvement over zero-shot | Do LLMs Recognize Your Preferences?... (2025) |

⚠️ Known Limitations (5)

  • Scalability of memory operations: As conversation histories grow to thousands of sessions, memory extraction, graph updates, and retrieval become computationally expensive, limiting real-time responsiveness. (affects: Graph-Based Memory Organization, Virtual Context Management, Reflective and Adaptive Memory Management)
    Potential fix: Mem0 addresses latency with efficient dual-phase pipelines (91% p95 latency reduction), and SGMem uses lightweight NLTK-based graph construction instead of expensive LLM-based entity-relation extraction.
  • Evaluation gaps for implicit and cognitive memory: Most benchmarks test explicit factual recall, which overestimates real-world performance where preferences are implicit and constraints require inference beyond lexical overlap. (affects: Causal and Implicit Preference Modeling, Parametric Personalization via Fine-tuning)
    Potential fix: LoCoMo-Plus introduces constraint-consistency evaluation that measures behavioral adherence rather than string matching, providing a more realistic assessment of memory capabilities.
  • Privacy and on-device constraints: Storing detailed user conversation histories raises privacy concerns, and on-device personalization is limited by storage capacity and the inability to offload data for cloud-based annotation. (affects: Parametric Personalization via Fine-tuning, Virtual Context Management)
    Potential fix: SDSS proposes self-supervised data selection with entropy-based filtering and local synthetic data augmentation, avoiding cloud offloading while maintaining personalization quality.
  • Memory staleness and conflict resolution: As user preferences evolve over time, stored memories can become outdated or contradictory, and most systems lack principled mechanisms to detect and resolve conflicts between old and new information. (affects: Graph-Based Memory Organization, Parametric Personalization via Fine-tuning, Tool-Augmented and Dynamic Retrieval)
    Potential fix: A-Mem implements memory evolution where new experiences trigger rewrites of old memory contexts, and Mem0 uses an update/delete pipeline to manage changing facts.
  • Degradation under adversarial and conflicting instructions: Models that perform well on standard memory retrieval show severe degradation when faced with adversarial questions or entangled multi-turn constraints, indicating brittle memory integration. (affects: Virtual Context Management, Causal and Implicit Preference Modeling)
    Potential fix: MultiTurnInstruct identifies that stronger reasoning does not guarantee better conflict resolution, suggesting that dedicated training on constraint-conflict scenarios may be needed beyond general instruction tuning.
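
The memory staleness limitation above mentions update/delete pipelines for changing facts. One way such a pipeline could look is sketched below, assuming facts are keyed by (subject, attribute) and carry a turn number. The `Fact` type, the operation names, and the "newest turn wins" rule are illustrative assumptions, not the API or policy of Mem0 or any other system named here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    attribute: str
    value: Optional[str]   # None means "retract this attribute"
    turn: int              # dialogue turn at which the fact was stated


def apply_fact(store: dict, fact: Fact) -> str:
    """Apply one extracted fact; return the operation taken (ADD, UPDATE, DELETE, NOOP)."""
    key = (fact.subject, fact.attribute)
    current = store.get(key)
    if current is not None and current.turn > fact.turn:
        return "NOOP"                  # incoming fact is older than what is stored
    if fact.value is None:
        if current is None:
            return "NOOP"              # nothing to retract
        del store[key]
        return "DELETE"
    if current is None:
        store[key] = fact
        return "ADD"
    if current.value == fact.value:
        return "NOOP"                  # restatement of a known fact
    store[key] = fact                  # newer, conflicting value replaces the old one
    return "UPDATE"
```

Keeping one value per key makes contradictions impossible by construction, at the cost of discarding history; a system that needs to answer "where did the user live before?" would have to version facts instead.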

💡 The accumulation of user-specific knowledge across extended dialogues inevitably raises the challenge of Continual Learning and Catastrophic Forgetting—how models can sequentially absorb new information from evolving interactions without overwriting the knowledge they have already acquired.

🏆

Continual Learning and Catastrophic Forgetting

What: Continual learning studies how models can sequentially acquire new knowledge or skills from non-stationary data streams without losing previously learned information; that loss is the failure mode known as catastrophic forgetting.

Why: Real-world AI systems must adapt to evolving data, new tasks, and changing user needs over time. Without continual learning, models require costly full retraining or suffer degraded performance on earlier capabilities.

Baseline: The conventional approach is sequential fine-tuning, where a model is updated on new task data using standard gradient descent. This typically causes catastrophic forgetting because new parameter updates overwrite weights that encoded prior knowledge.

  • Stability-plasticity dilemma: balancing the ability to learn new information (plasticity) while retaining old knowledge (stability)
  • Scalability: maintaining performance as the number of sequential tasks grows into the hundreds or thousands without proportional growth in memory or parameters
  • Task-agnostic inference: performing well without knowing which task a given input belongs to (class-incremental setting), which is far harder than task-incremental settings
  • Evaluation beyond accuracy: measuring not just final performance but also backward transfer (where negative values indicate forgetting), forward transfer, and computational overhead
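
The evaluation quantities in the last bullet can be made concrete with the accuracy-matrix definitions commonly used in the continual-learning literature. In the sketch below, `R[i][j]` is the accuracy on task `j` after training through task `i`, and `baseline[j]` is the accuracy of an untrained model on task `j`. The numbers in the example matrix are invented for illustration.

```python
def average_accuracy(R: list[list[float]]) -> float:
    """Mean accuracy over all tasks after training on the final task."""
    final_row = R[-1]
    return sum(final_row) / len(final_row)


def backward_transfer(R: list[list[float]]) -> float:
    """Average change on earlier tasks after all training; negative means forgetting."""
    T = len(R)
    if T < 2:
        raise ValueError("backward transfer needs at least two tasks")
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forward_transfer(R: list[list[float]], baseline: list[float]) -> float:
    """Average gain on a task before it is trained, relative to an untrained model."""
    T = len(R)
    if T < 2:
        raise ValueError("forward transfer needs at least two tasks")
    return sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)


# Two tasks; row = tasks trained so far, column = task evaluated (invented values).
R = [
    [0.90, 0.20],
    [0.40, 0.85],
]
baseline = [0.10, 0.15]
```

For this matrix the backward transfer is 0.40 - 0.90 = -0.50, so the first task lost half its accuracy, while the final average accuracy of 0.625 on its own would hide how that loss is distributed.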

🧪 Running Example

❓ An LLM customer-support chatbot trained on electronics troubleshooting must now also handle furniture assembly inquiries without forgetting electronics knowledge.

Baseline: Standard fine-tuning on the furniture dataset causes the model to lose electronics-specific terminology, troubleshooting flows, and product knowledge. After training on furniture, accuracy on electronics queries drops from 92% to below 40%.

Challenge: The model has limited capacity to store both domains. Furniture training updates the same parameters that encoded electronics knowledge, and without access to the original electronics training data, there is no way to remind the model of what it once knew.

✅ MEGa (Memory Embedded in Gated LLMs): Assigns an independent, frozen LoRA adapter for each knowledge domain. At inference, a gating mechanism matches the user query to stored domain keys and activates only the relevant adapter, so electronics and furniture knowledge never interfere.
✅ MSSR (Memory-Aware Adaptive Replay): Models the forgetting curve for each training example and schedules replay of electronics samples at expanding intervals during furniture training, prioritizing samples most at risk of being forgotten.
✅ CoRe (Continual Representation Learning): Operates in representation space rather than weight space, applying orthogonality constraints so that furniture-specific representation updates cannot drift into the subspace used by electronics knowledge.
✅ Context Channel Capacity Analysis: Predicts in advance whether the chosen architecture has sufficient context channel capacity to support zero forgetting across both domains, guiding the selection of an architecture (e.g., HyperNetworks) that provably avoids the information bottleneck.
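
The first approach in the running example, one frozen adapter per domain plus a gate that picks the adapter by matching the query to stored keys, can be sketched without any model at all. Everything below is a simplification made for illustration: the keys are bag-of-words counts instead of learned embeddings, the gate is hard top-1 routing, and each "adapter" is a plain function standing in for a LoRA adapter. It is not MEGa's implementation.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class GatedAdapters:
    """One frozen adapter per domain, selected by similarity to a stored key."""

    def __init__(self) -> None:
        self.keys: dict = {}       # domain name -> key embedding
        self.adapters: dict = {}   # domain name -> callable standing in for an adapter

    def add_domain(self, name: str, key_text: str, adapter) -> None:
        if name in self.keys:
            raise ValueError(f"adapter for {name!r} is frozen and cannot be replaced")
        self.keys[name] = embed(key_text)
        self.adapters[name] = adapter

    def route(self, query: str) -> str:
        """Return the domain whose key is most similar to the query."""
        q = embed(query)
        return max(self.keys, key=lambda name: cosine(q, self.keys[name]))

    def answer(self, query: str) -> str:
        return self.adapters[self.route(query)](query)
```

Because adding the furniture domain only inserts a new key and adapter, nothing stored for electronics is modified, which is the property the example relies on. The price is that a misrouted query gets an answer from the wrong adapter.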

📈 Overall Progress

The field has shifted from treating forgetting as an unavoidable side effect to be mitigated to building provably forgetting-free architectures, guided by information-theoretic bounds and modular memory systems.

📂 Sub-topics

Theoretical Frameworks and Taxonomies

2 papers

Formal information-theoretic analyses and comprehensive surveys that explain why forgetting occurs and categorize the landscape of continual learning strategies.

Context Channel Capacity, Unified Lifelong Learning Taxonomy

Parameter-Efficient Continual Adaptation

5 papers

Methods that freeze the pretrained backbone and apply lightweight modifications (gating, representation interventions, classifier alignment) to adapt to new tasks while preserving old knowledge.

Channel-wise Gated Modulation, CoRe, LCA, SEQ*

Dynamic Routing and Architecture Growth

3 papers

Approaches that expand or dynamically route through network components to accommodate new tasks, including energy-based routing, soft masking, and adaptive network growth.

Energy-Based Associative Routing, MAGIC Net, CPNS Regularization

Replay and Rehearsal Optimization

3 papers

Methods that store and selectively replay past experiences to mitigate forgetting, with innovations in replay scheduling, adversarial diversification, and dual-buffer strategies.

MSSR, ADRM, ARROW

Continual Knowledge Editing for LLMs

4 papers

Techniques for injecting, updating, or correcting knowledge in large language models over long sequences of edits without degrading prior knowledge or general capabilities.

SoLA, MEGa, MEMOIR, Language-Controlled Neural Memory

Agent Memory and Continual Adaptation

4 papers

Memory systems for autonomous agents that learn reusable workflows, curate episodic experiences, and adapt retrieval strategies without fine-tuning the underlying LLM.

Agent Workflow Memory, Experience-Augmented Hierarchical Planning, Memento, Panini GSW

💡 Key Insights

💡 Pre-trained backbones retain more knowledge than assumed; forgetting is often a classifier alignment problem, not a representation problem.

💡 Information-theoretic bounds prove that sequential state-based learners face an impossibility triangle among zero forgetting, online learning, and finite parameters.

💡 Modular frozen adapters with routing mechanisms scale to thousands of sequential edits without interference between updates.

💡 Representation-space interventions with orthogonality constraints outperform weight-space fine-tuning across all incremental learning settings.

💡 Agent memory systems that optimize retrieval rather than model parameters enable continual improvement without any fine-tuning.

💡 Adversarial diversification of replay buffers is more effective than increasing buffer size for combating memory overfitting.
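
The insight above about orthogonality constraints in representation space rests on a simple geometric operation: remove from a proposed update any component that lies in the subspace earlier tasks depend on. The sketch below shows that projection using Gram-Schmidt on plain Python lists. It is a generic illustration of the geometry, not CoRe's training procedure, and the choice of "protected" directions is assumed to be given.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors: list[list[float]], tol: float = 1e-12) -> list[list[float]]:
    """Gram-Schmidt: turn the protected directions into an orthonormal basis."""
    basis: list[list[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:                      # skip directions already spanned
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: list[float], protected: list[list[float]]) -> list[float]:
    """Return the part of `update` that is orthogonal to every protected direction."""
    w = list(update)
    for b in orthonormalize(protected):
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


# Hypothetical 3-d example: old-task knowledge lives in the first two coordinates.
protected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [0.5, -2.0, 3.0]
safe_update = project_out(update, protected)
```

Here only the third coordinate of the update survives, so anything read out along the protected directions is unchanged. The same arithmetic shows the cost: once the protected subspace fills the whole space, no update is left.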


📅 Timeline

Early work focused on regularization and simple replay, but the rise of pre-trained models revealed that forgetting is often a classifier alignment issue rather than a representation problem. The latest wave leverages modular frozen adapters, energy-based routing, and representation-space interventions to decouple stability from plasticity, while agent memory systems extend continual learning beyond supervised settings into interactive, open-ended environments.

2023-01 to 2024-05 Foundational re-examination of forgetting assumptions and establishment of taxonomic frameworks
  • SEQ* (SEQ*, 2023) revealed that pre-trained language model backbones retain knowledge through sequential training, challenging the assumption that catastrophic forgetting is inevitable and showing that simple freezing strategies outperform complex methods
  • (Lifelong Learning Primer, 2024) established a unified taxonomy categorizing strategies into regularization, memory, and architecture families with formal metrics for forgetting and intransigence
2024-05 to 2024-10 Emergence of agent memory systems and robust replay strategies
  • (ADRM, 2024) applied adversarial perturbations to replay buffers, achieving +32.35% robustness improvement on corrupted data compared to standard rehearsal methods
  • (ICU, 2024) introduced iterative contrastive unlearning for selective knowledge removal without model collapse, reducing extraction likelihood from 0.40 to 0.04
  • (AWM, 2024) pioneered workflow induction for agents, enabling a +51.1% relative improvement on WebArena through reusable parameterized routines
  • (Agent S, 2024) combined narrative and episodic memory in a hierarchical planning framework, achieving 83.6% relative improvement on OS-level task automation
2025-04 to 2025-12 Scaling continual learning to LLMs with modular editing and memory-augmented agents
  • (MEGa, 2025) introduced per-memory LoRA adapters with context-key gating, maintaining >90% recall after 50 sequential knowledge injection tasks where baselines collapsed to <10%
  • (Memento, 2025) formalized agent learning as a Memory-augmented MDP with online RL-based case retrieval, achieving 87.88% Pass@3 on GAIA without any LLM fine-tuning
  • (MEMOIR, 2025) scaled lifelong model editing to 15,000 sequential edits on LLaMA-3-8B using sparse residual memory with TopHash retrieval
2026-02 to 2026-03 Theoretical breakthroughs, representation-level methods, and controllable memory systems
  • (CCC, 2026) proved the Impossibility Triangle: zero forgetting, online learning, and finite parameters cannot coexist for sequential state-based learners, establishing information-theoretic lower bounds for the field
  • (RwF, 2026) introduced energy-based Hopfield routing for online continual learning, achieving 74.09% accuracy on Split-ImageNet-R with only 2.1% additional parameters
  • (Panini, 2026) replaced text-chunk retrieval with structured semantic workspaces and reasoning chain retrieval, reducing token usage by 2-30x while improving QA accuracy by 5-7%
  • (CoRe, 2026) shifted fine-tuning from weight space to representation space with orthogonality constraints, achieving state-of-the-art results across task-, domain-, and class-incremental settings
  • (LCA, 2026) solved the classifier-backbone mismatch problem through Gaussian-based synthetic sample generation and incremental PEFT merging, leading on 7 benchmark datasets

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Information-Theoretic Forgetting Analysis | Zero forgetting requires the architecture's context channel capacity to equal or exceed the entropy of the task distribution. | Empirical intuitions about why some methods forget and others do not, replacing ad-hoc explanations with provable information-theoretic bounds | Context Channel Capacity (2026), An Introduction to Lifelong Supervised... (2024) |
| Frozen Backbone with Lightweight Adaptation | Keep the pretrained backbone frozen and learn only lightweight modifiers that steer existing features toward new tasks without overwriting shared representations. | Full fine-tuning and traditional parameter-efficient methods (LoRA, adapters, prompts) that update backbone parameters and suffer from representation drift | LCA (2026), Representation Finetuning for Continual Learning (2026), Gated Adaptation for Continual Learning... (2026), Learn or Recall? Revisiting Incremental... (2023) |
| Dynamic Routing and Architecture Growth | Decouple the instant routing decision (which subnetwork or prompt to use) from the slow gradient-based parameter updates, enabling immediate adaptation to distribution shifts. | Static prompt pools and fixed architecture methods that cannot adapt quickly enough for online learning or grow unnecessarily | Routing without Forgetting (2026), Don't Look Back in Anger:... (2026), Causally Sufficient and Necessary Feature... (2026) |
| Optimized Replay and Rehearsal | Schedule replay based on estimated forgetting risk per sample rather than fixed intervals or random selection, and diversify the limited replay buffer to prevent overfitting. | Fixed-interval random replay and simple experience replay buffers that waste compute on already-remembered examples or overfit to stored samples | MSSR (2026), Adversarially Diversified Rehearsal Memory (ADRM) (2024), ARROW (2026) |
| Modular Knowledge Editing for LLMs | Treat each knowledge edit as a separate frozen module retrieved by input similarity, so edits never interfere with each other and can be individually added or removed. | Global parameter editing methods (ROME, MEMIT) that degrade rapidly after hundreds of edits due to parameter interference | MEMOIR (2025), MEGa (2025), Reversible Lifelong Model Editing via... (2026) |
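
The "Optimized Replay and Rehearsal" row describes choosing replay samples by estimated forgetting risk instead of at random. A hedged sketch of one such scheduler follows. It assumes an exponential forgetting curve with a per-sample stability constant, and it doubles that constant on each replay so that review intervals expand. The curve, the default constants, and the names are assumptions for illustration, not the MSSR algorithm.

```python
import math
from dataclasses import dataclass


@dataclass
class ReplayItem:
    sample_id: str
    last_seen: int            # training step of the most recent exposure
    stability: float = 10.0   # assumed decay time constant, in training steps

    def retention(self, step: int) -> float:
        """Estimated probability the sample is still remembered at `step`."""
        return math.exp(-(step - self.last_seen) / self.stability)


def pick_replay_batch(items: list[ReplayItem], step: int, batch_size: int) -> list[ReplayItem]:
    """Choose the samples with the lowest estimated retention (highest risk)."""
    return sorted(items, key=lambda item: item.retention(step))[:batch_size]


def mark_replayed(item: ReplayItem, step: int, growth: float = 2.0) -> None:
    """Reset the clock and lengthen the interval before the next review."""
    item.last_seen = step
    item.stability *= growth
```

Compared with uniform sampling from the buffer, this spends replay compute on samples whose estimated retention is low and skips the ones that were reviewed recently or have proved stable.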

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| Split-ImageNet-R | Final Average Accuracy | 74.09% | Routing without Forgetting (2026) |
| Split-MNIST (Continual Learning) | Accuracy / Forgetting Rate | 98.8% accuracy, ~0% forgetting | Context Channel Capacity (2026) |
| WebArena (Agent Task Completion) | Success Rate | +51.1% relative improvement | Agent Workflow Memory (2024) |

⚠️ Known Limitations (5)

  • Most continual learning methods are evaluated on relatively short task sequences (5-20 tasks), making it unclear how they perform at the scale of hundreds or thousands of tasks encountered in real deployment scenarios. (affects: Frozen Backbone with Lightweight Adaptation, Dynamic Routing and Architecture Growth, Optimized Replay and Rehearsal)
    Potential fix: MEMOIR demonstrates scaling to 15,000 edits using sparse residual memory, suggesting that modular approaches with efficient retrieval may overcome this limitation.
  • Routing and modular methods require storing a growing number of modules (LoRA adapters, keys, masks), creating a linear memory overhead that may become prohibitive for resource-constrained environments like edge devices. (affects: Modular Knowledge Editing for LLMs, Dynamic Routing and Architecture Growth, Agent Workflow and Episodic Memory)
    Potential fix: Module compression, periodic consolidation of similar modules, and sparse activation patterns (as in MEMOIR's TopHash) can reduce storage overhead.
  • Class-incremental learning (where task identity is unknown at inference) remains significantly harder than task-incremental settings, with performance gaps of 20+ percentage points, yet is the most realistic deployment scenario. (affects: Frozen Backbone with Lightweight Adaptation, Dynamic Routing and Architecture Growth)
    Potential fix: Context Channel Capacity analysis suggests that architectures with sufficient capacity (HyperNetworks) can close this gap; the Gradient Context Encoder reduced the gap from 23.3pp to 0.7pp on CIFAR-10.
  • Agent memory systems rely on the quality of self-evaluation and workflow abstraction, which can propagate errors if the agent incorrectly assesses its own success or extracts misleading patterns from limited experience. (affects: Agent Workflow and Episodic Memory)
    Potential fix: Memento's RL-based case retrieval optimization provides a principled way to learn which experiences are actually useful, rather than relying on heuristic self-evaluation.
  • Most methods are benchmarked in controlled settings with clearly delineated task boundaries, whereas real-world data streams often have gradual, overlapping distribution shifts without explicit task demarcations. (affects: Optimized Replay and Rehearsal, Frozen Backbone with Lightweight Adaptation, Information-Theoretic Forgetting Analysis)
    Potential fix: MAGIC Net's drift detection and adaptive strategy selection address this by operating on continuous streams without requiring task boundaries.

💡 The stability-plasticity dilemma central to continual learning has deep roots in cognitive science, where human memory naturally balances retention and forgetting through consolidation, spreading activation, and episodic-semantic separation—principles that Cognitive and Human-like Memory research translates into practical AI architectures.

📱

Cognitive and Human-like Memory

What: Research that draws on cognitive science—particularly models of working memory, episodic/semantic memory, attention, and memory consolidation—to design memory systems for LLMs and AI agents.

Why: Standard LLMs process each input statelessly or within a fixed context window, lacking the persistent, structured memory that enables humans to accumulate experience, maintain identity, and reason over long histories.

Baseline: Conventional approaches either expand the raw context window (brute-force token concatenation) or use flat vector-store retrieval (RAG) with no cognitive structure, treating all memories uniformly regardless of type, recency, or relevance.

  • Bridging the gap between finite context windows and the need for persistent, long-term memory across sessions
  • Retrieving relevant memories when the current query has no surface-level semantic overlap with stored information (cue-trigger disconnect)
  • Balancing memory retention and forgetting to prevent redundancy, context drift, and hallucination from stale or conflicting memories
  • Designing memory architectures that support identity persistence and continuity when the underlying model is upgraded or replaced

🧪 Running Example

❓ A user has been chatting with a personal AI assistant for months. Three months ago, the user mentioned being stressed about upcoming medical board exams. Now the user asks: 'What should I watch on Netflix tonight?'

Baseline: A standard LLM with flat vector retrieval finds no semantic match between 'Netflix' and 'medical exams,' so it returns generic trending recommendations with no awareness of the user's stress or study schedule.

Challenge: The relevant memory ('stressed about exams') has zero lexical overlap with the current query ('Netflix tonight'), so keyword and embedding-based retrieval both fail. Moreover, the memory is months old and may have been lost to context window limits.

✅ Dual-Memory System (PRIME): Semantic memory has distilled months of interactions into the trait 'user is a stressed medical student,' which is surfaced regardless of query topic, leading to recommendations for light, relaxing comedies.
✅ Memory Bear (Cognitively-Grounded Memory): The sleep-like consolidation mechanism has organized user memories into a structured graph, preserving emotionally important facts while pruning trivia. The 'stressed about exams' node remains active due to emotional weight, informing the recommendation.
✅ LoCoMo-Plus Evaluation: Identifies this as a 'cue-trigger semantic disconnect' case, revealing that models relying on surface retrieval would fail to apply the implicit constraint of being considerate of the user's stress level.
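
The consolidation behavior in the running example, where an emotionally weighted memory survives for months while trivia is pruned, can be illustrated with an Ebbinghaus-style exponential forgetting curve. In the sketch below the importance score stretches the decay constant. The formula's parameters, the threshold, and the sample memories are all assumptions chosen for illustration, not values from Memory Bear.

```python
import math


def retention(age_days: float, importance: float, base_stability_days: float = 30.0) -> float:
    """Exponential forgetting curve; higher importance slows the decay."""
    stability = base_stability_days * (1.0 + importance)
    return math.exp(-age_days / stability)


def consolidate(memories: list[dict], threshold: float = 0.3) -> list[dict]:
    """Keep only memories whose estimated retention is at or above the threshold."""
    return [m for m in memories if retention(m["age_days"], m["importance"]) >= threshold]


# Invented memories; importance would come from an upstream scorer in a real system.
memories = [
    {"text": "stressed about medical board exams", "age_days": 90, "importance": 4.0},
    {"text": "ordered pizza on a Tuesday", "age_days": 90, "importance": 0.0},
    {"text": "enjoys light comedies", "age_days": 10, "importance": 0.5},
]
kept = consolidate(memories)
```

Both 90-day-old memories have the same age, but only the one with high importance clears the threshold, which is the behavior the example attributes to emotional weight.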

📈 Overall Progress

The field evolved from simple differentiable memory lookups to sophisticated cognitive architectures with dual-memory stores, active consolidation, and controllable memory mirroring human cognition.

📂 Sub-topics

Dual-Memory and Consolidation Architectures

7 papers

Systems that explicitly separate memory into episodic and semantic stores, often incorporating sleep-like consolidation, working memory gating, and forgetting mechanisms inspired by hippocampal and neocortical processes.

Dual-Memory Systems, Memory Consolidation, Cognitively-Grounded Memory Orchestration, Language-Controlled Neural Memory

Associative Memory Networks

3 papers

Architectures that replace standard Transformer attention with associative memory units, achieving compositional reasoning through transparent memory operations and supporting multi-level (short-term, long-term, persistent) memory stores.

Memory Mosaics, End-to-End Memory Networks

Attention and Cognitive Load Optimization

4 papers

Methods that improve how models allocate attention over long contexts, drawing on cognitive load theory and working memory constraints to make attention more efficient and context-aware.

Context Re-Positioning, Focus Directions, Constant Memory Attention, Hub Token Attention

Cognitive Memory Benchmarks and Taxonomies

4 papers

Surveys, benchmarks, and evaluation frameworks that assess memory capabilities from a cognitive science perspective, moving beyond simple factual recall to test implicit constraint adherence and cognitive memory types.

Constraint-Consistency Evaluation, 3D-8Q Memory Taxonomy, Cognitive Architecture Taxonomy

Cognitive Models for Agent Behavior

3 papers

Research applying cognitive memory models to agent decision-making, including navigation under memory constraints, sentence processing with finite particles, and detecting cognitive degradation in autonomous agents.

Sequential Decision Model (POMDP), Resampling-Induced Digging-In, Cognitive Degradation Lifecycle

💡 Key Insights

💡 Separating memory into episodic and semantic stores consistently outperforms flat retrieval for personalization and long-horizon tasks.

💡 Active forgetting and sleep-like consolidation are essential to prevent memory bloat, context drift, and hallucination from stale information.

💡 Current memory benchmarks dramatically overestimate model capabilities by testing only explicit factual recall, not implicit constraint adherence.

💡 Associative memory networks offer a transparent, scalable alternative to Transformers with superior context extrapolation.

💡 Treating position assignment as a cognitive load optimization problem yields substantial improvements on long-context reasoning tasks.

💡 Making neural memory controllable via natural language instructions transforms memory from passive recording to active knowledge management.


📅 Timeline

Early work (2015) established differentiable memory access; 2023-2024 introduced efficient and transparent alternatives to Transformer attention; 2025 saw an explosion of cognitive-science-inspired architectures applied to robotics, personalization, and long-context tasks; 2026 has shifted focus to evaluation reform, memory governance, and user-controllable memory systems.

2015-03 Foundational memory-augmented neural networks with differentiable memory access
  • MemN2N (MemN2N, 2015) introduced end-to-end trainable memory with multi-hop soft attention, achieving 3.2% mean error on bAbI QA and establishing the paradigm of differentiable memory access
2023-05 to 2024-05 Efficient memory mechanisms and transparent associative memory architectures
  • (CMANP, 2023) achieved constant-memory attention via rolling log-sum-exp updates, enabling Neural Processes on resource-constrained devices
  • (Memory Mosaics, 2024) replaced Transformer attention with associative memory units, matching perplexity while achieving transparent predictive disentanglement
2025-01 to 2025-04 Cognitive load theory meets LLM attention, and comprehensive memory taxonomies emerge
  • (RePo, 2025) applied Cognitive Load Theory to learn content-dependent token positions, improving RULER benchmark scores by +11 points over fixed position encoding
  • (Focus Directions, 2025) identified sparse contextual heads and steerable attention vectors, boosting multi-doc QA by +7.7% EM without any training
  • Two major surveys (3D-8Q Taxonomy, 2025; Cognitive Memory Taxonomy, 2025) mapped human memory types to LLM components, establishing shared vocabulary for the field
2025-07 to 2025-12 Scaling cognitive memory to real-world applications: robotics, personalization, and agent resilience
  • (PRIME, 2025) formalized episodic-semantic memory for LLM personalization with self-distilled reasoning traces, demonstrating that semantic memory outperforms episodic for capturing user traits
  • (MemoryVLA, 2025) brought dual-stream perceptual-cognitive memory to robotics, achieving +26% improvement on real-world long-horizon tasks over the CogACT baseline
  • Memory Mosaics v2 (Memory Mosaics v2, 2025) scaled associative memory to 10B parameters and 1T training tokens, outperforming Transformers by 12-15% on multi-document QA
  • (Memory Bear, 2025) introduced sleep-based consolidation and Ebbinghaus forgetting curves, reducing inference token usage by ~90%
  • (QSAF, 2025) defined a six-stage cognitive degradation lifecycle for AI agents, identifying critical memory drift and planner entrapment vulnerabilities
2026-01 to 2026-03 Evaluation paradigm shifts, memory governance, and controllable neural memory
  • (LoCoMo-Plus, 2026) revealed that cognitive memory collapses across all models when implicit constraints are tested, fundamentally challenging existing memory evaluation
  • Tell Me What To Learn (Tell Me What To Learn, 2026) made neural memory controllable via natural language instructions, letting users specify what to remember or ignore
  • (CMA, 2026) proposed that memory constitutes the agent's identity, introducing constitutional governance and inheritance protocols for persistent digital citizens
  • (POMDP, 2026) modeled web navigation as a decision process under memory constraints, replicating human backtracking and partial scanning behaviors
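
The timeline above mentions constant-memory attention computed with rolling log-sum-exp updates. The general trick is that softmax-weighted averaging can be done over a stream while storing only a running maximum, a running normalizer, and a running weighted sum. The sketch below demonstrates that trick for a single query and checks it against the ordinary batch computation. It illustrates the numerical idea only and is not the CMANP architecture; the class and function names are made up for this example, and scores are assumed to be finite.

```python
import math


class StreamingAttention:
    """Softmax-weighted average over a stream, using O(dim) memory."""

    def __init__(self, dim: int) -> None:
        self.max_score = float("-inf")    # running maximum of the scores
        self.normalizer = 0.0             # sum of exp(score - max_score)
        self.weighted_sum = [0.0] * dim   # sum of exp(score - max_score) * value

    def update(self, score: float, value: list[float]) -> None:
        """Fold one (score, value) pair into the running state."""
        new_max = max(self.max_score, score)
        rescale = math.exp(self.max_score - new_max)   # 0.0 on the first update
        weight = math.exp(score - new_max)
        self.normalizer = self.normalizer * rescale + weight
        self.weighted_sum = [acc * rescale + weight * v
                             for acc, v in zip(self.weighted_sum, value)]
        self.max_score = new_max

    def output(self) -> list[float]:
        if self.normalizer == 0.0:
            raise ValueError("no items have been added yet")
        return [acc / self.normalizer for acc in self.weighted_sum]


def batch_attention(scores: list[float], values: list[list[float]]) -> list[float]:
    """Reference computation that needs all scores and values in memory at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
```

The running maximum plus the log of the normalizer equals the log-sum-exp of all scores seen so far, which is why the state never overflows even for large scores.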

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
|---|---|---|---|
| Dual-Memory Systems | Splitting AI memory into experience-specific recall and abstract knowledge stores, with brain-inspired consolidation bridging the two. | Flat vector-store retrieval (RAG) that treats all memories uniformly without distinguishing episodic experiences from generalized knowledge | PRIME (2025), MemoryVLA (2025), Cognitive algorithms and systems of... (2026) |
| Cognitively-Grounded Memory Orchestration | Memory systems that actively consolidate, forget, and reorganize themselves, mimicking human sleep and forgetting, rather than passively accumulating information. | Static memory stores that grow without bound, causing redundancy, context drift, and increased hallucination risk | Memory Bear (2025), Memory as Ontology (2026) |
| Associative Memory Networks | Replacing opaque Transformer attention with transparent associative memory units that naturally decompose complex prediction tasks into interpretable sub-components. | Standard Transformer attention, which is opaque and degrades with many in-context examples or extreme context lengths | Memory Mosaics (2024), Memory Mosaics at scale (2025) |
| End-to-End Memory Networks | Replacing hard, supervised memory lookups with soft attention over an external memory, enabling end-to-end training with multi-hop reasoning. | Original Memory Networks that required strong supervision for each memory access step | End-To-End (2015) |
| Cognitive Load-Aware Attention | Treating an LLM's attention budget as analogous to human working memory capacity, and optimizing how that budget is spent on relevant versus irrelevant context. | Fixed linear position encoding (e.g., RoPE) that treats all tokens as equally positioned regardless of relevance | RePo (2025), Eliciting Attention on Relevant Contexts... (2025), Summarize Before You Speak with... (2026) |
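
The "Dual-Memory Systems" row separates experience-specific recall from abstract knowledge, with consolidation bridging the two. A minimal sketch of that split follows. It assumes each episode arrives with hand-written trait tags (a real system would have an LLM produce them) and promotes a tag to semantic memory once it recurs. The class name, the `min_support` rule, and the keyword match used for episodic retrieval are illustrative assumptions, not the design of PRIME or MemoryVLA.

```python
from collections import Counter


class DualMemory:
    """Episodic log plus a semantic store distilled from recurring tags."""

    def __init__(self, min_support: int = 2) -> None:
        self.episodic: list = []           # (text, tags) pairs, one per experience
        self.semantic: set = set()         # traits that recur across episodes
        self.min_support = min_support     # assumed promotion threshold

    def record(self, text: str, tags: set) -> None:
        self.episodic.append((text, frozenset(tags)))

    def consolidate(self) -> None:
        """Promote tags seen in at least `min_support` episodes to semantic memory."""
        counts = Counter(tag for _, tags in self.episodic for tag in tags)
        self.semantic = {tag for tag, n in counts.items() if n >= self.min_support}

    def retrieve(self, query: str) -> dict:
        """Semantic traits are always returned; episodes only on word overlap."""
        words = set(query.lower().split())
        episodes = [text for text, _ in self.episodic
                    if words & set(text.lower().split())]
        return {"traits": sorted(self.semantic), "episodes": episodes}
```

With two exam-related episodes recorded and consolidated, a query about what to watch tonight matches no episode by wording yet still receives the distilled traits, which is how the dual-memory design sidesteps the lack of lexical overlap.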

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
|---|---|---|---|
| RULER (Long-Context QA) | Average Accuracy / Exact Match | 12.3-14.8% higher than Transformers at 32k context | Memory Mosaics at scale (2025) |
| LIBERO (Robotic Manipulation Simulation) | Success Rate (%) | 96.5% | MemoryVLA (2025) |
| bAbI QA Tasks | Mean Error Rate (%) | 3.2% mean error | End-To-End (2015) |

⚠️ Known Limitations (5)

  • Cognitive memory architectures add significant engineering complexity—maintaining dual stores, consolidation pipelines, and forgetting curves requires careful tuning and may not generalize across domains without adaptation. (affects: Dual-Memory Systems, Cognitively-Grounded Memory Orchestration)
    Potential fix: Automated hyperparameter tuning for consolidation schedules and decay rates; meta-learning approaches that adapt memory parameters to domain characteristics.
  • Evaluation of cognitive memory remains inadequate—most benchmarks still rely on factual recall, and even LoCoMo-Plus covers only limited types of implicit constraints, leaving many aspects of cognitive memory untested. (affects: Constraint-Consistency Evaluation, Dual-Memory Systems)
    Potential fix: Developing richer benchmarks that test procedural memory, emotional memory, and cross-modal memory transfer, as called for by multiple survey papers.
  • Associative memory networks (Memory Mosaics) match Transformers on standard benchmarks but have not yet been validated on the full range of downstream tasks where Transformers dominate, limiting confidence in their generality. (affects: Associative Memory Networks (Memory Mosaics))
    Potential fix: Broader evaluation on instruction-following, code generation, and multi-turn dialogue tasks; hybrid architectures combining associative memory with Transformer layers.
  • Memory governance and identity persistence (Memory-as-Ontology) remain purely conceptual with no quantitative evaluation, making it unclear whether constitutional memory can be implemented efficiently at scale. (affects: Cognitively-Grounded Memory Orchestration)
    Potential fix: Developing prototype implementations with measurable identity consistency metrics and formal verification of governance constraints.
  • Cognitive degradation in long-running agents (memory starvation, planner recursion) is identified but defenses are reactive rather than preventive, and have only been demonstrated on a limited set of models. (affects: Cognitive Degradation Lifecycle (QSAF))
    Potential fix: Proactive memory health monitoring integrated into agent training, and standardized stress-test benchmarks for long-running agent deployments.

💡 Cognitive memory principles face their most demanding test in Embodied and Robotic Memory, where physical agents must maintain spatial maps, manipulation histories, and navigation context while operating under real-time constraints that demand tight integration of perception, memory, and action.

📚

Embodied and Robotic Memory

What: This topic covers memory architectures for embodied agents and robots that must maintain, retrieve, and act upon information gathered from physical interactions over time, including navigation history, manipulation experience, spatial maps, and temporal state tracking.

Why: Embodied agents operating in real-world environments encounter fundamentally non-Markovian tasks—cooking a multi-step recipe, navigating back to a previously visited room, or correcting a failed grasp—where the current observation alone is insufficient. Effective memory systems bridge the gap between perception and long-horizon planning.

Baseline: Most conventional robotic policies and world models treat each observation independently (the Markov assumption), feeding only the current frame or a short fixed-length window of recent frames into the policy network, with no explicit mechanism to recall earlier events or spatial context.

  • Balancing short-term reactivity (fast motor control) with long-term recall (tracking task progress across minutes or hours)
  • Compressing massive, redundant sensory streams (video, point clouds, proprioception) into bounded memory without losing decision-critical information
  • Maintaining spatial consistency when revisiting previously observed environments, especially under perceptual drift in generative world models
  • Handling asynchronous information streams where visual perception updates slowly relative to high-frequency action control

🧪 Running Example

❓ A household robot is asked to 'clean the kitchen': it must wipe each counter, load the dishwasher, and take out the trash—a 15-minute task with multiple stages, occlusions, and revisits to previously cleaned areas.

Baseline: A standard VLA policy observing only the current camera frame cannot remember which counters were already wiped. It may re-clean the same counter, skip the dishwasher step entirely because it forgot the instruction sequence, or fail to navigate back to the trash can because it has no spatial map of prior movements.

Challenge: The robot must track semantic progress (which sub-tasks are done), maintain spatial awareness (where the trash can was seen 10 minutes ago), and handle occlusions (items hidden behind doors). The full video history is too large to fit in a context window.

✅ Multi-Scale Embodied Memory (MEM): Compresses recent video into dense visual tokens for reactive control while maintaining a running text summary of completed steps, letting the robot know 'counters done, dishwasher loaded, trash remaining' without storing 15 minutes of raw video.
✅ Perceptual-Cognitive Memory Bank (MemoryVLA): Stores both fine-grained visual snapshots (where the trash can was) and semantic summaries (task progress), using a consolidation mechanism that merges similar entries to stay within memory budget.
✅ Embodied-RAG: Builds a hierarchical spatial-semantic index of the kitchen from the robot's exploration, enabling it to answer 'where is the trash can?' by traversing from abstract room-level nodes down to specific location leaves.
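
The consolidation idea mentioned for MemoryVLA (merging similar entries to stay within a memory budget) can be sketched roughly as follows. This is a minimal illustration, not MemoryVLA's implementation: the class name `ConsolidatingMemory`, the cosine-similarity criterion, the adjacent-pair merge rule, and the count-weighted averaging are all assumptions chosen for brevity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ConsolidatingMemory:
    """Hypothetical bounded memory: when over budget, merge the most similar
    pair of temporally adjacent entries instead of dropping the oldest one."""

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.entries = []  # each entry: (vector, note, merged_count)

    def add(self, vector, note):
        self.entries.append((list(vector), note, 1))
        if len(self.entries) > self.budget:
            self._consolidate()

    def _consolidate(self):
        # Index of the adjacent pair (i, i + 1) with the highest similarity.
        i = max(
            range(len(self.entries) - 1),
            key=lambda k: cosine(self.entries[k][0], self.entries[k + 1][0]),
        )
        (vec_a, note_a, n_a), (vec_b, note_b, n_b) = self.entries[i], self.entries[i + 1]
        # Count-weighted average keeps the merged vector representative.
        merged = [(x * n_a + y * n_b) / (n_a + n_b) for x, y in zip(vec_a, vec_b)]
        self.entries[i:i + 2] = [(merged, f"{note_a}; {note_b}", n_a + n_b)]


# Toy usage: two near-duplicate "counter wiped" snapshots collapse into one entry,
# while the dissimilar trash-can observation is kept separately.
memory = ConsolidatingMemory(budget=2)
memory.add([1.0, 0.0], "wiped counter A")
memory.add([0.9, 0.1], "wiped counter B")
memory.add([0.0, 1.0], "trash can seen by the door")
print([note for _, note, _ in memory.entries])
```

Merging rather than evicting is the design point here: redundant observations shrink, but a rare and decision-critical one (the trash can location) survives even when it is old.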

📈 Overall Progress

Embodied memory evolved from simple observation stacking to biologically inspired dual-stream architectures with explicit 3D grounding, enabling robots to sustain coherent behavior over 15-minute horizons.

📂 Sub-topics

Memory for Robotic Manipulation

5 papers

Memory architectures integrated into vision-language-action (VLA) models that enable robots to condition manipulation actions on past observations, overcoming the Markov assumption for multi-step tasks.

Dual-Stream Memory Banks, Autoregressive Action Memory

Geometry-Grounded Spatial Memory

4 papers

Memory systems that store and retrieve 3D geometric information—point clouds, depth maps, or spatial coordinates—enabling consistent reconstruction and scene generation during revisits.

Geometry-Indexed Spatial Memory, Explicit Spatial Pointer Memory

Memory in World Models

2 papers

Techniques for extending the effective memory span of learned world models used in reinforcement learning and video generation, addressing catastrophic forgetting and perceptual drift.

Augmented Experience Replay, Memory Encoding Taxonomies

Retrieval-Augmented Embodied Memory

3 papers

Systems that structure an embodied agent's experience into retrievable databases—hierarchical semantic forests or progressive trajectory stores—enabling query-driven recall for navigation and task planning.

Semantic Forest Memory, Progressive Self-Experience Retrieval
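
To make the hierarchical retrieval idea concrete, here is a small sketch of top-down traversal over a spatial-semantic tree, from room-level nodes down to location leaves. It is not the Embodied-RAG implementation: `PlaceNode`, `retrieve`, and the word-overlap score (a stand-in for whatever learned relevance model a real system would use) are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PlaceNode:
    """Hypothetical tree node: inner nodes summarize a region, leaves hold a location."""
    summary: str
    children: List["PlaceNode"] = field(default_factory=list)
    location: Optional[Tuple[float, float]] = None


def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words (an assumption)."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(words(query) & words(text))


def retrieve(root: PlaceNode, query: str) -> PlaceNode:
    """Greedy descent: at each level, follow the child whose summary best matches."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
    return node


kitchen = PlaceNode(
    "kitchen counters dishwasher trash can",
    children=[
        PlaceNode("trash can beside the back door", location=(3.2, 0.5)),
        PlaceNode("dishwasher under the counter", location=(1.0, 2.0)),
    ],
)
living_room = PlaceNode(
    "living room sofa tv",
    children=[PlaceNode("sofa facing the tv", location=(6.0, 4.0))],
)
home = PlaceNode("home", children=[kitchen, living_room])

leaf = retrieve(home, "Where is the trash can?")
print(leaf.summary, leaf.location)
```

Greedy descent only inspects one branch per level, which is what keeps lookup cheap on large maps; the trade-off is that a poor match near the root cannot be recovered lower down.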

💡 Key Insights

💡 The Markov assumption is a critical bottleneck: memory-augmented policies outperform memoryless baselines by 26–39% on temporal tasks.

💡 Explicit 3D geometry outperforms appearance-based retrieval for spatial memory, offering O(1) lookup and 98% storage reduction.

💡 Biological memory models (working vs. episodic vs. semantic) transfer effectively to robot architecture design.

💡 Multi-scale memory—dense visual tokens for short-term, compressed text summaries for long-term—enables 15-minute robot tasks.

💡 Self-generated experience progressively builds better retrieval databases, eliminating the need for expert demonstrations.

💡 Decoupling action frequency from perception frequency via hybrid caching resolves a fundamental VLA design bottleneck.
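
The last insight, decoupling action frequency from perception frequency, can be illustrated with a generic control loop that refreshes visual features only every few steps while acting at every step on cached features plus a rolling action history. This is a simplified sketch under stated assumptions, not AR-VLA's hybrid key-value cache: `run_control_loop`, `encode_frame`, and `policy` are hypothetical names, and temporal re-anchoring is not modeled.

```python
from collections import deque


def run_control_loop(steps, perception_every, encode_frame, policy, history_len=4):
    """Act at every step; re-encode the camera frame only every `perception_every` steps."""
    cached_features = None
    history = deque(maxlen=history_len)  # rolling window of recent actions
    encoder_calls = 0
    actions = []
    for t in range(steps):
        if t % perception_every == 0:
            # Slow path: the expensive visual encoder runs at the low rate.
            cached_features = encode_frame(t)
            encoder_calls += 1
        # Fast path: the policy reuses cached features and conditions on past actions.
        action = policy(cached_features, tuple(history))
        history.append(action)
        actions.append(action)
    return actions, encoder_calls


# Toy stand-ins: the "feature" is the frame index, and the "action" adds the
# number of remembered actions to it.
actions, encoder_calls = run_control_loop(
    steps=10,
    perception_every=5,
    encode_frame=lambda t: t,
    policy=lambda features, past: features + len(past),
)
print(actions, encoder_calls)
```

With these toy settings the encoder runs twice for ten control steps, which is the point of the decoupling: control frequency no longer has to wait on perception latency.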


📅 Timeline

Early work (2024) established foundational memory primitives—spatial reconstruction memory and retrieval-augmented experience stores. By mid-2025, memory-augmented VLA models demonstrated dramatic gains on manipulation benchmarks, while 2026 brought scaling to long-horizon real-world tasks and principled solutions for asynchronous multi-modal memory streams.

2024-08 to 2024-12: Foundational memory architectures emerge for embodied agents, spanning spatial reconstruction, LLM-based planning memory, and retrieval-augmented experience
  • Spann3R (2024) introduced spatial memory with working and long-term components for real-time 3D reconstruction at 50+ FPS
  • KARMA (2024) integrated long-term and short-term memory into LLM-based embodied planning via memory-augmented prompting
  • Embodied-RAG (2024) built hierarchical semantic forests for kilometer-scale navigation retrieval, 7.38x faster than GraphRAG
  • P-RAG (2024) introduced progressive self-experience retrieval for embodied task planning without ground-truth demonstrations
2025-01 to 2025-08: Memory-augmented VLA models achieve breakthrough results on robotic manipulation benchmarks, and spatial memory advances to explicit 3D pointer representations
  • SAM2Act (2025) set a new state of the art on RLBench (86.8%) and dominated memory-dependent tasks with 94.3% on MemoryBench, 39.3% above the next best baseline
  • Point3R (2025) replaced implicit memory with explicit 3D spatial pointers and 3D RoPE, generalizing across 14 diverse reconstruction datasets
  • MemoryVLA (2025) introduced perceptual-cognitive consolidation, improving +26% over CogACT on real-world temporal manipulation tasks
2025-10 to 2026-03: Memory systems scale to long-horizon tasks (15+ minutes) and address frequency mismatch, continual learning, and training-time memory optimization
  • Memory Forcing (2025) introduced chained forward training on model rollouts with geometry-indexed retrieval, achieving 98.2% memory reduction for consistent scene generation
  • MEM (2026) combined factorized video encoding with LLM-managed text summaries to enable 15-minute robot tasks like full kitchen cleaning
  • AR-VLA (2026) proposed a hybrid key-value cache with dynamic temporal re-anchoring to resolve the frequency mismatch between fast control and slow perception
  • ARROW (2026) achieved 4x less forgetting in continual RL through dual replay buffers with reservoir sampling in world models
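
The ARROW entry mentions dual replay buffers with reservoir sampling. As a rough sketch of that combination (not ARROW's actual code), the snippet below pairs a small FIFO buffer of recent transitions with a fixed-size reservoir maintained by the classic Algorithm R, so that every item seen so far has an equal chance of being retained. The class name `DualReplay`, its parameters, and the sampling split are assumptions.

```python
import random
from collections import deque


class DualReplay:
    """Hypothetical dual buffer: FIFO for recent data, reservoir for long-term coverage."""

    def __init__(self, recent_size, reservoir_size, seed=0):
        self.recent = deque(maxlen=recent_size)   # short-term, plasticity
        self.reservoir = []                       # long-term, stability
        self.reservoir_size = reservoir_size
        self.seen = 0                             # total items observed so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.recent.append(item)
        self.seen += 1
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(item)
        else:
            # Algorithm R: the new item replaces a random slot with
            # probability reservoir_size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.reservoir_size:
                self.reservoir[j] = item

    def sample(self, n_recent, n_old):
        """Draw a mixed batch from both buffers (without replacement within each)."""
        recent = self.rng.sample(list(self.recent), min(n_recent, len(self.recent)))
        old = self.rng.sample(self.reservoir, min(n_old, len(self.reservoir)))
        return recent + old


buffer = DualReplay(recent_size=3, reservoir_size=5)
for step in range(100):
    buffer.add(step)
batch = buffer.sample(n_recent=2, n_old=2)
print(list(buffer.recent), sorted(buffer.reservoir), batch)
```

Both buffers have fixed capacity, so memory stays constant however long the agent runs, while the reservoir still holds a uniform sample of the whole history rather than only its tail.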

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Dual-Stream Memory Banks | Separate fast-decaying perceptual memory from slow-consolidating semantic memory, mimicking the biological distinction between working memory and hippocampal long-term storage. | Fixed-window observation stacking, which discards all history beyond a short horizon and cannot track multi-step task progress. | SAM2Act (2025), MemoryVLA (2025), MEM (2026), KARMA (2024)
Geometry-Grounded Spatial Memory | Anchor memory to physical 3D coordinates so that spatial proximity governs storage, retrieval, and fusion, eliminating appearance-based drift. | Implicit neural memories with fixed capacity that lose information from earlier frames and require expensive global optimization for alignment. | 3D Reconstruction with Spatial Memory (2024), Point3R (2025), Memory Forcing (2025), Video World Models with Long-term... (2025)
Autoregressive Action Memory with Hybrid Caching | Maintain a rolling action history as a causal sequence with modality-specific caching strategies that respect the different update frequencies of vision and proprioception. | Standard VLA models that treat each observation independently ('Markovian amnesia'), resetting temporal context at every control step. | AR-VLA (2026)
Retrieval-Augmented Embodied Memory | Index embodied experience hierarchically and retrieve relevant subsets on demand, extending text-based RAG paradigms to handle spatial, visual, and trajectory data. | Naive approaches that either dump all history into the context window (creating noise and latency) or discard it entirely. | Embodied-RAG (2024), Progressive Retrieval Augmented Generation for... (2024), EmBARDiment (2024)
Augmented Experience Replay for World Models | Combine short-term plasticity and long-term stability buffers with reservoir sampling and episode splicing to enable continual learning in world models without growing memory. | Standard experience replay with a single buffer, which either forgets old tasks or becomes prohibitively large. | ARROW (2026)
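
As one way to picture geometry-grounded memory, the sketch below keys stored features by quantized 3D coordinates, so that a revisit to the same place reads or overwrites the same cell regardless of how the scene looks from the new viewpoint. The papers listed above use their own representations (for example, explicit 3D pointers); the `VoxelMemory` class, the dictionary-based voxel hash, and the 0.5 m cell size are assumptions for illustration only.

```python
import math


class VoxelMemory:
    """Hypothetical spatial memory: one feature slot per voxel, addressed by position."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.cells = {}  # (i, j, k) voxel index -> stored feature

    def _key(self, xyz):
        # Quantize metric coordinates to an integer voxel index.
        return tuple(math.floor(c / self.voxel_size) for c in xyz)

    def write(self, xyz, feature):
        # Revisiting a location overwrites its cell, so storage is bounded
        # by the explored volume rather than by the number of frames seen.
        self.cells[self._key(xyz)] = feature

    def read(self, xyz):
        # Average-case constant-time lookup via the dictionary hash.
        return self.cells.get(self._key(xyz))


memory = VoxelMemory(voxel_size=0.5)
memory.write((1.2, 0.3, 0.0), "trash can")
memory.write((1.3, 0.4, 0.1), "trash can (second view)")  # same voxel, overwritten
print(memory.read((1.4, 0.1, 0.2)), memory.read((5.0, 5.0, 5.0)), len(memory.cells))
```

Because the key is derived from position alone, retrieval does not depend on appearance similarity, which is the property the table credits with eliminating appearance-based drift.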

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
MemoryBench | Success Rate (%) | 94.3% | SAM2Act (2025)
RLBench (18 tasks) | Average Success Rate (%) | 86.8% | SAM2Act (2025)
LIBERO | Success Rate (%) | 96.5% | MemoryVLA (2025)

⚠️ Known Limitations (5)

  • Memory consolidation heuristics are hand-designed (e.g., fixed similarity thresholds for merging entries), which may not generalize across task domains or time scales. (affects: Dual-Stream Memory Banks, Perceptual-Cognitive Memory Bank (MemoryVLA))
    Potential fix: Learnable consolidation policies that adapt merging thresholds based on task context or prediction error.
  • Spatial memory methods assume predominantly static environments; dynamic objects (moving people, shifting furniture) can corrupt the stored 3D representation and produce incorrect retrievals. (affects: Geometry-Grounded Spatial Memory, Explicit Spatial Pointer Memory)
    Potential fix: Decoupling static and dynamic scene components (as begun in Video World Models) and maintaining separate update schedules for each.
  • Most memory-augmented VLA evaluations are conducted in simulation or controlled lab settings; transfer to unstructured real-world environments with diverse lighting, clutter, and task variability remains underexplored. (affects: Dual-Stream Memory Banks, Autoregressive Action Memory with Hybrid Caching)
    Potential fix: Scaling real-world evaluation datasets and incorporating domain randomization during memory system training.
  • Retrieval-augmented embodied memory incurs latency during retrieval and may retrieve irrelevant experiences when the index is large or the query is ambiguous. (affects: Retrieval-Augmented Embodied Memory, Semantic Forest Memory)
    Potential fix: Adaptive retrieval budgets that scale with task complexity and learned relevance scoring to filter low-quality matches.
  • Neural weight-based memory mechanisms (e.g., Titans) consistently underperform cache-based and SSM-based alternatives in world model settings, collapsing under long-horizon imagination. (affects: Memory Encoding Taxonomies)
    Potential fix: Hybrid approaches combining neural weight memory for abstract summaries with explicit caches for detailed recall.

💡 With memory systems deployed across textual, cognitive, and embodied domains, the Analysis theme systematically evaluates their capabilities and limitations through cognitive benchmarks, safety audits, mechanistic interpretability studies, and hardware profiling to identify where current approaches fall short.

🧩

Analysis

What: Research focused on evaluating, benchmarking, and analyzing memory systems in LLM-based agents and neural architectures, spanning cognitive benchmarks, safety audits, mechanistic interpretability, and hardware profiling.

Why: Without rigorous evaluation, memory-augmented agents may appear capable on simple recall tasks while failing on real-world demands like dynamic updates, implicit reasoning, and adversarial robustness—stalling meaningful progress.

Baseline: Standard evaluation relies on static retrieval benchmarks (e.g., needle-in-a-haystack) and single-session QA, which test simple factual lookup but miss complex capabilities like state tracking, temporal reasoning, and cross-session knowledge transfer.

  • Bridging the gap between high retrieval recall (~90%) and low generation faithfulness (~60%) in memory-augmented systems
  • Evaluating dynamic memory capabilities (updating, forgetting, conflict resolution) rather than static factual recall
  • Ensuring memory systems are robust against adversarial manipulation (memory injection, intent legitimation) while maintaining utility
  • Reconciling fragmented terminology and inconsistent evaluation protocols across the rapidly growing agent memory field
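
The first challenge above, high retrieval recall alongside low generation faithfulness, is easier to see when the two are scored separately. The sketch below does that on toy data. It is not any benchmark's official scorer: the record fields `retrieved`, `gold`, and `answer_supported` (a per-example judgment, e.g. from a human annotator, of whether the generated answer is backed by the evidence) are hypothetical.

```python
def retrieval_recall(retrieved, gold):
    """Fraction of gold evidence ids that appear among the retrieved ids."""
    gold = set(gold)
    return len(gold & set(retrieved)) / len(gold) if gold else 1.0


def evaluate(examples):
    """Return (mean retrieval recall, fraction of faithful answers)."""
    if not examples:
        raise ValueError("need at least one example")
    recalls = [retrieval_recall(e["retrieved"], e["gold"]) for e in examples]
    faithful = [1.0 if e["answer_supported"] else 0.0 for e in examples]
    return sum(recalls) / len(examples), sum(faithful) / len(examples)


# Toy data: the right memories are always retrieved, but two of the three
# generated answers ignore or contradict them.
examples = [
    {"retrieved": ["s12", "s30", "s45"], "gold": ["s12", "s30"], "answer_supported": True},
    {"retrieved": ["s12", "s45"], "gold": ["s45"], "answer_supported": False},
    {"retrieved": ["s30", "s05"], "gold": ["s30"], "answer_supported": False},
]
recall, faithfulness = evaluate(examples)
print(f"recall={recall:.2f} faithfulness={faithfulness:.2f}")
```

Reporting only the first number would make this toy system look perfect; the second number is what exposes that retrieved evidence is not being used.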

🧪 Running Example

❓ A personal assistant is asked 'What restaurant should we try this weekend?' after 50 prior sessions where the user mentioned becoming vegetarian (session 12), developing a nut allergy (session 30), and moving to a new city (session 45).

Baseline: A static retrieval system might recall the user likes Italian food from session 5 and recommend a steakhouse near the old address, failing to track dietary evolution or the location change because it matches based on keyword similarity rather than temporal state.

Challenge: This requires integrating implicit constraints (vegetarian + nut allergy), updating stale facts (old city to new city), and reasoning about preferences that evolved across sessions—none of which is captured by standard factual recall benchmarks.

✅ Cognitive-Science-Grounded Benchmarking: LoCoMo-Plus and MemBench would detect this failure by testing constraint-consistency and reflective memory, revealing that the agent can recall facts but cannot apply implicit constraints.
✅ Streaming Evaluation with Interdependent Tasks: MemoryArena and AMA-Bench evaluate agents on progressive multi-session tasks where later actions depend on earlier memories, exposing the inability to maintain evolving user state.
✅ Adversarial Memory Safety Testing: PS-Bench would test whether injecting a false memory like 'The user loves trying exotic meats' could override the vegetarian constraint, revealing intent legitimation vulnerabilities.

📈 Overall Progress

Memory evaluation shifted from simple factual recall to demanding cognitive reasoning, dynamic updates, and adversarial robustness—revealing current systems are far less capable than retrieval metrics suggest.

📂 Sub-topics

Memory Benchmark Design

18 papers

Papers creating evaluation frameworks and benchmarks that measure agent memory capabilities beyond simple factual retrieval, including cognitive memory, structural organization, and streaming evaluation.

Cognitive-Science-Grounded Benchmarking, Streaming Evaluation with Interdependent Tasks, Programmable Atomic Memory Tests

Survey & Taxonomy Analysis

12 papers

Comprehensive surveys that organize the fragmented agent memory landscape into unified taxonomies, defining memory forms, functions, operations, and evaluation protocols.

Unified Memory Taxonomies, Operational Lifecycle Analysis

Safety & Adversarial Analysis

8 papers

Research evaluating security vulnerabilities in memory systems, including memory injection attacks, intent legitimation through personalization, hidden state poisoning, and prompt interference detection.

Adversarial Memory Safety Testing, Internal Activation Fingerprinting

Mechanistic & Theoretical Analysis

10 papers

Papers investigating how neural networks internally store, retrieve, and process information, including position bias theory, feed-forward layers as memory, and latent learning mechanisms.

Mechanistic Interpretability of Memory, Geometric Analysis of Attention

Hardware & Infrastructure Analysis

6 papers

Analysis of physical memory bottlenecks in AI hardware, including the memory wall problem, GPU profiling for LLM inference, DRAM vulnerabilities, and processing-in-memory architectures.

Hardware Memory Wall Analysis, GPU-Level Profiling

Personalization & User Modeling Analysis

11 papers

Evaluation of how memory systems support personalization, including benchmarks for dynamic user profiling, inter-user difference modeling, and long-term preference tracking.

Personalization Memory Evaluation, Inter-User Difference Analysis

💡 Key Insights

💡 Agents with near-perfect static memory scores fail catastrophically on tasks requiring active memory-guided decisions across sessions.

💡 The 'lost in the middle' retrieval bias is a geometric property of causal attention present at initialization, not a learned artifact.

💡 Memory persistence creates novel attack surfaces where dormant injections bypass all traditional input-level safety defenses.

💡 Retrieval recall exceeding 90% masks generation faithfulness dropping to ~60%, creating a dangerous illusion of capability.

💡 Frontier models achieve only ~50% on dynamic personalization, succeeding at static facts but failing on evolving user states.

💡 Hardware memory bandwidth—not compute—is the primary LLM inference bottleneck, with a widening scaling gap per generation.
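
As a rough worked example of that scaling gap, take the rates this page reports for MemWall (compute growing 3.0× and bandwidth 1.6× every two years) and assume, purely for illustration, that both rates held constant across the full 20-year window, i.e. ten two-year periods:

```latex
\[
\frac{\text{compute growth}}{\text{bandwidth growth}}
= \frac{3.0^{10}}{1.6^{10}}
= 1.875^{10}
\approx \frac{59{,}049}{110}
\approx 537
\]
```

Under that constant-rate assumption, peak compute would have outgrown memory bandwidth by a factor of roughly 500 over two decades, which gives a sense of why bandwidth rather than compute tends to limit LLM inference.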


📅 Timeline

The field evolved from foundational mechanism analysis (2021-2023) through the first long-term benchmarks and taxonomy proposals (2024-2025) to sophisticated cognitive evaluations and safety audits (2025-2026) that expose the critical gap between retrieval recall and genuine understanding.

2021-01 to 2023-12: Foundational memory mechanisms, hardware vulnerability analysis, and early agent architectures
  • Geva et al. (FFN-as-KV, 2021) demonstrated that transformer feed-forward layers function as key-value memory stores with compositional retrieval across layers
  • Kim et al. (RowHammer, 2023) provided a comprehensive retrospective on DRAM read disturbance, showing >80% of commodity modules are vulnerable with worsening trends over a decade
  • MemGPT (2023) pioneered virtual context management treating the context window as RAM, achieving +60.4% accuracy on deep memory retrieval
  • Talk2Drive (2023) demonstrated the first LLM-based personalization system controlling a real autonomous vehicle, reducing driver takeover by 75.9%
2024-01 to 2024-12: First long-term benchmarks, hardware wall quantification, and taxonomy emergence
  • LoCoMo (2024) created the first very long-term conversational memory benchmark (300+ turns), revealing models lag behind humans by 56-73% on memory tasks
  • MemWall (2024) quantified the fundamental 20-year divergence between compute (3.0×/2yr) and bandwidth (1.6×/2yr) scaling, establishing memory as the primary AI bottleneck
  • Zhang et al. (AgentMemSurvey, 2024) proposed a unified taxonomy for agent memory organized by sources, forms, and operations
  • MemSim (2024) introduced Bayesian-causal data synthesis for reliable memory evaluation with >99% ground truth correctness
  • AgentS (2024) introduced experience-augmented hierarchical planning with narrative and episodic memory, achieving 83.6% relative improvement on OSWorld
2025-01 to 2025-09: Personalization gaps, safety vulnerabilities, and taxonomy consolidation
  • CM-MI (2025) demonstrated >80% attack success via memory injection in DeFi agents, showing traditional prompt defenses fail against persistent memory attacks
  • PersonaMem (2025) showed frontier models achieve only ~50% on dynamic personalization despite strong static recall (60-70%)
  • OpTaxonomy (2025) defined six atomic memory operations and identified KV cache optimization as a rapidly emerging research hotspot via Relative Citation Index analysis
  • MemQuadruple (2025) proposed a unified four-part memory definition linking mechanism, evaluation, and governance
  • GenAgents1K (2025) validated agent memory at scale with 1,000 agents achieving 0.85 correlation with human survey responses
2025-10 to 2026-03: Interdependent benchmarks, cognitive evaluation, and theoretical breakthroughs
  • MemoryArena (2026) showed empirically that agents with saturated static memory scores fail on interdependent multi-session tasks requiring active memory-guided decisions
  • LoCoMo-Plus (2026) showed cognitive memory collapses across all models when implicit constraints lack lexical overlap with queries
  • LitM-Birth (2026) mathematically proved the U-shaped position bias exists at initialization before any training, achieving 0.99 Spearman correlation with empirical data
  • AMA-Bench (2026) demonstrated existing memory systems significantly underperform long-context baselines in agentic scenarios due to lossy compression
  • Forms-Functions (2026) unified the field by clearly distinguishing agent memory from RAG and context engineering

🔬 Key Methods

Method | Key Innovation | Improves On | Papers
Cognitive-Science-Grounded Benchmarking | Memory evaluation should test cognitive capabilities (inference, constraint adherence, temporal reasoning) rather than just factual retrieval accuracy. | Static needle-in-a-haystack and single-turn QA benchmarks that conflate retrieval with understanding | Evaluating Very Long-Term Conversational Memory... (2024), LoCoMo-Plus (2026), MemBench (2025), Evaluating Memory in LLM Agents... (2025)
Streaming Evaluation with Interdependent Tasks | True memory capability is demonstrated not by recall accuracy but by using past information to improve future task completion in evolving environments. | Static benchmarks that evaluate memorization and action in isolation | Benchmarking Agent Memory in Interdependent... (2026), AMA-Bench (2026), Evo-Memory (2025)
Unified Memory Taxonomies | Agent memory must be analyzed along multiple orthogonal dimensions (form, function, and lifecycle operations) to enable meaningful comparison across systems. | Ad hoc, application-specific descriptions of memory that conflate RAG, context engineering, and true agent memory | Memory in the Age of... (2026), Rethinking Memory in LLM based... (2025), Memory for Autonomous LLM Agents:... (2026), Memory in Large Language Models:... (2025)
Adversarial Memory Safety Testing | Memory persistence creates a new attack surface where adversaries can plant long-lived malicious context that evades input-level safety filters. | Traditional prompt injection defenses (spotlighting, delimiting) that only protect the immediate input, not persistent memory | Real AI Agents with Fake... (2025), When Personalization Legitimizes Risks: Uncovering... (2026), Arbiter (2026)
Mechanistic Interpretability of Memory | Understanding how neural networks physically implement memory reveals fundamental architectural constraints that training alone cannot overcome. | Treating neural networks as black boxes and attributing memory failures to insufficient training data or model size | Transformer Feed-Forward Layers Are Key-Value... (2021), Lost in the Middle at... (2026), Implicit Statistical Inference in Transformers:... (2026)

📊 Benchmark Results

Benchmark | Metric | Best Result | Paper
MemoryArena | Task Completion Rate | ~80% | Benchmarking Agent Memory in Interdependent... (2026)
LoCoMo | Memory QA Accuracy (relative to human) | ~44% of human performance | Evaluating Very Long-Term Conversational Memory... (2024)
PersonaMem | Multiple-choice Personalization Accuracy | ~50% | Know Me, Respond to Me:... (2025)

⚠️ Known Limitations (5)

  • Current benchmarks overwhelmingly focus on English text-based interactions, lacking coverage of multilingual, multi-modal, and non-textual memory scenarios, limiting the generalizability of findings to diverse real-world deployments. (affects: Cognitive-Science-Grounded Benchmarking, Streaming Evaluation with Interdependent Tasks)
    Potential fix: Expanding benchmark coverage to multilingual interactions and multi-modal memory, as demonstrated by LoCoMo's image-sharing capabilities and XPersona's cross-lingual efforts
  • Automated evaluation judges suffer from position bias, order bias, and self-preference bias, causing 'spurious significance' that may invalidate memory evaluation results and lead to overconfident conclusions. (affects: Unified Memory Taxonomies, Personalization Memory Evaluation)
    Potential fix: Using constraint-based evaluation that checks behavioral boundaries rather than matching reference answers, combined with human-in-the-loop validation as proposed in the memory quadruple framework
  • Most memory benchmarks use synthetic or controlled scenarios that lack the noise, ambiguity, and scale of real-world deployed agent interactions, potentially overstating system capabilities. (affects: Cognitive-Science-Grounded Benchmarking, Personalization Memory Evaluation)
    Potential fix: AMA-Bench's approach of combining expert-annotated real-world agent logs with scalable synthetic environments provides a template for bridging this realism gap
  • Memory safety evaluations are category-specific, testing aligned memory-query pairs (e.g., financial memories + financial crimes), making it unclear how attacks generalize across diverse domains and memory architectures. (affects: Adversarial Memory Safety Testing)
    Potential fix: Developing standardized cross-domain adversarial memory test suites covering diverse memory architectures, attack vectors, and deployment contexts
  • Multiple competing taxonomies (3D, operational, quadruple, forms-functions-dynamics) have been proposed without convergence on shared vocabulary, potentially perpetuating the fragmentation they aim to resolve. (affects: Unified Memory Taxonomies)
    Potential fix: Community adoption of a minimal shared ontology defining core terms (episodic, semantic, procedural) with extensible dimensions for domain-specific applications

💡 Analysis reveals the gaps and failure modes of current memory systems, which in turn motivates the creation of standardized Benchmarks—new datasets, evaluation frameworks, and metrics specifically designed to measure dynamic memory capabilities like temporal reasoning, preference tracking, and multi-session knowledge transfer.

🔬

Benchmark

What: This topic covers papers that introduce new benchmark datasets, evaluation frameworks, and metrics specifically designed to measure the memory capabilities of LLM-based agents and personalized assistants.

Why: Without rigorous, standardized benchmarks, it is impossible to meaningfully compare memory systems or identify which capabilities (e.g., temporal reasoning, preference tracking, memory updates) remain unsolved, slowing progress in building truly persistent AI agents.

Baseline: The conventional approach evaluates memory using static, single-session question-answering tasks or simple needle-in-a-haystack retrieval tests, which fail to capture dynamic memory operations like updates, forgetting, and multi-session reasoning.

  • Designing benchmarks that test dynamic memory operations (updates, overwrites, forgetting) rather than just static retrieval
  • Creating realistic long-horizon evaluation scenarios that reflect how users and agents interact over weeks or months
  • Distinguishing genuine memory capability from superficial pattern matching or recency bias in long contexts
  • Building evaluation metrics that capture implicit reasoning (e.g., inferring intent from preferences) beyond surface-level factual recall

🧪 Running Example

❓ A user has been chatting with an AI assistant over 6 months. Three months ago, they mentioned being vegetarian. Last week, they started eating fish. They now ask: 'What should I order at this new Italian restaurant?'

Baseline: A static retrieval system might return the older 'vegetarian' preference because it was mentioned more frequently, recommending a margherita pizza while ignoring the recent dietary change to pescatarian.

Challenge: The assistant must (1) retrieve relevant dietary preferences from hundreds of past sessions, (2) recognize the temporal ordering to prioritize the most recent preference, and (3) apply this updated preference to generate a contextually appropriate restaurant recommendation.

✅ LongMemEval: Tests exactly this scenario by embedding answer-critical evidence within hundreds of task-oriented sessions, requiring temporal reasoning and knowledge update capabilities to surface the most recent preference.
✅ PersonaMem: Evaluates whether the model tracks evolving user traits over chronological life events, specifically testing if models can incorporate the user's latest situation into recommendations.
✅ LoCoMo-Plus: Goes beyond factual recall to test if the assistant can adhere to implicit behavioral constraints (the user now eats fish) even when the trigger query has no lexical overlap with the original dietary discussion.
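
The failure in this running example comes down to how conflicting mentions are resolved. The sketch below contrasts a frequency-based rule (the baseline's mistake) with a recency-based rule on toy data. It is only an illustration of the temporal-ordering requirement, not how any of the benchmarks above score systems: the `mentions` records and both function names are hypothetical.

```python
from collections import Counter


def most_frequent(mentions, attribute):
    """Baseline-style resolution: pick the value mentioned most often."""
    values = [m["value"] for m in mentions if m["attribute"] == attribute]
    return Counter(values).most_common(1)[0][0]


def most_recent(mentions, attribute):
    """Temporal resolution: pick the value from the latest mention."""
    relevant = [m for m in mentions if m["attribute"] == attribute]
    return max(relevant, key=lambda m: m["day"])["value"]


# Toy history over roughly six months (day 0 = first session).
mentions = [
    {"day": 90, "attribute": "diet", "value": "vegetarian"},
    {"day": 100, "attribute": "diet", "value": "vegetarian"},
    {"day": 120, "attribute": "diet", "value": "vegetarian"},
    {"day": 173, "attribute": "diet", "value": "pescatarian"},
    {"day": 150, "attribute": "cuisine", "value": "Italian"},
]

print(most_frequent(mentions, "diet"), most_recent(mentions, "diet"))
```

A pure latest-wins rule is itself a simplification: it would mishandle a one-off exception such as "I had fish once while traveling", which is part of why the benchmarks above test knowledge updates with more than timestamp comparison.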

📈 Overall Progress

Memory benchmarks evolved from static single-session recall tests to dynamic, multi-session evaluations that test temporal reasoning, preference evolution, agentic task completion, and cognitive constraint adherence.

📂 Sub-topics

Conversational Memory Benchmarks

8 papers

Benchmarks evaluating long-term memory in multi-session dialogue settings, testing retrieval, temporal reasoning, and consistency over extended conversation histories.

LongMemEval LoCoMo LoCoMo-Plus MemSim

Personalization & User Profiling Benchmarks

10 papers

Benchmarks measuring how well LLMs track, internalize, and apply individual user preferences and evolving personas across interactions.

LaMP PrefEval PersonaMem RPEval

Agentic & Task-Based Memory Benchmarks

8 papers

Benchmarks that evaluate memory in autonomous agent settings where agents must accumulate experience across sequential tasks and use it to guide future decisions.

AMA-Bench MemoryArena Evo-Memory ATOD

Structural & Cognitive Memory Evaluation

5 papers

Benchmarks testing whether agents can organize knowledge into necessary hierarchies, track mutable states, perform memory rewrites, and handle composite reasoning operations.

StructMemEval Memory Rewriting Diagnostics Programmable Memory Tests

Safety & Adversarial Memory Benchmarks

2 papers

Benchmarks evaluating security vulnerabilities and safety risks that arise when agents rely on persistent memory, including memory injection attacks and intent legitimation.

PS-Bench CrAIBench

Surveys & Unified Evaluation Frameworks

6 papers

Survey papers and meta-analyses that propose unified taxonomies, evaluation protocols, and conceptual frameworks for understanding and assessing memory in AI systems.

Memory Quadruple Framework Forms-Functions-Dynamics Taxonomy Personalized Dialogue Taxonomy

💡 Key Insights

💡 Frontier models achieve only ~50% accuracy on evolving persona tracking, barely above random chance on challenging distractors.

💡 Static benchmark saturation does not transfer: agents excelling at factual recall fail on action-dependent memory tasks.

💡 Cognitive memory collapses when implicit constraints have no lexical overlap with trigger queries.

💡 Persistent memory creates novel attack surfaces, increasing safety violations by up to 243% through intent legitimation.

💡 Temporal reasoning remains the weakest memory capability, with models lagging humans by 73% on causal dynamics.

💡 Automated benchmark generation via Bayesian-causal synthesis achieves >99% correctness while maintaining diversity.

📅 Timeline

The field progressed through three waves: foundational personalization benchmarks (2023-2024), sophisticated long-context conversational and preference evaluation (2024-2025), and agentic task-stream benchmarks testing active memory usage in autonomous decision-making (2025-2026). Each wave revealed that state-of-the-art models dramatically underperform on increasingly realistic memory scenarios.

2023-04 to 2024-05 Foundational personalization benchmarks and early dialogue surveys
  • (LaMP, 2023) established the first comprehensive personalization benchmark with 7 diverse tasks, demonstrating +23.5% improvement from retrieval-augmented personalization over generic baselines
  • (Personalized Dialogue Survey, 2024) cataloged 22 datasets and identified PersonaChat as the dominant but limited benchmark, highlighting severe multilingual data scarcity
2024-02 to 2024-12 First-generation long-term conversational memory benchmarks
  • (LoCoMo, 2024) introduced 300+ turn dialogues grounded in Temporal Event Graphs, revealing that long-context LLMs lag behind humans by 56-73% on memory tasks
  • (PerLTQA, 2024) unified semantic and episodic memory in a three-stage evaluation framework with 141 synthetic characters
  • (LongMemEval, 2024) defined five core memory abilities and showed commercial systems suffer 30-60% accuracy drops vs. oracle retrieval
  • (MemSim, 2024) introduced Bayesian-causal data synthesis achieving >99% ground truth correctness for automated benchmark generation
  • (AI Persona, 2024) proposed dynamic learnable user dictionaries and PersonaBench for life-long personalization evaluation
2025-01 to 2025-12 Preference tracking, programmable evaluation, and emerging agentic benchmarks
  • (PrefEval, 2025) showed that preference-following accuracy drops below 10% in zero-shot settings, using 3,000 manually curated preference-query pairs
  • (Memory Framework, 2025) decomposed memory into atomic capabilities, showing GPT-4o drops to ~0.45 accuracy on composite Theory of Mind tasks
  • (PersonaMem, 2025) demonstrated frontier models achieve only ~50% accuracy on evolving persona tracking with up to 1M token histories
  • (CrAIBench, 2025) exposed memory injection attacks achieving >80% success on frontier models in DeFi tasks
  • (ETAPP, 2025) introduced proactivity as a core metric for evaluating personalized tool-augmented agents
  • (Evo-Memory, 2025) introduced streaming evaluation for test-time learning, with the ReMem agent achieving 0.92 success rate on navigation benchmarks
  • (Memory in LLMs, 2025; Agent Memory Survey, 2025) proposed unified taxonomies distinguishing agent memory from RAG and context engineering
2026-01 to 2026-03 Agentic, structural, and cognitive memory benchmarks reach maturity
  • (ATOD, 2026) introduced dependency-aware goal completion metrics and dual-store evaluation for multi-goal dialogue agents, achieving 25-30% higher accuracy than LLM judges
  • (PS-Bench, 2026) identified intent legitimation as a novel safety failure, showing personalization increases attack success by up to 243.7%
  • (RPEval, 2026) revealed an 'inverse scaling' effect where more capable models are worse at ignoring irrelevant preferences, and showed that pragmatic reasoning yields a ~35% improvement
  • (MemoryArena, 2026) shifted evaluation to interdependent multi-session tasks, revealing that agents saturating static benchmarks fail on action-dependent memory
  • (AMA-Bench, 2026) addressed agent-specific memory challenges with causality graphs; its accompanying AMA-Agent outperformed memory baselines by 11.16%
  • (StructMemEval, 2026) exposed that modern LLMs cannot spontaneously organize knowledge into required hierarchical structures
  • (LoCoMo-Plus, 2026) revealed cognitive memory collapse when testing implicit constraints with semantic disconnect from surface queries
  • (LifeSim, 2026) modeled users as BDI cognitive agents, showing GPT-5 drops 27.3 points from explicit to implicit intent recognition

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
| --- | --- | --- | --- |
| Long-Context Conversational Memory Evaluation | Embed answer-critical evidence within realistic, extended multi-session dialogue histories and measure recall, temporal reasoning, and update capabilities. | Static single-session QA benchmarks and simple needle-in-a-haystack retrieval tests | Evaluating Very Long-Term Conversational Memory... (2024), LongMemEval (2024), LoCoMo-Plus (2026), PerLTQA (2024) |
| Dynamic User Profile Benchmarking | Evaluate personalization by testing whether models prioritize the most recent user state over outdated historical information when both exist in context. | Static user profile datasets like PersonaChat where personas never change | LaMP (2023), Know Me, Respond to Me:... (2025), Do LLMs Recognize Your Preferences?... (2025), LifeSim (2026) |
| Agentic Task-Stream Memory Evaluation | Shift evaluation from passive recall accuracy to active task completion rate, where success depends on correctly leveraging memories from prior sessions. | Dialogue-centric memory benchmarks that ignore machine-generated action logs and environment interactions | Benchmarking Agent Memory in Interdependent... (2026), AMA-Bench (2026), Evo-Memory (2025) |
| Structural & Cognitive Memory Testing | Decompose memory into atomic capabilities (search, edit, state tracking, forgetting) and test each independently before combining them into composite evaluations. | Benchmarks that test only unstructured retrieval, which can be solved by simple similarity search without genuine memory organization | Evaluating Memory Structure in LLM... (2026), How Effectively Can AI Assistants... (2025), Memory Retention Is Not Enough... (2026) |
| Automated Benchmark Generation | Separate structured truth generation from text generation to prevent hallucination in benchmark datasets while maintaining diversity and scalability. | Manually curated benchmarks that are static, expensive to create, and susceptible to contamination | MemSim (2024), LifeSim (2026), How Effectively Can AI Assistants... (2025) |
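
The idea of separating structured truth generation from text generation can be illustrated with a toy generator. This is a sketch under stated assumptions, not the MemSim or LifeSim pipeline: the item and place vocabularies, the sentence templates, and the `generate_instance` function are all hypothetical. The point is only that the gold answer is read from the structured facts, never parsed back out of generated text, and that a seed yields a fresh but reproducible instance.

```python
import random

# Hypothetical vocabularies used only for this illustration.
ITEMS = ["passport", "umbrella", "laptop"]
PLACES = ["office", "car", "kitchen"]


def generate_instance(seed):
    """Build one memory-update test case: structured facts first, text second."""
    rng = random.Random(seed)
    item = rng.choice(ITEMS)
    old_place, new_place = rng.sample(PLACES, 2)  # two distinct places

    # Structured truth: an ordered list of state updates for the item.
    facts = [(item, old_place), (item, new_place)]

    # Text is rendered from the facts by fixed templates,
    # so it cannot contradict the structured truth.
    sessions = [
        f"I left my {item} in the {old_place}.",
        f"Update: I moved my {item} to the {new_place}.",
    ]
    question = f"Where is my {item} now?"
    answer = facts[-1][1]  # derived from the structure, not from the text
    return {"sessions": sessions, "question": question, "answer": answer}


instance = generate_instance(seed=0)
print(instance["question"], "->", instance["answer"])
```

Because every instance is a function of its seed, an evaluation run can draw new seeds each time, which is one simple way to reduce the contamination risk of fixed test sets.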

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
| --- | --- | --- | --- |
| LongMemEval | QA Accuracy | +5.4% QA accuracy, +9.4% recall | LongMemEval (2024) |
| LoCoMo | Accuracy (human-relative) | 44% of human performance on memory QA | Evaluating Very Long-Term Conversational Memory... (2024) |
| AMA-Bench | Average Accuracy | 57.22% | AMA-Bench (2026) |

⚠️ Known Limitations (5)

  • Most benchmarks rely on synthetic data that may not capture the full complexity and messiness of real-world user interactions, limiting ecological validity of results. (affects: PersonaMem, MemSim, LifeSim, MemoryArena)
    Potential fix: Hybrid pipelines combining LLM generation with human annotation and grounding in real behavioral data (as attempted by LoCoMo and LifeSim) can improve realism.
  • Evaluation metrics often reduce complex memory behaviors to single accuracy scores, missing nuanced failure modes like partial recall, outdated information retrieval, or correct reasoning with wrong evidence. (affects: LongMemEval, LoCoMo, LaMP)
    Potential fix: Constraint-consistency evaluation (as proposed by LoCoMo-Plus) and decomposed atomic capability testing (as in the programmable memory framework) offer more diagnostic alternatives.
  • Benchmark contamination risk is high since many test scenarios can be memorized during pre-training, making it unclear whether models genuinely reason or simply recall training data. (affects: LaMP, PrefEval, PerLTQA)
    Potential fix: Parametric, randomized test generation (as in the programmable memory framework) prevents overfitting by producing unique instances for each evaluation run.
  • Most benchmarks evaluate memory in isolation from the full agent loop, testing retrieval accuracy rather than downstream task performance, which can overestimate practical utility. (affects: LongMemEval, PerLTQA, StructMemEval)
    Potential fix: MemoryArena and AMA-Bench address this by evaluating memory through end-to-end task completion in interactive environments.
  • Cross-benchmark comparison is difficult due to inconsistent terminology, different memory type definitions, and varying evaluation protocols across papers. (affects: All benchmark methods)
    Potential fix: Unified taxonomies like the Memory Quadruple framework and Forms-Functions-Dynamics classification aim to standardize definitions and enable fair cross-benchmark comparison.

💡 Benchmarks quantify memory capabilities in controlled settings, but the real test comes in Application, where memory techniques—persistent storage, adaptive retrieval, and experience reuse—are deployed in demanding real-world domains from autonomous driving to code synthesis and multi-agent coordination.

🏆

Application

What: This topic covers papers that apply memory techniques—persistent storage, adaptive retrieval, caching, and experience reuse—to specific domains such as autonomous driving, travel planning, code synthesis, mathematical discovery, and multi-agent coordination.

Why: As LLM-based agents move from general chatbots to domain-specific autonomous systems, memory becomes the critical enabler for personalization, multi-session consistency, and learning from experience without retraining.

Baseline: The conventional approach uses a fixed-context LLM that treats each interaction independently, relying on in-context examples or fine-tuning rather than dynamic, persistent memory across sessions.

  • Bridging the gap between generic memory mechanisms and domain-specific requirements (e.g., causal constraints in agent workflows, real-time driving decisions)
  • Scaling memory systems to handle long-horizon, machine-generated interaction logs rather than short human dialogues
  • Ensuring memory security against adversarial manipulation while maintaining retrieval effectiveness
  • Balancing memory overhead with inference efficiency on resource-constrained devices

🧪 Running Example

❓ A user tells their autonomous vehicle 'I'm running late for a meeting' across multiple rides over several weeks, each time with different traffic conditions and route options.

Baseline: A standard LLM-based driving system would either ignore the abstract command entirely or interpret it literally each time without remembering that this user prefers aggressive acceleration and highway routes when expressing urgency, leading to repeated unsatisfactory experiences.

Challenge: The system must (1) interpret an abstract verbal command as concrete driving parameters, (2) remember past interactions and user feedback to personalize future responses, and (3) adapt to changing conditions while respecting learned preferences—all in a safety-critical real-time domain.

✅ Persistent Agent Memory (Talk2Drive): Stores interaction triples (Command, Policy, Feedback) and retrieves relevant past experiences, learning that this user wants faster driving when expressing time pressure, reducing takeover rate by 65%.
✅ Value-Driven Memory Retrieval (Q-Memory): Assigns learned utility scores to memory items so the system retrieves the most decision-relevant past experiences rather than just semantically similar ones.
✅ Mem0 Dynamic Memory Management: Automatically extracts salient preferences from conversations, consolidates them over time, and maintains a compact user profile that persists across sessions without growing unboundedly.
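
A persistent store of (Command, Policy, Feedback) triples can be sketched as an append-only JSON Lines file. This is a minimal illustration, not the Talk2Drive implementation: the `TripleMemory` class, the file layout, the word-overlap ranking, and the policy fields in the usage note are assumptions made for the example.

```python
import json
from pathlib import Path


class TripleMemory:
    """Append-only store of (command, policy, feedback) records on disk."""

    def __init__(self, path):
        self.path = Path(path)

    def add(self, command, policy, feedback):
        """Persist one interaction so it survives across sessions."""
        record = {"command": command, "policy": policy, "feedback": feedback}
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def load(self):
        """Read every stored record; an absent file means an empty memory."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def recall(self, command, k=3):
        """Return up to k past records ranked by word overlap with the command.

        Word overlap stands in for a real retriever (an assumption of this sketch).
        """
        query = set(command.lower().split())
        scored = []
        for rec in self.load():
            overlap = len(query & set(rec["command"].lower().split()))
            if overlap:
                scored.append((overlap, rec))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [rec for _, rec in scored[:k]]
```

In use, a ride would call `add("I'm running late for a meeting", {"max_speed_kph": 110}, "satisfied")` (hypothetical policy fields), and a later session would call `recall("running late again")` to retrieve that record before choosing driving parameters. No model weights change; all adaptation lives in the file.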

📈 Overall Progress

Memory in LLM applications evolved from simple interaction logging to adaptive, value-driven retrieval systems that learn what to remember from environmental feedback.

📂 Sub-topics

Personalized Domain-Specific Agents

4 papers

Papers applying memory to build agents personalized to specific domains including autonomous driving, travel planning, and survey editing, where memory enables adaptation to individual user preferences over time.

Persistent Agent Memory Modular Agent Architecture with Memory

Autonomous Agent Memory Systems

6 papers

Papers developing memory architectures for autonomous agents tackling complex multi-step tasks in code synthesis, mathematical discovery, and multi-agent coordination.

Value-Driven Memory Retrieval Causality-Aware Agent Memory Model Context Protocol

Efficient Memory Access and Caching

3 papers

Papers optimizing memory access patterns and caching strategies for efficient inference, including cross-layer index reuse, predictive caching for mobile devices, and processing-in-memory hardware.

Cross-Layer Index Reuse Predictive Hierarchical Caching

Memory Security and Theoretical Foundations

4 papers

Papers addressing adversarial vulnerabilities of memory-augmented agents and theoretical frameworks for understanding memory in neural and biological systems.

Embedding Space Poisoning Biological Key-Value Memory Theory

💡 Key Insights

💡 Memory-augmented agents dramatically outperform stateless baselines in domain-specific tasks, with reported takeover-rate reductions of 65–75% in personalized driving.

💡 Learned retrieval utility (Q-values) vastly outperforms semantic similarity for selecting relevant memories in agentic workflows.

💡 Existing memory systems designed for dialogue fail on autonomous agent tasks due to machine-generated, symbol-heavy interaction logs.

💡 Memory systems introduce new attack surfaces: embedding-space poisoning can hijack agent behavior with under 0.1% data contamination.

💡 Biological hippocampal memory and Transformer self-attention have been formalized as mathematically equivalent key-value systems, suggesting principled paths for future memory design.

💡 Proactive cache population during idle time dramatically improves hit rates for mobile and resource-constrained deployments.

📅 Timeline

Early work (2023–2024) demonstrated memory's value in specific domains like driving and travel. By 2025, scalable architectures (Mem0, PerCache) and standardized protocols (MCP) emerged. In 2026, the field shifted toward adaptive memory with learned retrieval policies (Q-Memory) and causality-aware storage (AMA-Agent), while benchmarks revealed that existing memory systems still fall far short on autonomous agent tasks.

2023-08 to 2023-12 Early domain applications: first real-world deployments of memory-augmented LLMs
  • (Talk2Drive, 2023) demonstrated the first LLM-based personalization system on a real autonomous vehicle, using memory of past interactions to reduce driver takeover rates by 75.9%
  • (BB-LDPC, 2023) achieved a 10x reduction in quantum memory overhead by encoding 12 logical qubits in only 288 physical qubits
2024-04 to 2024-09 Expanding applications and exposing vulnerabilities of memory systems
  • (AgentPoison, 2024) revealed that RAG-based agent memory can be hijacked through embedding-space poisoning, achieving 80%+ attack success across driving, QA, and healthcare agents
  • (TravelAgent, 2024) introduced modular agent architecture with dedicated memory for constraint-aware travel planning, achieving 90% rationality vs. 50% for GPT-4
  • (PIM-Opt, 2024) demonstrated that processing-in-memory hardware achieves 3.19x speedup over GPU for ML training by minimizing data movement
2025-01 to 2025-08 Scalable memory architectures and theoretical foundations
  • (Mem0, 2025) introduced dynamic memory management with graph enhancements, achieving a 26% relative improvement in LLM-as-Judge score over the OpenAI baseline while reducing latency by 91%
  • (PerCache, 2025) pioneered predictive hierarchical caching for mobile RAG, reducing latency by 34.4% through proactive cache population
  • (KV-Brain, 2025) formalized the mathematical equivalence between biological hippocampal memory and Transformer self-attention
  • (MCP, 2025) proposed a standardized protocol for shared context across multi-agent systems, acting as a universal connector for AI memory
2026-02 to 2026-03 Adaptive memory for autonomous agents and breakthrough domain applications
  • (EvoKernel, 2026) introduced Q-value-driven memory retrieval for NPU kernel synthesis, boosting correctness from 11% to 83% without fine-tuning
  • (AMA-Bench, 2026) established the first benchmark for long-horizon agent memory, revealing that existing memory systems significantly underperform on agentic tasks
  • (NeurosymCollab, 2026) used progressive disclosure persistent memory to enable multi-session mathematical discovery, proving new combinatorial bounds
  • (IndexCache, 2026) achieved 1.82x prefill speedup by reusing token selection indices across transformer layers
  • (BAO, 2026) formulated proactive agent training as multi-objective optimization with retrospective memory and prospective planning

🔬 Key Methods

| Method | Key Innovation | Improves On | Papers |
| --- | --- | --- | --- |
| Persistent Agent Memory | Store interaction triples (input, action, feedback) persistently and retrieve them to personalize future agent behavior without updating model weights. | Stateless LLM inference that treats each session independently, losing all context between interactions. | Personalized Autonomous Driving with Large... (2023), Mem0 (2025), Agentic Neurosymbolic Collaboration for Mathematical... (2026) |
| Value-Driven Memory Retrieval | Learn Q-values for memory items so the agent retrieves memories based on predicted utility rather than surface-level semantic similarity. | Semantic similarity-based retrieval (e.g., cosine similarity over embeddings) that ignores task-specific utility. | Towards Cold-Start Drafting and Continual... (2026) |
| Causality-Aware Agent Memory | Replace similarity-based memory storage with a causality graph that preserves state transitions, enabling retrieval of causally relevant rather than textually similar experiences. | Vector-based RAG and semantic similarity retrieval that lose causal structure when compressing agent interaction logs. | AMA-Bench (2026) |
| Cross-Layer Index Reuse | Important tokens remain stable across adjacent transformer layers, so token selection indices can be shared, eliminating 75% of indexer computations. | Standard sparse attention (e.g., DeepSeek Sparse Attention) where every layer independently runs a quadratic-cost token indexer. | IndexCache (2026) |
| Predictive Hierarchical Caching | Proactively predict and cache future queries at multiple levels of the RAG pipeline during device idle time, rather than reactively caching after queries arrive. | Reactive single-level caches (KV cache or semantic cache) that achieve low hit rates under sparse mobile query patterns. | PerCache (2025) |
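
Value-driven retrieval, as described in the table above, ranks memories by learned utility instead of similarity. The sketch below is one simple way to realize that idea and is not the method of the cited paper: it assumes a bandit-style update Q ← Q + α(r − Q), and the `ValueMemory` class, the learning rate, and the reward signal are hypothetical.

```python
class ValueMemory:
    """Keeps a utility estimate (Q-value) per memory item and ranks by it."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.q = {}  # memory id -> estimated utility

    def add(self, memory_id, initial_q=0.0):
        """Register a memory item without overwriting an existing estimate."""
        self.q.setdefault(memory_id, initial_q)

    def retrieve(self, candidates, k=1):
        """Return the k known candidates with the highest estimated utility.

        `candidates` would typically come from a cheap similarity prefilter.
        """
        known = [m for m in candidates if m in self.q]
        return sorted(known, key=lambda m: self.q[m], reverse=True)[:k]

    def update(self, memory_id, reward):
        """Move the estimate toward the observed reward: Q <- Q + a * (r - Q)."""
        old = self.q[memory_id]
        self.q[memory_id] = old + self.learning_rate * (reward - old)


memory = ValueMemory(learning_rate=0.5)
memory.add("tiling_hint")
memory.add("unrelated_note")
memory.update("tiling_hint", reward=1.0)  # the task succeeded when this item was used
print(memory.retrieve(["tiling_hint", "unrelated_note"], k=1))  # ['tiling_hint']
```

The design choice worth noting is that the reward comes from task outcomes (for example, whether generated code passed its tests), so an item that looks similar to the query but never helps will gradually sink in the ranking.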

📊 Benchmark Results

| Benchmark | Metric | Best Result | Paper |
| --- | --- | --- | --- |
| AMA-Bench | Accuracy (%) | 72.26% | AMA-Bench (2026) |
| KernelBench (Ascend C NPU) | Pass@k (%) | 83.0% | Towards Cold-Start Drafting and Continual... (2026) |
| LOCOMO | LLM-as-Judge Score | 26% relative improvement | Mem0 (2025) |

⚠️ Known Limitations (4)

  • Memory security is largely unaddressed: agents relying on external memory or RAG are vulnerable to poisoning attacks that manipulate behavior without detection, posing serious risks in safety-critical domains like autonomous driving and healthcare. (affects: Persistent Agent Memory, Modular Agent Architecture with Memory)
    Potential fix: Embedding-space anomaly detection, provenance tracking for memory items, and adversarial training of retrieval models.
  • Existing memory systems lose causal structure when compressing agent interaction logs, leading to retrieval of semantically similar but causally irrelevant experiences that mislead agent decision-making. (affects: Persistent Agent Memory, Modular Agent Architecture with Memory)
    Potential fix: Causality graphs (as in AMA-Agent) and structured memory schemas that preserve state transitions and dependencies.
  • Domain-specific memory applications are validated on narrow benchmarks and small-scale deployments, making it unclear how well they generalize across domains or scale to millions of users. (affects: Persistent Agent Memory, Value-Driven Memory Retrieval, Predictive Hierarchical Caching)
    Potential fix: Cross-domain transfer studies, standardized evaluation frameworks like AMA-Bench, and large-scale deployment experiments.
  • Memory overhead grows with interaction history, creating tension between comprehensive memory and inference efficiency, especially on mobile and edge devices where compute and storage are severely constrained. (affects: Persistent Agent Memory, Predictive Hierarchical Caching, Cross-Layer Index Reuse)
    Potential fix: Adaptive memory pruning, hierarchical storage tiers, and resource-aware scheduling that dynamically manages memory capacity.

💡 As memory applications proliferate across domains, Survey papers provide the essential synthesis—unifying fragmented terminology, establishing comprehensive taxonomies, and mapping the open challenges and future directions that span the entire memory research ecosystem.

🎯 Practical Recommendations

| Priority | Recommendation | Evidence |
| --- | --- | --- |
| High | Use reinforcement learning to train memory management policies rather than hand-crafting heuristics for what to store and discard. RL-trained policies consistently outperform static rules and can generalize to context lengths 10–400× beyond their training, making them more robust across diverse deployment scenarios. | Memory-R1 achieved +28.5% F1 with only 152 training examples, Mem-α generalized from 30K training to 400K+ tokens, and MemAgent extrapolated from 8K to 3.5M tokens with <5% loss. |
| High | Adopt layered memory architectures that separate episodic (event-specific) from semantic (abstracted knowledge) memory stores, with active consolidation and forgetting mechanisms. This cognitive-science-inspired approach consistently outperforms flat vector retrieval for both personalization and long-horizon tasks. | Synapse achieved 40.5 F1 on LoCoMo with 95% fewer tokens using episodic-semantic separation. PRIME demonstrated that semantic memory outperforms episodic for capturing user traits. LightMem reduced tokens by 38× while improving accuracy by 29.3% through sensory filtering and sleep-time consolidation. |
| High | Treat the LLM context window as a scarce cache resource (like CPU L1 cache) rather than unlimited storage. Apply OS-inspired demand paging to evict stale content to backing stores and page it back in on demand, dramatically reducing context waste. | Pichay reduced context consumption by 93% in production sessions with only 0.025% page fault rate. MemoryOS achieved +49% F1 on LoCoMo with three-tier memory hierarchy and heat-based eviction. |
| High | Use graph-based memory structures with spreading activation or multi-graph architectures for tasks requiring multi-hop reasoning. Graph retrieval surfaces structurally connected but semantically distant memories that standard vector similarity search misses entirely. | HippoRAG improved multi-hop QA by 20% at 10–20× lower cost. MAGMA with four disentangled graph layers (semantic, temporal, causal, entity) outperformed MemoRAG and Hi-Mem. AssoMem improved recall by 24.93% in similarity-dense scenarios. |
| High | Implement memory security governance for any agent with persistent memory, including verification protocols, ground-truth anchoring against immutable observation ledgers, and monitoring for intent legitimation attacks. Memory injection attacks succeed at 98% rates through normal queries alone. | MINJA demonstrated 98.2% memory injection success via query-only interaction. PS-Bench showed benign memories increase attack success rates by up to 243.7%. SSGM proposes decoupling memory evolution from governance with verification protocols. |
| Medium | Combine parametric (LoRA-based) and non-parametric (retrieval-based) memory for personalization. Per-user LoRA adapters capture implicit behavioral patterns while retrieval provides up-to-date factual context, and the combination consistently outperforms either approach alone. | OPPU achieved state-of-the-art across all 7 LaMP tasks by combining both approaches. Comparing RAG and PEFT confirmed they are complementary with +1.06% gain from combination. |
| Medium | Evaluate memory systems using action-coupled benchmarks that test whether recalled information improves downstream task completion, not just retrieval accuracy. Static recall scores dramatically overestimate real-world capability. | MemoryArena showed agents with saturated static scores fail on interdependent tasks. LoCoMo-Plus revealed cognitive memory collapses across all models when implicit constraints are tested. GPT-4o drops to ~45% on composite Theory of Mind tasks. |
| Medium | For edge and mobile deployments, persist agent KV caches in quantized (4-bit) format to disk and use predictive pre-fetching during idle time. This transforms multi-agent edge deployment from infeasible to practical with 136× latency reduction for agent switching. | Persistent Q4 KV Cache reduced time-to-first-token by 136× on Apple M4 Pro. PerCache reduced mobile RAG latency by 34.4% through proactive query prediction during idle time. |
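
The graph-based recommendation above relies on spreading activation to reach memories that are connected to the query but not textually similar to it. A minimal sketch follows; it is not the retrieval procedure of HippoRAG, MAGMA, or AssoMem, and the `spread_activation` function, the decay factor, the step count, and the toy graph are assumptions for illustration.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """Propagate activation from seed nodes along edges for a fixed number of steps.

    graph: dict mapping a node to the list of its neighbour nodes.
    seeds: dict mapping a node to its initial activation.
    Each hop passes on `decay` times the energy that arrived in the previous hop.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbour in graph.get(node, []):
                passed = energy * decay
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + passed
        for node, energy in next_frontier.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = next_frontier
    return activation


# Toy memory graph: the query mentions a restaurant, not a diet.
graph = {
    "italian_restaurant": ["menu_choice"],
    "menu_choice": ["user_diet"],
    "user_diet": [],
}
scores = spread_activation(graph, {"italian_restaurant": 1.0})
print(scores)  # {'italian_restaurant': 1.0, 'menu_choice': 0.5, 'user_diet': 0.25}
```

Here the `user_diet` node receives activation after two hops even though it shares no words with the seed, which is the behaviour that plain vector similarity search tends to miss.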

🔑 Key Takeaways

🧠

Memory Is Now a Learnable Skill

Memory management has shifted from a passive engineering problem to an active cognitive capability that agents can learn through reinforcement learning. RL-trained memory policies that learn what to store, update, and delete from task outcomes consistently outperform hand-crafted heuristics. Remarkably, these policies generalize dramatically—models trained on 30K tokens perform well at 400K+, and 8K training context extrapolates to 3.5M tokens.

Agents that learn to forget outperform agents that remember everything.

🏗️

OS Concepts Power AI Memory

Operating system memory management principles—virtual memory paging, demand loading, cache hierarchies, and context switching—transfer remarkably well to LLM memory. PagedAttention eliminated 60–80% KV cache waste and became the serving standard, while Pichay's demand paging reduces agent context consumption by 93%. This OS-to-AI translation has become one of the most productive paradigms in the field.

The best AI memory systems are built like operating systems, not databases.
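
Demand paging for a context window can be sketched as a small least-recently-used pager. This is an illustration of the general OS idea only, not Pichay or MemoryOS: the `ContextPager` class, the whitespace token count, and the LRU policy are assumptions, and a page larger than the whole budget is simply kept resident on its own.

```python
from collections import OrderedDict


class ContextPager:
    """Keeps roughly `capacity` tokens resident; evicted pages go to a backing store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # page id -> (text, token count), in LRU order
        self.backing = {}              # evicted pages, still recoverable
        self.faults = 0                # number of page-ins from the backing store

    def _used(self):
        return sum(tokens for _, tokens in self.resident.values())

    def _evict_until_fits(self, needed):
        # Evict least recently used pages until the new page fits.
        while self.resident and self._used() + needed > self.capacity:
            page_id, page = self.resident.popitem(last=False)
            self.backing[page_id] = page

    def put(self, page_id, text):
        """Add or replace a page in the resident set."""
        tokens = len(text.split())  # crude whitespace token count (assumption)
        self.resident.pop(page_id, None)
        self.backing.pop(page_id, None)
        self._evict_until_fits(tokens)
        self.resident[page_id] = (text, tokens)

    def get(self, page_id):
        """Return a page's text, paging it back in on a fault."""
        if page_id in self.resident:
            self.resident.move_to_end(page_id)  # mark as recently used
            return self.resident[page_id][0]
        text, tokens = self.backing.pop(page_id)  # raises KeyError if never stored
        self.faults += 1
        self._evict_until_fits(tokens)
        self.resident[page_id] = (text, tokens)
        return text
```

Only the resident pages would be serialized into the prompt; the ratio of `faults` to total `get` calls plays the role of the page fault rate mentioned above.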

🔓

Persistent Memory Creates New Attack Surfaces

As memory systems become more capable, they introduce novel security vulnerabilities that traditional prompt-level defenses cannot address. Memory injection attacks succeed at 98% rates through normal queries alone. Even benign personal memories can bypass safety filters through 'intent legitimation,' increasing attack success by up to 243%. Memory governance with verification protocols and ground-truth anchoring is essential for safe deployment.

Every memory an agent stores is a potential weapon an adversary can exploit.

📊

Static Benchmarks Mask Real Failures

Agents scoring near-perfectly on standard memory recall benchmarks fail dramatically when memory must actively guide multi-session decisions. Frontier models achieve only ~50% on dynamic personalization tasks and lag humans by 56–73% on long-term conversational memory. The gap between retrieval recall (90%+) and generation faithfulness (~60%) creates a dangerous illusion of capability that action-coupled evaluation frameworks are now exposing.

High retrieval accuracy hides the fact that models cannot actually use what they remember.

🧬

Cognitive Science Inspires Best Architectures

The most effective memory systems consistently draw from cognitive science—separating episodic from semantic memory, implementing Ebbinghaus forgetting curves for active decay, using hippocampal-inspired graph indexing for associative retrieval, and applying spreading activation for multi-hop reasoning. These biologically-grounded designs outperform engineered alternatives, with Synapse achieving 95% token reduction while improving accuracy through cognitive-inspired dual-layer graph dynamics.

The brain's memory blueprint remains the most reliable guide for building AI memory.
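
Forgetting-curve decay is often written as R = exp(-t / S), where t is the time since last access and S is a strength term. The sketch below uses that commonly cited exponential form as an assumption; the source does not specify which formulation any particular system uses, and the `retention` and `prune` helpers and the threshold value are hypothetical.

```python
import math


def retention(elapsed_hours, strength):
    """Exponential forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed_hours / strength)


def prune(memories, now_hours, threshold=0.3):
    """Drop memories whose estimated retention fell below the threshold.

    memories: dict mapping a memory id to (last_access_hours, strength).
    """
    return {
        memory_id: (last_access, strength)
        for memory_id, (last_access, strength) in memories.items()
        if retention(now_hours - last_access, strength) >= threshold
    }


memories = {
    "old_weak": (0.0, 5.0),     # not touched for 24 hours, low strength
    "fresh_weak": (20.0, 5.0),  # touched 4 hours ago
    "old_strong": (0.0, 50.0),  # old, but reinforced many times
}
print(sorted(prune(memories, now_hours=24.0)))  # ['fresh_weak', 'old_strong']
```

A fuller system would also raise a memory's strength each time it is recalled, so that frequently used items decay more slowly; that reinforcement step is left out here.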

Small Models Beat Giants with Smart Memory

Well-designed memory architectures consistently enable small models to outperform much larger ones. A 4B model with RL-trained memory outperforms GPT-5 on personalization using 16× fewer tokens. External procedural memory built in 56 seconds outperforms models 10× larger. A 7B model with dual memory surpasses GPT-4 on tool use. Memory architecture matters more than model scale for persistent tasks.

The right memory can make a small model smarter than a giant one.

🔭 Research Opportunities

Unified memory evaluation frameworks that test dynamic operations (updating, forgetting, conflict resolution) across conversational, agentic, and multi-modal settings, rather than the fragmented static recall benchmarks that currently dominate the field.

Current benchmarks are scattered across different tasks, metrics, and LLM backends, making cross-method comparison nearly impossible. Static recall scores overestimate capability by 30–40%, and most benchmarks ignore critical operations like memory rewriting and selective forgetting that are essential for real-world deployment.

Difficulty: Medium Impact: High

Privacy-preserving memory architectures that provide provable data minimization guarantees while maintaining personalization quality, addressing the growing tension between memory capability and user privacy as agents store increasingly sensitive personal information.

Current memory systems have no established mechanisms for memory governance, user consent, or secure inheritance when models are upgraded. Memory extraction attacks can recover private information stored in agent memory, and the field lacks formal privacy frameworks for persistent memory.

Difficulty: High Impact: High

Multi-modal memory recall systems that can retrieve and reason over visual, audio, and video memories from past interactions, extending beyond the text-only focus of current memory architectures to match how humans naturally encode and recall experiences.

Most memory evaluation benchmarks focus exclusively on text-based recall, yet real-world personal assistants capture vast streams of multi-modal data (photos, videos, audio). Only Pensieve has addressed multi-modal memory QA, with a 14% improvement over standard approaches, suggesting substantial untapped potential.

Difficulty: High | Impact: High

Memory consistency protocols for multi-agent systems, analogous to cache coherence in multiprocessor hardware, that guarantee agents see up-to-date and non-contradictory shared information when reading and writing concurrently.

As multi-agent deployments grow, the lack of formal consistency guarantees means agents can act on stale or contradictory information. Computer architecture has decades of cache coherence protocols that could be adapted, but no existing agent system provides formal consistency guarantees.

Difficulty: High | Impact: High
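
One simple way to see what a consistency guarantee could look like is an optimistic version check: each shared entry carries a version number, and a write is rejected if the entry changed after the writer last read it. This is a deliberately simpler technique than a full hardware-style coherence protocol, and the `VersionedSharedMemory` class and `StaleWriteError` below are hypothetical names invented for this sketch.

```python
import threading


class StaleWriteError(Exception):
    """Raised when a writer's view of an entry is out of date."""


class VersionedSharedMemory:
    """Hypothetical shared store: every key carries a version counter."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, version)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        """Commit only if nobody else has written since the caller's read."""
        with self._lock:
            _, current = self._data.get(key, (None, 0))
            if current != expected_version:
                raise StaleWriteError(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._data[key] = (value, current + 1)
            return current + 1


store = VersionedSharedMemory()

# Two agents read the same (empty) entry at version 0.
_, v_a = store.read("plan")
_, v_b = store.read("plan")

store.write("plan", "agent A: book flight", expected_version=v_a)

try:
    store.write("plan", "agent B: book train", expected_version=v_b)
    conflict_detected = False
except StaleWriteError:
    conflict_detected = True  # agent B must re-read before retrying

print(store.read("plan"), conflict_detected)
```

Agent B's write is rejected because it was based on a stale read, so it cannot silently overwrite agent A's plan. Stronger guarantees, such as bounding how long a stale read can persist, would need an invalidation or update protocol on top of this.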

Implicit preference extraction and cognitive memory that can capture user preferences expressed through behavior rather than explicit statements, closing the current 27-point gap between explicit and implicit intent recognition observed in frontier models.

Frontier models achieve only ~50% on dynamic personalization tasks requiring evolving user tracking, and cognitive memory collapses across all models when implicit constraints lack lexical overlap with queries. RL-trained memory shows promise (PersonaMem-v2 outperforms GPT-5), but the general problem remains far from solved.

Difficulty: Medium | Impact: High

Transferable memory policies that generalize across different LLM architectures, domains, and deployment environments without requiring retraining, addressing the current limitation that RL-based memory management is tightly coupled to specific models and task distributions.

While RL-trained memory policies show impressive within-distribution generalization (from 8K-token training to 3.5M-token inputs), cross-architecture and cross-domain transfer remains largely unexplored. Universal memory models like NAMMs that operate on attention patterns rather than token embeddings offer a promising direction.

Difficulty: Medium | Impact: Medium

🏆 Benchmark Leaderboard

LoCoMo

Long-term conversational memory across 300+ turn multi-session dialogues, testing factual recall, temporal reasoning, multi-hop inference, and adversarial robustness (Metric: F1 Score)

| Rank | Method | Score | Notes | Paper | Year |
| --- | --- | --- | --- | --- | --- |
| 🥇 | MemoryOS (segmented paging + heat eviction) | +49.11% F1 over baselines | +49% average F1 improvement using GPT-4o-mini | MemoryOS (2025) | 2025 |
| 🥈 | Synapse (spreading activation + lateral inhibition) | 40.5 Weighted F1 | +21.6% over A-Mem with 95% fewer tokens | Synapse (2026) | 2026 |
| 🥉 | Memory-R1 (GRPO-based RL) | +28.5% F1 over MemoryOS baseline | +28.5% using only 152 training examples | Memory-R1 (2025) | 2025 |

AMA-Bench (Agent Memory)

Long-horizon agent memory over machine-generated interaction logs spanning SQL queries, web navigation, and programmatic environments (Metric: Average Accuracy)

| Rank | Method | Score | Notes | Paper | Year |
| --- | --- | --- | --- | --- | --- |
| 🥇 | GPT-5.2 (long-context, upper bound) | 72.26% | Frontier model still far from perfect, indicating significant room for improvement | AMA-Bench (2026) | 2026 |
| 🥈 | AMA-Agent (causality graph + tool-augmented retrieval) | 57.22% | +11.16% over strongest memory system baseline | AMA-Bench (2026) | 2026 |

ALFWorld (Household Tasks)

Interactive household task completion requiring procedural memory and multi-step planning in text-based environments (Metric: Success Rate)

| Rank | Method | Score | Notes | Paper | Year |
| --- | --- | --- | --- | --- | --- |
| 🥇 | MACLA (contrastive procedural memory) | 90.3% | +3.1% positive generalization gap on unseen tasks | MACLA (2025) | 2025 |
| 🥈 | UMEM (unified extraction + management) | 82.84% | Monotonic performance growth during continuous evolution | UMEM (2026) | 2026 |

GAIA (General AI Assistants)

General AI assistant capabilities requiring multi-step reasoning, tool use, and web browsing with persistent memory (Metric: Pass@3)

| Rank | Method | Score | Notes | Paper | Year |
| --- | --- | --- | --- | --- | --- |
| 🥇 | Memento (memory-augmented MDP) | 87.88% Pass@3 | +4.7 to +9.6% over baselines on out-of-distribution tasks | Memento (2025) | 2025 |
| 🥈 | AGENTKB + smolagents | 73.9% | +18.7pp over smolagents baseline (55.2%) | AGENTKB (2025) | 2025 |

PersonaMem (Dynamic Personalization)

Tracking evolving user personas over long interaction histories up to 1M tokens with both explicit and implicit preference signals (Metric: Multiple-Choice Accuracy)

| Rank | Method | Score | Notes | Paper | Year |
| --- | --- | --- | --- | --- | --- |
| 🥇 | PersonaMem-v2 (RL agentic memory) | 55% on implicit personalization | Outperforms GPT-5 (~40-48%) with 16× fewer tokens | PersonaMem-v2 (2025) | 2025 |
| 🥈 | Frontier models (GPT-4.5, Gemini-1.5, o1) | ~50% on PersonaMem | Only ~25 percentage points above the 25% random baseline | Know Me, Respond to Me:... (2025) | 2025 |

📊 Topic Distribution

| Topic | Count | Share |
| --- | --- | --- |
| Linear Memory | 23 | 6.0% |
| Layered Memory | 58 | 15.1% |
| Tree Graph Memory | 26 | 6.8% |
| Memory Internalization | 24 | 6.2% |
| Memory Consolidation | 18 | 4.7% |
| Sparse Memory QA | 5 | 1.3% |
| Dense Memory QA | 2 | 0.5% |
| Agentic Memory Architecture | 3 | 0.8% |
| Experience Replay Reflection | 6 | 1.6% |
| Memory Augmented Planning | 9 | 2.3% |
| Multi Agent Memory | 4 | 1.0% |
| Agent Memory Evaluation | 1 | 0.3% |
| Memory Organization | 56 | 14.5% |
| Memory Recall | 23 | 6.0% |
| Memory For Agents | 19 | 4.9% |
| Other | 144 | 37.4% |
| Conversational Dialogue Memory | 24 | 6.2% |
| Long Context Memory | 54 | 14.0% |
| Continual Learning | 21 | 5.5% |
| Personalized Memory | 44 | 11.4% |
| Cognitive Human Memory | 20 | 5.2% |
| Memory Efficiency | 30 | 7.8% |
| Embodied Robotic Memory | 14 | 3.6% |
| Analysis | 75 | 19.5% |
| Benchmark | 39 | 10.1% |
| Application | 21 | 5.5% |
| Survey | 20 | 5.2% |

📚 Glossary of Terms (276 terms)
Accelerator Affinity
The property that different computational tasks (e.g., embedding generation, LLM inference) perform best on specific hardware accelerators (CPU, GPU, NPU) depending on their computational profile.
Action-Coupled Evaluation
An evaluation paradigm that tests memory by measuring its impact on task performance, rather than testing recall and action independently.
Actor Model
A concurrency pattern where independent processes (actors) communicate through messages, each with isolated state—applied to multi-agent systems for memory isolation between agents.
Agent Memory
The ability of an AI agent to store, retrieve, and apply information from past interactions to inform future decisions and actions.
Agentic Operating System
A system-level layer that manages agent lifecycles, memory, permissions, and inter-agent coordination, analogous to how a traditional OS manages processes and resources.
Agentic Skill
A reusable procedural module that packages a policy (how to act), applicability conditions (when to use it), termination criteria (when to stop), and a callable interface, persisting across tasks unlike one-off plans.
Associative Memory
A memory system (like Hopfield networks) that stores patterns and retrieves the closest stored pattern given a partial or noisy input cue, mimicking how humans recall related concepts from partial triggers.
Attention Entropy
A measure of uncertainty in the attention distribution; high entropy means the model spreads attention broadly across many positions rather than focusing on a few relevant ones.
Attention Sink
A token (often a special prefix) that absorbs excess attention probability, preventing the model from distributing attention chaotically across irrelevant positions.
bAbI QA Tasks
A set of 20 synthetic question-answering tasks designed to test specific reasoning abilities (e.g., spatial reasoning, counting, path finding) over short text stories.
Backward Transfer
The effect of learning new tasks on the performance of previously learned tasks. Negative backward transfer indicates forgetting; positive backward transfer indicates that new learning improves old task performance.
BDI Architecture
Belief-Desire-Intention model from cognitive science, representing agents with beliefs (world view), desires (potential goals), and intentions (committed action plans).
BDI Model (Belief-Desire-Intention)
A cognitive architecture that models agents through their beliefs (what they know), desires (what they want), and intentions (what they plan to do), used for realistic user simulation.
Brevity Bias
The tendency of LLM-based context optimization to favor short, generic summaries over detailed domain-specific content, losing important nuances.
Cache Coherence
A protocol ensuring that multiple caches (or agents' working memories) holding copies of the same data remain synchronized, borrowed from multiprocessor hardware design.
Cascade Pruning
A multi-stage filtering process that progressively eliminates irrelevant data sources early (e.g., discarding music streams when answering shopping questions), reducing the data volume the downstream model must process.
Catastrophic Forgetting
The tendency of neural networks to abruptly lose previously learned knowledge when trained on new data, a central challenge in lifelong and continual learning.
Causal Attribution
Measuring the causal effect of specific input features (e.g., user history) on model output by comparing predictions with and without those features present.
Causality Graph
A structured representation of agent interactions that preserves cause-and-effect relationships between events, enabling retrieval based on logical dependencies rather than just semantic similarity.
Chow-Liu Tree
A maximum spanning tree that approximates the joint probability distribution of random variables using pairwise mutual information, used to optimize the processing order of text chunks.
Class-Incremental Learning (CIL)
A continual learning setting where new classes are introduced over time and the model must classify all seen classes at test time without knowing which task an input belongs to — the hardest standard setting.
Coarse-to-Fine Policy
A robotic control strategy that first predicts an approximate action at coarse resolution, then refines it at finer resolution, improving precision without excessive computation.
Cognitive Debt
The accumulated reduction in a person's cognitive abilities (memory, critical thinking) resulting from prolonged offloading of mental tasks to AI systems.
Cognitive Degradation
A class of internal agent failures—including memory starvation, planner recursion, and context flooding—that cause silent drift in agent behavior over time, distinct from external adversarial attacks.
Cognitive Load Theory (CLT)
A framework from educational psychology proposing that learning is hindered when working memory is overloaded. Applied to LLMs, it suggests that irrelevant context consumes finite attention capacity.
Cognitive Memory
Higher-order memory abilities beyond factual recall, including inferring preferences from behavior, applying implicit constraints, and reasoning about how user states evolve over time.
Cognitive Type Safety
A programming language property where the compiler enforces structural validity of LLM inputs and outputs at compile time, preventing schema mismatches and context pollution.
Cold-Start Problem
The challenge of providing personalized responses for new users who have little or no interaction history, requiring the system to make reasonable defaults.
Concept Drift
A change in the underlying data distribution over time in streaming settings, requiring models to adapt to new statistical patterns while retaining relevant prior knowledge.
Constitutional Memory
A governance framework where core identity memories are protected as immutable or require strict protocols to modify, ensuring agent identity persists across model upgrades.
Constraint Consistency
An evaluation approach that checks whether a model's response respects implicit behavioral constraints rather than matching a specific reference answer.
Context Channel Capacity
The mutual information between a continual learning architecture's context signal and its generated parameters, determining the theoretical upper bound on task diversity it can learn without forgetting.
Context Collapse
The progressive loss of critical details when an LLM repeatedly rewrites or summarizes its own context, leading to information degradation over time.
Context Drift
The tendency of LLMs to gradually lose track of the original topic in extended conversations as the context fills with diverse or tangentially related information.
Context Engineering
The discipline of designing, structuring, and managing the entire informational environment in which an AI agent makes decisions, encompassing memory management, process isolation, and cost optimization.
Context Flooding
When too much retrieved information is injected into an agent's context window, overwhelming its reasoning capacity and degrading output quality.
Context Rot
The phenomenon where LLM performance degrades as input context grows longer, even within the model's nominal context window, due to attention distraction and information dilution.
Context Switching
Borrowed from OS design: the process of saving the current agent's generation state (including partial outputs and memory pointers) and loading another agent's state, enabling concurrent multi-agent execution on shared LLM resources.
Context Window
The maximum amount of text (measured in tokens) that an LLM can process in a single inference call—analogous to the working memory available for a given conversation turn.
Contextual Heads
A sparse subset of attention heads in a Transformer that are primarily responsible for focusing on task-relevant information within the context window.
Contextual Isolation
A failure mode of flat retrieval systems where structurally linked but semantically distinct memories (e.g., a schedule conflict causing stress) cannot be connected because they lack direct vector similarity.
Contextual Memory
Explicit external information (text, vectors, structured data) stored outside the model and retrieved into the context window when needed.
Continual Learning
A learning paradigm where a model is trained on a sequence of tasks over time, with the goal of learning new tasks without forgetting previously acquired skills.
Cross-Framework Knowledge Transfer
The ability to take problem-solving experience accumulated in one agent framework and apply it to improve performance in a completely different agent framework.
Cross-Session Memory Poisoning
A vulnerability where hallucinated or incorrect content stored in an agent's persistent memory is retrieved and reused in future sessions, propagating errors across interactions.
CRUD Operations
Create, Read, Update, Delete — the four fundamental operations for managing data. AtomMem decomposes memory management into these atomic actions.
Cue-Trigger Disconnect
A situation where the query (trigger) has no surface-level semantic similarity to the relevant stored memory (cue), causing standard retrieval methods to fail.
Cue-Trigger Semantic Disconnect
A benchmark design where the evaluation query (trigger) has no lexical overlap with the relevant memory (cue), forcing the model to rely on deeper understanding rather than keyword matching.
Decision Attribution Analysis
A technique that identifies exactly which reasoning step in an agent's execution trace caused a failure or inefficiency, enabling targeted memory extraction.
Demand Paging
An operating system technique where data is loaded into fast memory only when needed rather than preloaded; applied to LLMs by evicting rarely-used context to external storage and retrieving it on demand.
Dense Memory
A personal memory store with many highly similar or near-duplicate entries (e.g., daily photos of the same locations, repeated calendar events), making retrieval and deduplication particularly challenging.
Diffusion Language Model (dLLM)
A text generation model that starts from a fully masked sequence and iteratively 'denoises' it by predicting and filling in tokens over multiple steps, enabling parallel token generation unlike sequential autoregressive models.
Disagreement Gate
A filtering mechanism that checks whether retrieved knowledge conflicts with the agent's current reasoning, preventing harmful interference from irrelevant past experiences.
Distributed Memory
A memory architecture where each agent maintains its own local memory and explicitly exchanges data with other agents through message passing.
Document-Mutation Coordination
A communication model where agents interact by writing structured updates to shared documents rather than sending direct messages, creating an automatic audit trail of all actions.
Domain-Incremental Learning
A setting where the input distribution changes over time (e.g., different image styles) but the set of output classes remains the same, requiring adaptation to new domains without forgetting old ones.
Dual Memory (Short-term / Long-term)
A memory architecture inspired by human cognition that separates recent, task-specific information (short-term) from distilled, generalizable knowledge accumulated over many experiences (long-term).
Dual-Memory Model
A cognitive science concept distinguishing two types of long-term memory (episodic and semantic), adapted here as an architectural pattern for organizing agent memory systems.
Dual-Process Theory
A cognitive science framework distinguishing fast, automatic processing (System 1 / Familiarity) from slow, deliberate reasoning (System 2 / Recollection).
Dual-Reward RL
A reinforcement learning setup (from MemPO) that combines a sparse trajectory-level reward (was the final answer correct?) with a dense step-level reward (does this memory help generate the correct answer?).
Dual-Stream Architecture
A model design with two parallel information pathways — standard Transformer attention and an explicit memory stream — that merge via learned gates only when beneficial.
Ebbinghaus Forgetting Curve
A model from cognitive psychology showing that memory retention decays exponentially over time unless reinforced by periodic review (spaced repetition).
Elastic Memory
A memory system that dynamically adjusts its size and granularity based on task demands, compressing history into abstractions when context is scarce and expanding details when needed.
Embedding Space Poisoning
An adversarial attack that manipulates the vector representation space used for retrieval, causing the system to return attacker-chosen content for triggered queries.
Energy-Based Routing
A mechanism that selects processing pathways by minimizing an energy function (as in Hopfield networks) rather than using gradient-based optimization, enabling instant per-sample adaptation.
Entity Memory
A dedicated memory layer containing separate learned embeddings for individual entities (people, places, etc.), enabling direct lookup rather than reconstructing entity knowledge from sub-word tokens.
Entropy Drift
A measurable signal indicating that an agent's outputs are becoming increasingly random or uncertain over time, used as a diagnostic indicator for cognitive degradation.
Episodic Memory
Memory of specific past events or interactions stored with temporal context, allowing an agent to recall what happened and when.
Exact Match (EM)
A QA evaluation metric where a prediction is scored correct only if it exactly matches the ground-truth answer string, after normalization.
Experience Distillation
The process of extracting structured, generalizable reasoning strategies from raw interaction trajectories, as opposed to storing complete interaction logs verbatim.
Experience Replay
A technique where an agent stores past interactions in a memory buffer and re-uses them during training or inference to improve learning efficiency and prevent forgetting.
Exponential Gating
A gating mechanism (from xLSTM) that uses exponential functions instead of sigmoid, enabling sharper focus on relevant information and the ability to revise previous storage decisions.
EXTRACT Operator
In ReQAP, an operator that uses small language models to parse unstructured text and dynamically populate virtual table columns, bridging the gap between free-text and structured query execution.
Feature Collision
A problem in expansion-based continual learning where new task-specific features accidentally overlap with frozen features from previous tasks, causing interference and performance degradation.
Feed-Forward Network (FFN) Expert
A small neural network within an MoE layer that processes a subset of inputs; in MoWE, each FFN expert is assigned to specific words or entities.
FIFO Buffer
A First-In-First-Out memory buffer that stores the most recent experiences, automatically discarding the oldest entries when capacity is reached.
FlashAttention
An IO-aware attention algorithm that tiles computation to minimize memory reads/writes between GPU SRAM and HBM, enabling efficient long-sequence attention without materializing the full attention matrix.
Focus Directions
Vectors in the key/query activation space of contextual heads that, when added at inference time, steer the model to attend more to relevant context passages.
Forward Transfer
The extent to which knowledge from previously learned tasks helps the model learn new tasks more efficiently or accurately.
Framework-Agnostic Schema
A standardized data format that represents agent execution traces in a way that is independent of any specific agent framework, enabling cross-system knowledge sharing.
Functional Lateralization
The specialization of different memory banks for different types of tasks (e.g., episodic vs. rule-based), inspired by the left-right hemisphere specialization observed in biological brains.
Fusion-in-Decoder (FiD)
A method where multiple retrieved documents are each encoded independently and then combined in the decoder, enabling the model to process many documents without exceeding input length limits.
Fuzzy-Trace Theory
A cognitive science theory positing that humans encode experiences as both verbatim traces (exact details) and gist traces (semantic meaning), with gist traces being more durable and often preferred for reasoning.
GAIA Benchmark
A benchmark for evaluating general AI assistants on multi-step tasks requiring reasoning, web browsing, and tool use, commonly used to compare agent frameworks.
GaLore
Gradient Low-Rank Projection: a training method that projects gradient matrices into low-rank subspaces before the optimizer step, reducing memory for optimizer states without restricting weight expressiveness.
Gated Memory
A neural memory mechanism where input, forget, and output gates (inspired by LSTMs) control what information flows into, persists in, and flows out of the memory bank.
Gated Routing
A mechanism that compares an input query embedding to stored memory keys and selectively activates only the relevant memory modules via learned gates, preventing cross-memory interference.
Generative Semantic Workspace (GSW)
A memory representation that transforms raw text into networks of atomic QA pairs or evolving situation states, rather than storing verbatim text chunks. Enables precise reasoning chains and narrative tracking.
Gist Memory
A compressed, fuzzy summary of content that preserves the overall meaning and narrative flow without retaining exact wording, inspired by how humans remember the essence of what they read.
Goodput
The throughput of requests that actually meet a Service-Level Objective (e.g., p99 latency target), as opposed to raw throughput, which counts all completed requests regardless of whether they met latency requirements.
GQA (Grouped Query Attention)
An attention variant where each group of query heads shares a single key-value head, reducing KV cache size in proportion to the grouping factor.
Ground-Truth Anchoring
A memory governance technique that periodically reconciles evolved (potentially drifted) memory against an immutable record of original observations to correct accumulated errors.
GRPO (Group Relative Policy Optimization)
A reinforcement learning algorithm that optimizes policies by comparing outcomes within groups of attempts, used to train memory management decisions.
Hadamard Product
Element-wise multiplication of two matrices (as opposed to standard matrix multiplication). Used in memory architectures for efficient, parallelizable memory updates.
Hard-Concrete Relaxation
A stochastic method for making discrete binary decisions (keep/prune) differentiable during training by sampling from a stretched and clamped distribution, allowing gradient-based optimization of pruning masks.
Heteroassociative Memory
A memory system where the representation used for storage (value) differs from the representation used for retrieval (key), allowing independent optimization of both.
Hidden State Poisoning Attack (HiSPA)
An attack on recurrent models (like Mamba/SSMs) that corrupts the persistent hidden state through adversarial input strings, causing downstream amnesia or manipulated outputs.
Hopfield Network
A classical neural network model for associative memory where stored patterns are local minima of an energy function. Modern variants (Dense Associative Memory) have exponential storage capacity and connections to transformer attention.
Hopfield Network / Hopfield Pooling
An energy-based associative memory model that retrieves stored patterns via energy minimization; modern continuous variants are used for input-conditioned routing in transformers.
Hub Tokens
Special tokens in ARACH that run parallel to the main sequence, aggregating context summaries without position encoding and serving as a dynamic working memory buffer.
HyperNetwork
A neural network that generates the weights of another network based on a context input (e.g., task embedding), treating model parameters as function outputs rather than stored state.
Implicit Preference
User preferences revealed through behavior and choices rather than explicit statements (e.g., always choosing vegetarian options without saying 'I am vegetarian').
Impossibility Triangle
A proven result showing that zero forgetting, online learning (single-pass data), and finite parameters cannot all be achieved simultaneously by sequential state-based learners.
Intent Legitimation
A safety failure where retrieved benign personal memories provide apparent justification for harmful queries, causing the model to bypass safety filters without requiring adversarial prompts.
Interdependent Subtasks
Tasks that are linked such that successful completion of later subtasks depends on information or outcomes from earlier ones.
Internal State (<IS>)
In MEM1, a single evolving text representation that replaces the full interaction history. Updated at every turn via RL-trained consolidation, it serves as the model's working memory.
Inverse Scaling
A counterintuitive phenomenon where more capable models perform worse on certain tasks, observed in personalization, where stronger models are more susceptible to irrelevant preference signals.
JIT Memory Compilation
A just-in-time approach to memory that delays heavy synthesis until query time, building custom contexts from raw history on demand rather than pre-computing static summaries.
Key-Value (KV) Cache
In transformer models, a cache storing the key and value tensors from previous steps so they don't need to be recomputed, enabling efficient autoregressive generation with historical context.
KILT Benchmark
Knowledge Intensive Language Tasks—a unified benchmark covering multiple knowledge-grounded NLP tasks including QA, fact verification, and dialogue, with a shared Wikipedia knowledge source.
kNN-LM (k-Nearest Neighbor Language Model)
A model that augments neural language model predictions by interpolating with a distribution computed from the k most similar contexts in an external datastore, enabling explicit memory access.
Knowledge Distillation
A technique where a large, capable model generates training data or soft labels that are used to train a much smaller model, transferring knowledge while reducing computational requirements for deployment.
Knowledge Graph
A structured representation of information as entities (nodes) and relationships (edges), used in memory systems to store and retrieve user facts and preferences with explicit semantic connections.
Knowledge Graph (KG)
A structured representation of knowledge as a network of entities (nodes) connected by relationships (edges). In memory systems, KGs organize extracted facts for graph-based retrieval.
KV Cache
Key-Value cache stores the intermediate attention computations for previously processed tokens, allowing the model to avoid redundant computation during autoregressive generation.
KV-Cache
Key-Value Cache: the stored key and value representations from previous tokens in a Transformer model, used during autoregressive generation to avoid recomputing attention over past tokens. Its size grows linearly with sequence length.
LaMP Benchmark
Language Model Personalization benchmark consisting of 7 tasks (classification and generation) designed to evaluate how well LLMs adapt outputs to individual user profiles.
Latent Task State
Implicit information about a task's progress and constraints that must be maintained across interactions, even when not explicitly restated.
Latent-Space Memory
Memory stored as continuous vector representations within the transformer's hidden state space, enabling the model to read and write memories through its own attention mechanism.
Lateral Inhibition
A mechanism that suppresses the activation of highly connected hub nodes in a memory graph to prevent common concepts from flooding retrieval results.
LDPC Codes
Low-Density Parity-Check codes, a class of error-correcting codes with sparse parity-check matrices that enable efficient decoding in both classical and quantum error correction.
Least-Privilege Execution
A security principle where agents are confined to restricted environments (e.g., Linux namespaces) and can only invoke pre-approved skills, preventing unauthorized memory or system access.
Locality-Sensitive Hashing (LSH)
A hashing technique that maps similar items to the same bucket with high probability, enabling approximate nearest-neighbor search without expensive pairwise comparisons.
Long-Context Model
A language model capable of processing very long input sequences (e.g., 100k+ tokens), allowing it to consider extensive prior interaction history without explicit memory retrieval.
Long-term Memory
The ability of an AI system to retain, update, and retrieve information across multiple interaction sessions, persisting beyond a single conversation.
Loop Closure
The ability to recognize that a camera has returned to a previously visited location, which is critical for maintaining consistent maps and reconstructions over long trajectories.
LoRA
Low-Rank Adaptation: a parameter-efficient fine-tuning method that freezes pre-trained weights and adds trainable low-rank matrices, reducing memory and compute but constraining the update space.
LoRA (Low-Rank Adaptation)
A parameter-efficient fine-tuning method that adds small trainable low-rank matrices to frozen model layers, enabling task-specific or user-specific adaptation with minimal additional parameters (typically <1% of model size).
LoRA Adapter
A Low-Rank Adaptation module that adds a small number of trainable parameters to a frozen LLM, used in memory systems to encode specific knowledge without modifying the base model.
Machine Unlearning
The process of selectively removing specific knowledge or data influence from a trained model, often motivated by privacy regulations like GDPR, without requiring full retraining.
Marginal Utility Reward
A reward signal that measures the net benefit of adding or updating a memory entry by comparing task performance with and without the proposed change.
Markov Assumption
The assumption that the current observation contains all information needed to select the next action, making history irrelevant. Violated in tasks where past events affect current decisions.
Matroid
A mathematical structure that generalizes the concept of independence (like linear independence), used here to model valid subsets of user data satisfying structural constraints.
MemCube
A standardized memory container (from MemOS) that encapsulates any memory type (weights, KV-caches, documents) along with governance metadata such as access permissions, expiration, and format transition rules.
Memory Augmentation
The process of enriching raw stored data (e.g., images) with additional structured metadata (e.g., captions, OCR text, timestamps) to make it more accessible for downstream reasoning.
Memory Consistency
The guarantee that when one agent updates shared information, other agents will see that update in a predictable and timely manner, avoiding stale or contradictory reads.
Memory Consolidation
The process of reorganizing, deduplicating, and strengthening memories during offline periods (analogous to sleep), converting short-term observations into stable long-term knowledge.
Memory Distillation
The process of condensing raw interaction experiences into structured, reusable knowledge that can guide future agent decisions.
Memory Eviction
The process of removing less important items from active memory to make room for new information, analogous to cache eviction in computer systems.
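A minimal sketch of eviction from a fixed-capacity store is shown below. It assumes a least-recently-used (LRU) policy, which is only one possible notion of "less important"; the class and method names are hypothetical.

```python
from collections import OrderedDict


class LRUMemory:
    """Fixed-capacity store that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def read(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # reading marks the entry as recently used
        return self._items[key]

    def write(self, key, value):
        """Store a value; return the evicted key, or None if nothing was evicted."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            evicted_key, _ = self._items.popitem(last=False)
            return evicted_key
        return None

    def keys(self):
        return list(self._items)
```

A learned eviction policy would replace the recency ordering with a scoring function, but the read, write, and evict interface could stay the same.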
Memory Extraction Attack
A privacy attack that attempts to recover private information stored in an LLM agent's memory by crafting prompts that cause the model to reveal previously stored user data.
Memory Injection (Memory Injection Attack)
An adversarial technique where false or malicious information is planted into an agent's persistent, long-term memory store to manipulate its future behavior, causing it to produce harmful outputs in later interactions.
Memory Overwrite
A strategy where new information replaces existing entries in a fixed-size memory buffer, as opposed to appending to a growing log. The agent must decide which entries to overwrite.
Memory Pool
A fixed-size collection of trainable vectors embedded within transformer layers that serves as an internal knowledge store, updated through attention-based compression of input information.
Memory Quadruple
A four-part descriptor (storage location, persistence, write/access path, controllability) proposed for rigorously defining different types of LLM memory.
Memory Rewriting
The ability to selectively overwrite or update stored information when it becomes outdated, as opposed to simply accumulating new memories on top of old ones.
Memory Starvation
A condition where an agent's memory subsystem fails to retain or retrieve critical information, causing it to lose context needed for effective reasoning.
Memory Wall
The growing disparity between processor compute speed and memory bandwidth, which causes memory access to dominate execution time in data-intensive workloads like LLM inference.
Memory Weaver
A generative module in MemGen that produces machine-native latent memory tokens on-the-fly when triggered, treating memory recall as active reconstruction rather than static retrieval.
Memory-Augmented Language Model
A language model that supplements its fixed parameters with an external memory store it can read from (and sometimes write to) during generation, enabling access to more information than fits in its context window.
Memory-Augmented MDP
A Markov Decision Process where the agent's state includes both the environment and a retrieval-based memory bank, allowing RL to optimize what past experiences to recall.
Memory-aware Test-Time Scaling (MaTTS)
A technique that uses retrieved memory items to guide diverse exploration at inference time, generating multiple candidate solutions in parallel or sequentially and selecting the best one based on contrastive signals.
MemoryArena
A benchmark introduced in 2026 that evaluates agent memory through interdependent multi-session tasks across four domains: shopping, travel, search, and reasoning.
Memristor
A two-terminal electronic device whose resistance depends on the history of current flow, enabling non-volatile analog storage useful for in-memory computing and neuromorphic hardware.
Mixture-of-Experts (MoE)
A model architecture that contains many specialized sub-networks ('experts') but activates only a few for each input, allowing large total capacity with manageable computation.
Model Context Protocol (MCP)
A standardized interface that allows diverse AI agents to connect to external data sources and tools in a uniform way, acting as a universal adapter for context sharing.
Model Editing
Techniques for updating specific facts or knowledge in a trained model post-deployment without full retraining, ideally affecting only the targeted knowledge while preserving everything else.
Monte Carlo Tree Search (MCTS)
A search algorithm that builds a decision tree by randomly sampling possible action sequences and using the results to guide exploration toward promising solutions.
Mood-Congruent Memory
A psychological principle that people recall memories more easily when their current emotional state matches the emotion encoded in the memory; applied to AI retrieval systems to improve relevance.
Multi-Hop Attention
An architecture where the model reads from memory multiple times in sequence, refining its understanding of which memories are relevant after each pass.
Multi-Hop Reasoning
A reasoning process requiring multiple sequential steps of evidence gathering and inference, where each step builds on the previous one to reach a conclusion.
Multi-Objective Optimization (MOO)
An optimization approach that simultaneously optimizes multiple competing objectives, seeking Pareto-optimal trade-offs rather than a single best solution.
Multi-Session Reasoning
The ability to combine and reason over information distributed across multiple separate conversation sessions rather than within a single session.
Multi-Session Task
A task that spans multiple separate interaction sessions, where information from earlier sessions may be needed to complete later ones.
Multi-Signal Retrieval
A retrieval approach that scores candidates along multiple dimensions (e.g., semantic similarity, time recency, location matching) rather than relying on a single embedding-based similarity measure.
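One way such scoring could look is a weighted sum of per-signal scores. The sketch below is illustrative only: the weights, the 24-hour recency half-life, and the item fields (`embedding`, `age_hours`, `location`) are assumptions, not part of any specific system.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(age_hours, half_life_hours=24.0):
    # Exponential decay: the score halves every `half_life_hours`.
    return 0.5 ** (age_hours / half_life_hours)


def multi_signal_score(query, item, weights=(0.6, 0.3, 0.1)):
    w_sem, w_time, w_loc = weights
    semantic = cosine(query["embedding"], item["embedding"])
    fresh = recency(item["age_hours"])
    same_place = 1.0 if query.get("location") == item.get("location") else 0.0
    return w_sem * semantic + w_time * fresh + w_loc * same_place


def retrieve(query, items, k=2):
    ranked = sorted(items, key=lambda it: multi_signal_score(query, it), reverse=True)
    return ranked[:k]
```

With these weights, a fresh, co-located item with moderate semantic similarity can outrank an exact semantic match that is many days old, which is the behavior a single embedding score cannot express.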
Multimodal RAG
Retrieval-Augmented Generation that operates over multiple data types (images, text, audio), retrieving relevant items by embedding similarity and feeding them to a generative model to produce answers.
Needle-in-a-Haystack
An evaluation paradigm where a specific piece of information is embedded within a large volume of irrelevant text, testing the model's ability to locate and use it.
Noise-Injected Training
A training strategy where irrelevant or misleading retrieved items are deliberately included during training, teaching the model to ignore noise and focus on genuinely relevant context.
Non-Parametric Memory
Knowledge stored externally in databases, vector stores, or text collections and retrieved at inference time, as in RAG systems, rather than being embedded in model parameters.
NUMA (Non-Uniform Memory Access)
A computer architecture where memory access time depends on the memory location relative to the processor, requiring locality-aware data placement for performance.
OpenIE (Open Information Extraction)
An NLP technique that extracts structured (subject, relation, object) triples from unstructured text without a predefined schema. Used by HippoRAG and others to populate knowledge graphs.
Page-Indexed Memory
A memory organization that replaces vector embeddings with a tree of human-readable manifest files, allowing agents to navigate stored information like a structured document rather than searching by similarity.
PagedAttention
An attention algorithm that manages KV cache in non-contiguous memory blocks (pages), inspired by OS virtual memory, to eliminate fragmentation during LLM serving.
Parameter-Efficient Fine-Tuning (PEFT)
Techniques like LoRA, adapters, and prompt tuning that update only a small fraction of a model's parameters during fine-tuning, reducing computational cost and potentially limiting forgetting.
Parametric Memory
Knowledge stored directly in a model's learned weights and biases, accessed implicitly through forward passes rather than through explicit lookup in external stores.
Pareto Frontier
The set of solutions where no objective can be improved without degrading another, representing the best achievable trade-offs between competing goals.
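The definition can be made concrete with a small non-dominated filter. This sketch assumes every objective is to be maximized and uses a simple quadratic-time scan; the function names are illustrative.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_frontier(points):
    """Return the points that no other point dominates (higher is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


candidates = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
print(pareto_frontier(candidates))
```

Here (2, 2), (1, 1), and (3, 1) are all dominated by (3, 3), so only the first three candidates remain.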
PEFT (Parameter-Efficient Fine-Tuning)
A family of methods including LoRA, adapters, and prompt tuning that fine-tune only a small subset of model parameters while keeping the base model frozen, reducing compute and storage costs.
Pensieve Paradigm
A memory management approach (from StateLM) where the model actively extracts and stores key information while deleting raw source material, maintaining a compact working memory.
Perceptual Drift
The gradual degradation of generated content quality over long autoregressive rollouts in world models, where small errors compound and the model 'forgets' earlier context like room layouts.
Perplexity
A standard metric for language models measuring how surprised the model is by test data; lower perplexity indicates better prediction of the next token in a sequence.
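Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. A minimal sketch, assuming the per-token probabilities are already available as a list:

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to each observed token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every observed token has a perplexity of about 4, as if it were choosing uniformly among four options at each step.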
Persistent Memory
A mechanism for storing information from past agent interactions that survives across sessions, enabling the agent to recall and build on previous experiences.
Persona
A structured representation of a user's traits, preferences, background, and behavioral patterns used by an AI system to generate personalized responses.
Personalization-Induced Hallucination
A failure mode where personalized LLMs generate answers aligned with a user's historical biases rather than objective facts, caused by entanglement of user preference and factual representations.
Personalized PageRank (PPR)
A graph algorithm that calculates the importance of nodes relative to a specific starting point by simulating random walks, used to find contextually relevant memories in knowledge graphs.
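The random-walk view can be sketched with power iteration: at each step the walker restarts at a seed node with probability `alpha`, otherwise it follows an outgoing edge. The graph format (a dict of adjacency lists in which every node appears as a key), the restart probability of 0.15, and the handling of dead-end nodes are assumptions made for this illustration.

```python
def personalized_pagerank(graph, seeds, alpha=0.15, iterations=50):
    """graph: dict node -> list of out-neighbors; seeds: set of restart nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = (1.0 - alpha) * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Assumption: a walker stuck at a dead end jumps back to the seeds.
                for s in seeds:
                    nxt[s] += (1.0 - alpha) * rank[n] / len(seeds)
        rank = nxt
    return rank
```

Nodes that cannot be reached from the seeds keep a score of zero, which is what makes the ranking relative to the starting point rather than global.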
Personalized Thinking
A slow-thinking strategy where an LLM generates user-aligned reasoning traces (via self-distillation) before producing a final response, ensuring the output reflects the user's internalized beliefs and style.
Pipeline Parallelism
A distributed training technique that splits a model across devices by layers, processing different micro-batches on different devices simultaneously. 'Bubbles' refer to idle time when devices wait for data.
Plan Template
A reusable, context-stripped high-level plan for an agent task that can be filled in with specific details at inference time, avoiding the cost of replanning from scratch.
Planner Entrapment
A failure mode where an agent's planning module enters recursive goal loops, repeatedly attempting the same strategy without progress, undetected by standard safety layers.
Plasticity-Stability Trade-off
The fundamental tension between an agent's ability to learn new information (plasticity) and its ability to retain old knowledge (stability).
POMDP (Partially Observable Markov Decision Process)
A mathematical framework for sequential decision-making under uncertainty where the agent cannot fully observe the environment's state and must maintain probabilistic beliefs about it; used to model how agents should manage memory under incomplete information.
Post-Thinking
A memory maintenance stage where, after generating a response, the agent analyzes its own output to decide what new thoughts to store, what old thoughts to forget, and what to merge.
Predictive Disentanglement
A property of Memory Mosaics where different memory heads naturally specialize in predicting different aspects of the output, making the model's internal representations interpretable.
Predictive Pre-fetching
A technique where a background process anticipates future information needs and retrieves relevant data before it is explicitly requested, reducing wait times during planning.
Prefill / Decode Phases
The two stages of LLM inference: prefill processes all input tokens in parallel to build the KV cache, while decode generates output tokens one at a time using the cached states.
Procedural Memory
Stored action sequences or workflows that an agent can reuse, similar to 'muscle memory' for how to perform specific tasks.
Processing-in-Memory (PIM)
A hardware paradigm that performs computation directly within memory chips, eliminating the data transfer bottleneck between processor and memory in von Neumann architectures.
Progressive Disclosure Memory
A memory management strategy that reveals stored information incrementally across sessions, maintaining relevant context without overwhelming the agent's context window.
Q-Memory
A memory retrieval approach that assigns learned utility scores (Q-values) to stored experiences, selecting memories based on predicted usefulness rather than text similarity.
Q-Value
In reinforcement learning, the expected cumulative reward for taking a specific action in a given state; used in memory systems to score the utility of retrieving specific memory items.
RAG (Retrieval-Augmented Generation)
A technique where an agent or LLM retrieves relevant documents or past information from an external knowledge store and includes them in the prompt context to augment its response generation.
Reason-Retrieve-Refine Loop
A three-step memory interaction pattern where the agent first reasons about what it needs, retrieves relevant experiences, then refines its plan based on the retrieved information.
Recursive Question Decomposition
Breaking a complex question into a tree of simpler sub-questions and operators (retrieve, extract, aggregate) that are solved step by step, enabling the system to handle queries that require multiple reasoning steps.
Reflective Memory
A memory store specifically designed to hold past self-correction episodes and verified standards, used to ground future reflection in evidence rather than speculation.
Reflexion
A technique where agents use verbal self-critique stored as memory to improve performance on subsequent attempts at the same or similar tasks.
Relative Citation Index (RCI)
A metric that normalizes citation counts by publication age to identify emerging high-impact research topics within a field, filtering out effects of older papers accumulating citations.
Reranker
A secondary model that re-scores and reorders initially retrieved items to improve retrieval precision, often trained with task-specific relevance signals.
Reservoir Sampling
A randomized algorithm for selecting a fixed-size representative sample from a data stream of unknown or growing length, ensuring each item has equal probability of inclusion.
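The classic single-pass version (often called Algorithm R) can be sketched as follows. The optional `rng` parameter is an addition for reproducibility, not part of the algorithm itself.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Only the reservoir is held in memory, so the sample stays at a fixed size no matter how long the interaction log grows.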
Residual Memory
A parallel parameter layer initialized to zero that stores knowledge edits additively, preserving the original pre-trained weights while enabling targeted updates without overwriting.
Retrieval Heads
Specific attention heads within a Transformer that specialize in copying or retrieving information from the context, functioning as the model's primary mechanism for information lookup.
Retrieval-Augmented Generation (RAG)
A technique where relevant documents or memories are retrieved from an external store and injected into the LLM's context to ground its generation in specific, up-to-date information.
RETRIEVE Operator
In ReQAP, a high-recall retrieval step with cascade pruning that fetches relevant records from large heterogeneous data stores based on the sub-question's requirements.
Retrofitting
Continued pre-training of an existing model on a small fraction (2-8%) of its original training data to teach it new behaviors (like cache compression) without full retraining.
RoPE (Rotary Position Embedding)
A widely used positional encoding method, also called Rotary Position Encoding or Rotary Positional Embedding, that encodes token positions by applying rotation matrices to query and key vectors, enabling relative position awareness in attention computations. Its cache memory cost grows linearly with context length.
RowHammer
A DRAM vulnerability where repeatedly accessing a specific memory row causes electrical interference that flips bits in physically adjacent rows, exploitable for security attacks.
Salient Tokens
In the context of selective inference, tokens whose internal representations are actively changing between computation steps, as opposed to stable tokens that can safely reuse cached values.
Scene Graph
A structured representation of a scene as a graph where nodes represent objects and edges represent spatial or semantic relationships between them.
Segmented Paging
An OS-inspired memory organization where conversation turns (pages) are grouped by topic (segments), enabling efficient retrieval at both topic and turn granularity.
Self-Distillation
A training technique where a model learns from its own outputs, using its generated reasoning traces as training signal to improve alignment with desired behavior.
Self-Evolution / Self-Play
A training paradigm where an agent generates its own training data through self-interaction or environmental exploration, then fine-tunes on this self-generated experience to improve over time.
Semantic Cache
A fast-access local store indexed by document embeddings that holds pre-fetched information, enabling sub-millisecond retrieval when a cache hit occurs.
Semantic Drift
The gradual distortion of stored knowledge that occurs when memory is repeatedly summarized, compressed, or rewritten, causing accumulated deviations from the original information.
Semantic Information Bottleneck
An information-theoretic approach that compresses data by retaining only the information that is maximally relevant to a downstream task, discarding irrelevant details while preserving semantic content.
Semantic Memory
Consolidated general knowledge extracted from experiences: facts, rules, and user preferences distilled from episodic memories over time.
Semantic Neighborhood
A cluster of similar queries or tasks used to evaluate whether a stored memory generalizes beyond a single instance, ensuring memories capture patterns rather than noise.
Semantic Neighborhood Modeling
Evaluating a memory's quality not just on the current query but across a cluster of semantically similar queries, ensuring the memory generalizes rather than overfitting to one instance.
Semantic Routing
A dispatch mechanism that maps input embeddings to stored memory modules based on semantic similarity, dynamically activating only the most relevant modules for each query.
Semantic Similarity Retrieval
Finding stored information by computing vector similarity between the query and stored embeddings; effective for related concepts but fails when relevant items lack surface-level overlap.
Sensory Memory
The first layer of a cognitive memory hierarchy that briefly holds raw, unprocessed input and filters out low-value information before passing it to short-term memory.
Shared Memory
A memory architecture where multiple agents read from and write to a common data store, enabling implicit communication without direct message passing.
Sleep-Time Consolidation
An offline memory processing phase (between user interactions) where the system performs expensive operations like deduplication, graph densification, and conflict resolution without affecting response latency.
Sleeper Injection
A memory-layer attack where malicious instructions are planted in an agent's persistent storage and remain dormant until triggered by a future benign query.
Sliding Window Attention
An attention mechanism where each token only attends to a fixed number of nearby tokens, reducing complexity from quadratic to linear in sequence length (for a fixed window size) but losing access to distant context.
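The attention pattern can be pictured as a banded mask. The sketch below assumes a causal variant in which each token sees itself and the `window - 1` tokens before it; bidirectional variants would use a symmetric band instead.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_mask(5, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row holds at most `window` allowed positions, so the number of attended pairs is bounded by `seq_len * window` rather than `seq_len ** 2`.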
Social Knowledge Base
A memory store built from group interaction history (messages, reactions, norms) that allows an agent to adapt its behavior to match the social context of a specific community.
Soft Compression
Compressing text into learned continuous vector representations (rather than shorter text), which the model can process but which are not human-readable.
Soft Prompt
Continuous vector embeddings prepended to model input that steer behavior, used to inject personalization signals without modifying model weights.
Softmax Attention
A mechanism that assigns continuous probability weights to memory entries based on relevance, allowing differentiable (gradient-based) training of memory access.
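A minimal sketch of a soft memory read: dot-product relevance scores are normalized with softmax, and the output is the weighted average of the stored values. Plain lists stand in for tensors here, and the function names are illustrative.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Return (weights, output) for a soft read over key-value memory slots."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output
```

Because every slot receives a nonzero weight, the read is differentiable with respect to the query, keys, and values, unlike a hard lookup of the single best match.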
Spaced Repetition
A learning strategy where review intervals increase over time as knowledge becomes more consolidated, optimizing the trade-off between review frequency and retention.
Sparse Attention
An attention mechanism that attends to only a subset of positions rather than all positions, reducing computational cost from quadratic to sub-quadratic in sequence length.
Sparse Autoencoder (SAE)
A neural network that learns compressed, sparse representations of input data, used in personalization to distill preference signals from noisy behavioral embeddings.
Sparse Memory Access
A mechanism where only a small subset of stored memory entries or model parameters are activated for a given query, reducing computation compared to reading all stored information.
Spatial Memory
An external memory structure that stores 3D geometric information (point clouds, poses) to maintain long-term spatial consistency in video generation or 3D reconstruction.
Speculative Decoding
An inference acceleration technique where a fast 'draft' model proposes multiple future tokens that the larger 'target' model verifies in parallel, reducing generation latency.
Spreading Activation
A cognitive-inspired retrieval mechanism where activation energy flows from query nodes through a graph's edges to connected nodes, surfacing associatively related memories.
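A toy version of this flow is sketched below. The decay factor, the equal split of energy among neighbors, and the fixed number of steps are simplifying assumptions; real systems may weight edges or apply activation thresholds.

```python
def spread_activation(graph, seeds, decay=0.5, steps=2):
    """graph: dict node -> list of neighbors; seeds: dict node -> initial energy."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        incoming = {}
        for node, energy in frontier.items():
            neighbors = graph.get(node, [])
            if not neighbors:
                continue
            # Each hop passes on a decayed, evenly split share of the energy.
            share = energy * decay / len(neighbors)
            for neighbor in neighbors:
                incoming[neighbor] = incoming.get(neighbor, 0.0) + share
        for node, energy in incoming.items():
            activation[node] = activation.get(node, 0.0) + energy
        frontier = incoming
    return activation


memory_graph = {"coffee": ["morning", "cafe"], "cafe": ["paris"]}
print(spread_activation(memory_graph, {"coffee": 1.0}))
```

Starting from "coffee", the node "paris" receives activation through "cafe" even though it shares no direct edge with the query node.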
Stability-Plasticity Dilemma
The fundamental trade-off between a model's ability to retain old knowledge (stability) and its ability to learn new information (plasticity). Increasing one typically decreases the other.
Static Memory Benchmark
An evaluation that tests memory through recall or question-answering over past text, without requiring the agent to take actions based on recalled information.
Steerable Reasoning
A reasoning approach where the language model explicitly reflects on selectively retrieved memory to decide which exploration action to take next, rather than reasoning in a fixed pattern.
Structured Pruning
Removing entire architectural components (neurons, attention heads, layers) from a model rather than individual weights, producing genuinely smaller and faster models without needing sparse hardware support.
Surfel (Surface Element)
A point-based 3D representation storing position, normal, and appearance data, used as a lightweight geometric anchor for memory retrieval in video generation.
SWE-bench Lite
A benchmark that evaluates agents on their ability to resolve real software engineering issues from GitHub repositories, testing code understanding and repair capabilities.
Task Completion Rate
The percentage of tasks an agent successfully finishes, used here as a metric that captures whether memory actually improves downstream performance.
Task-Incremental Learning (TIL)
A continual learning setting where the model is told which task an input belongs to at test time, making it easier than class-incremental learning because the model only needs to distinguish within a task.
Task-Oriented Memory Augmentation
The offline process of enriching raw memory data (e.g., images) with structured text annotations like OCR results, captions, and metadata to make them searchable and processable by text-based reasoning systems.
Temporal Event Graph
A structured representation of causally linked life events ordered in time, used to generate consistent long-term dialogue scenarios for evaluation.
Temporal Reasoning
The ability to reason about the order, timing, and causal relationships between events, such as understanding that a recent preference supersedes an older one.
Test-Time Learning
The ability of a model to improve its performance during inference by accumulating and reusing experience from solved tasks, without updating its parameters.
Text-to-SQL
An approach that translates natural language questions into SQL queries to retrieve answers from structured databases; it struggles with unstructured text data.
Textual Gradient
A technique where an LLM generates natural-language feedback (analogous to numerical gradients) about what went wrong, used to iteratively improve prompts or personas without gradient descent.
Three-Layer Memory Hierarchy
An agent memory architecture with three tiers: I/O (interface layer for input/output), Cache (fast working memory for active tasks), and Memory (persistent long-term storage).
Token
The basic unit of text processed by an LLM (roughly 3/4 of a word in English); context window size and inference costs are measured in tokens.
TopHash Retrieval
A retrieval mechanism in MEMOIR that generates sample-dependent sparse binary masks from activation magnitudes to efficiently match new queries to stored edits without dense similarity search.
TriviaQA
A large-scale open-domain QA benchmark containing trivia questions with evidence from Wikipedia and web documents, commonly used to evaluate knowledge retrieval.
TSDF (Truncated Signed Distance Function)
A 3D volumetric representation that encodes the distance from each point in space to the nearest surface, commonly used to fuse multiple depth observations into a consistent 3D model.
TTFT (Time-To-First-Token)
The latency from receiving a request to generating the first output token, a critical metric for interactive applications. Dominated by prefill computation or KV cache loading time.
Virtual Context Management
An approach that treats the LLM context window as fast RAM and external storage as slow disk, paging information in and out to simulate unlimited context.
VLA (Vision-Language-Action) Model
A neural network that takes visual observations and language instructions as input and directly outputs robot actions, combining perception, language understanding, and motor control in one model.
VLM (Vision-Language Model)
A model that jointly processes visual and textual inputs, capable of understanding images and generating text-based responses about visual content.
Voronoi Cell
In the context of token pruning, the region of query space for which a particular token embedding provides the highest similarity score. Tokens with larger Voronoi cells are more important to retain.
Workflow Induction
The process of automatically extracting reusable action templates from an agent's past task-solving trajectories by abstracting specific values into parameterized placeholders.
Working Memory
The active, limited-capacity memory used during current task processing, analogous to the LLM's context window during a single interaction.
World Model
A learned internal model that predicts how the environment will change in response to actions, allowing an agent to 'imagine' outcomes without actually executing them.
Zeroth-Order Optimization
Optimization methods that estimate gradients using only function evaluations (forward passes) rather than backpropagation, eliminating the memory overhead of storing activations for gradient computation.
Zettelkasten
A note-taking system where each idea is stored as an atomic note with unique identifiers and explicit links to related notes. Inspired A-Mem's approach to self-organizing agent memory.