MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

📝 Paper Summary

Memory internalization · Agent evolution

MemGen equips LLM agents with a dynamic memory system that generates latent tokens on demand during reasoning, interleaving memory and cognition without modifying the core model weights.

Core Problem

Existing agent memory paradigms either rely on rigid retrieval from external databases (lacking fluid integration with reasoning) or update model parameters directly (causing catastrophic forgetting).

Why it matters:

Parametric memory methods like SFT suffer from catastrophic forgetting when learning new tasks, eroding general knowledge
Retrieval-based memory (RAG, ExpeL) is tethered to context engineering and often retrieves static information once at the start, failing to support dynamic, multi-step reasoning
Current systems lack the human-like ability to fluidly interweave memory recall with ongoing thought processes

Concrete Example: In a task like 'Find a flight from JFK to LAX and book a ride', a retrieval-based agent might fetch all flight info at the start. However, when it later realizes the API is down during execution, it lacks a mechanism to dynamically recall an alternative strategy (e.g., 'use an iterative search paradigm') mid-reasoning.

Key Novelty

Dynamic Generative Latent Memory (MemGen)

Decouples memory from the core reasoner by using a separate 'Memory Weaver' module that generates machine-native latent tokens (memory) only when triggered
Introduces a 'Memory Trigger' acting as a metacognitive monitor that decides exactly when to pause reasoning and insert memory tokens based on the agent's current hidden states
Treats memory as a generative act of reconstruction rather than static retrieval, allowing the agent to synthesize bespoke cognitive context on the fly

Evaluation Highlights

+31.7% improvement on ALFWorld and +27.1% on KodCode with Qwen3-8B compared to vanilla baselines
Surpasses parametric memory methods (REINFORCE++) by 5.8% and retrieval systems (ExpeL, AWM) by up to 38.22% on ALFWorld
Demonstrates strong cross-domain generalization: training on math tasks improves science reasoning (+6.06%) and code generation (+5.1%) without direct supervision

Breakthrough Assessment

9/10

Proposes a fundamentally new memory paradigm (generative latent memory) that solves the rigidity of RAG and the forgetting of SFT. Strong empirical results and emergent human-like memory hierarchy justify the high score.

⚙️ Technical Details

Problem Definition

Setting: Joint optimization of a fixed policy π_θ and a memory system M to maximize expected reward over a task distribution, utilizing past experience history H

Inputs: Current environment state s_t and interaction history

Outputs: Action sequence a_t augmented with generated latent memory tokens m_t

Pipeline Flow

Frozen Reasoner (generates tokens autoregressively)
Memory Trigger (monitors hidden states at delimiters to decide invocation)
Memory Weaver (generates latent memory tokens if triggered)
Reasoning Resumption (Reasoner continues generation conditioned on inserted latent tokens)
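
The four pipeline stages above can be sketched as a single decoding loop. This is a minimal toy sketch, not the authors' implementation: the reasoner, trigger, and weaver are passed in as plain callables, latent memory is represented by placeholder strings such as <mem_0> (in the real system it would be hidden-state vectors), and the names weave_generate, DELIMITERS, threshold, and k are all hypothetical.

```python
from typing import Callable, List

# Assumed delimiter set; the summary only says "e.g., punctuation".
DELIMITERS = {",", ".", "\n"}


def weave_generate(
    next_token: Callable[[List[str]], str],
    trigger_prob: Callable[[List[str]], float],
    weave: Callable[[List[str], int], List[str]],
    prompt: List[str],
    max_new_tokens: int = 32,
    k: int = 4,
    threshold: float = 0.5,
) -> List[str]:
    """Interleave reasoner tokens with latent memory placeholders."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # frozen reasoner emits one token
        context.append(token)
        if token == "<eos>":
            break
        # The trigger is consulted only at delimiter tokens.
        if token in DELIMITERS and trigger_prob(context) > threshold:
            # The weaver inserts K latent memory slots; the reasoner then
            # resumes generation conditioned on them.
            context.extend(weave(context, k))
    return context


def demo() -> List[str]:
    """Run the loop with scripted stand-ins for all three modules."""
    script = iter(["open", "drawer", ".", "take", "key", ".", "<eos>"])
    next_token = lambda ctx: next(script)
    # Hypothetical rule: fire only at the first delimiter.
    trigger_prob = lambda ctx: 0.9 if ctx.count(".") == 1 else 0.1
    weave = lambda ctx, k: [f"<mem_{i}>" for i in range(k)]
    return weave_generate(next_token, trigger_prob, weave, ["task:"], k=2)
```

In this toy run the memory slots appear once, right after the first period, which mirrors the sparse, on-demand invocation described above.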

System Modules

Frozen Reasoner

Generate action tokens and provide hidden states for monitoring

Model or implementation: LLM (e.g., Qwen3-8B, SmolLM3-3B), frozen weights

Memory Trigger

Decide whether to invoke memory generation at specific delimiters (e.g., punctuation)

Model or implementation: LoRA adapter on Reasoner

Memory Weaver

Synthesize latent memory tokens to guide reasoning

Model or implementation: LoRA adapter on Reasoner

Novel Architectural Elements

Interleaved generation cycle where a Trigger interrupts the main LLM to insert generated latent tokens from a separate Weaver module
Use of separate LoRA adapters for metacognitive monitoring (Trigger) and memory synthesis (Weaver) while keeping the base model frozen
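
To illustrate how separate LoRA adapters can sit on one frozen weight matrix, the sketch below applies the standard LoRA update y = Wx + (alpha / r) * B(Ax) in pure Python. It is an assumption-laden toy: the matrices, the rank r = 1, and the adapter values are made up for illustration and say nothing about the adapter shapes or ranks used in the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(
    w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float, r: int
) -> Vector:
    """Compute W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    scale = alpha / r
    return [y + scale * d for y, d in zip(base, delta)]


# One frozen base weight shared by both adapters (toy 2x2 identity).
W = [[1.0, 0.0], [0.0, 1.0]]
X = [1.0, 2.0]

# Hypothetical rank-1 adapters: A is 1x2, B is 2x1.
WEAVER_A, WEAVER_B = [[1.0, 1.0]], [[0.5], [0.0]]
TRIGGER_A, TRIGGER_B = [[0.0, 1.0]], [[0.0], [1.0]]

weaver_out = lora_forward(W, WEAVER_A, WEAVER_B, X, alpha=2.0, r=1)
trigger_out = lora_forward(W, TRIGGER_A, TRIGGER_B, X, alpha=1.0, r=1)
# With B set to zero the adapter is a no-op and the frozen output is recovered.
frozen_out = lora_forward(W, WEAVER_A, [[0.0], [0.0]], X, alpha=2.0, r=1)
```

Because W is never modified, switching between the two adapters changes the module's behavior without touching what the base model already knows.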

Modeling

Base Model: Qwen2.5-1.5B, SmolLM3-3B, Qwen3-8B

Training Method: Two-stage training: (1) Memory Weaver training via SFT or GRPO, (2) Memory Trigger training via RL

Objective Functions:

Purpose: Train the Weaver to generate useful memory tokens that maximize task reward.

Formally: Maximize E[R(τ)] optimizing only Weaver parameters θ'

Purpose: Train the Trigger to invoke memory only when necessary (sparse activation).

Formally: Maximize R(τ) - λ * Σ max(0, activation_prob - threshold)
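
As a worked instance of the Trigger objective above, the following sketch evaluates R(τ) - λ * Σ max(0, activation_prob - threshold) for one trajectory. The function name and the numeric values are hypothetical; the paper's actual reward shaping and hyperparameters are not reproduced here.

```python
from typing import Sequence


def trigger_objective(
    reward: float,
    activation_probs: Sequence[float],
    threshold: float,
    lam: float,
) -> float:
    """Task reward minus a hinge penalty on over-threshold activation probabilities."""
    penalty = sum(max(0.0, p - threshold) for p in activation_probs)
    return reward - lam * penalty


# Toy trajectory with three delimiter positions (values are illustrative).
# Penalty = (0.75 - 0.5) + 0 + (1.0 - 0.5) = 0.75, so the objective is
# 1.0 - 0.5 * 0.75 = 0.625.
score = trigger_objective(1.0, [0.75, 0.25, 1.0], threshold=0.5, lam=0.5)

# A trajectory that never exceeds the threshold keeps its full reward.
sparse_score = trigger_objective(1.0, [0.25, 0.5, 0.125], threshold=0.5, lam=0.5)
```

The hinge term only penalizes probabilities above the threshold, so two trajectories with equal reward are ranked by how sparingly they invoke memory.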

Adaptation: LoRA (Low-Rank Adaptation) for both Trigger and Weaver modules

Key Hyperparameters:

latent_memory_length_K: {2, 4, 8}
learning_rate: Not explicitly reported in the paper
batch_size: Not explicitly reported in the paper

Compute: Inference delay ranges from 24% to 94% of vanilla LLM latency (efficient due to sparse activation)

Comparison to Prior Work

vs. ExpeL/AWM: Generates memory as latent tokens rather than retrieving static text; integrates memory dynamically during reasoning rather than just at the start
vs. REINFORCE++/SFT: Updates a separate Weaver module instead of the base model, avoiding catastrophic forgetting
vs. SoftCoT: Uses a learned Trigger to decide when to insert memory, whereas SoftCoT typically inserts at fixed positions or steps
vs. MemoryBank: Generative reconstruction of memory vs. rigid database retrieval

Limitations

Latent memory tokens are not human-readable, making interpretability reliant on post-hoc analysis
Depends on the frozen reasoner's capacity; very small models might not effectively utilize latent context
Requires training two separate modules (Trigger and Weaver), adding complexity over simple SFT

Reproducibility

Code: https://github.com/KANABOON1/MemGen

The paper provides algorithms and architectural details but omits specific learning rates and batch sizes in the main text.

📊 Experiments & Results

Evaluation Setup

Evaluation across 9 benchmarks in 5 domains (web search, embodied action, math, science, coding)

Benchmarks (partial list):

ALFWorld (Embodied decision making)
TriviaQA (Web search / QA)
KodCode (Coding)
GSM8K (Math reasoning)
GPQA (Scientific reasoning)

Metrics:

Success Rate / Accuracy (%)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on Qwen3-8B showing MemGen surpassing both parametric and retrieval baselines.

Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	85.60	90.60	+5.00
KodCode	Success Rate	72.90	76.16	+3.26
PopQA	Accuracy	40.33	62.30	+21.97

Results on smaller model (SmolLM3-3B) demonstrating significant gains where baselines struggle.

Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	18.96	63.60	+44.64
TriviaQA	Accuracy	46.20	79.30	+33.10
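
The Δ column in the rows above is the difference between the "This Paper" and "Baseline" columns. A small sketch that recomputes it from the reported numbers follows; the ROWS layout and the function name are illustrative, and the figures are copied from the table without any additional data.

```python
from typing import List, Tuple

# (model, benchmark, baseline, this_paper, reported_delta), copied from the table.
ROWS: List[Tuple[str, str, float, float, float]] = [
    ("Qwen3-8B", "ALFWorld", 85.60, 90.60, 5.00),
    ("Qwen3-8B", "KodCode", 72.90, 76.16, 3.26),
    ("Qwen3-8B", "PopQA", 40.33, 62.30, 21.97),
    ("SmolLM3-3B", "ALFWorld", 18.96, 63.60, 44.64),
    ("SmolLM3-3B", "TriviaQA", 46.20, 79.30, 33.10),
]


def delta_mismatches(
    rows: List[Tuple[str, str, float, float, float]], tol: float = 1e-6
) -> List[Tuple[str, str]]:
    """Return (model, benchmark) pairs whose reported delta differs from ours - baseline."""
    return [
        (model, bench)
        for model, bench, base, ours, delta in rows
        if abs((ours - base) - delta) > tol
    ]


# An empty list means every reported delta matches the two score columns.
mismatches = delta_mismatches(ROWS)
```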

Main Takeaways

MemGen consistently outperforms retrieval-based methods (ExpeL, AWM), especially on reasoning-intensive tasks where static retrieval fails
Emergent memory hierarchy: Post-hoc analysis reveals latent tokens specialize into planning, procedural, and working memory functions without explicit supervision
Cross-domain generalization: Training on one domain (e.g., Math) improves performance on others (e.g., Science, Code), unlike SFT which often degrades unseen domains
Continual learning: MemGen retains performance on earlier tasks (e.g., AQuA) better than SFT after sequential training on new tasks

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics (policy, reward, trajectories)
Transformer architecture (hidden states, KV cache)
LoRA (Low-Rank Adaptation)
Latent space representation in LLMs

Key Terms

Latent Memory: Machine-native memory representations (vectors/tokens) generated internally by the model rather than retrieved as natural language text

Memory Trigger: A learned module that monitors the reasoning process and decides when to interrupt generation to insert memory tokens

Memory Weaver: A generative module (LoRA adapter) that synthesizes latent memory tokens based on the current context and internal knowledge

Catastrophic Forgetting: The tendency of neural networks to lose previously learned knowledge when trained on new data

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies based on relative performance within a group of outputs

SFT: Supervised Fine-Tuning—training a model on labeled examples

LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small set of added parameters
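
For the GRPO entry above, "relative performance within a group" usually means that each sampled output's reward is normalized by the mean and standard deviation of its group. The sketch below shows that normalization only, not a full training loop; the use of the population standard deviation and the eps value are assumptions, since implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled outputs for one prompt, with binary task rewards (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every output earns the same reward, no output is preferred over another.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful outputs receive positive advantages and failed ones negative, so no separate learned value function is needed to provide a baseline.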