MEM-α: Learning memory construction via RL

📝 Paper Summary

Memory organization, Agentic memory management, Reinforcement Learning for Agents

Mem-α uses reinforcement learning to train LLM agents to actively manage a complex, multi-component memory system, optimizing directly for downstream question-answering accuracy rather than relying on fixed rules.

Core Problem

Current memory-augmented agents rely on fixed, pre-defined rules or prompts to update memory, but LLMs often fail to determine what to store, how to structure it, or when to update it effectively, especially in complex scenarios.

Why it matters:

Pre-defined rules are brittle and cannot adapt to diverse interaction patterns, leading to information loss or bloated memory
Even state-of-the-art models like GPT-4o struggle to spontaneously select the correct tools for complex memory updates without explicit training
Small models with weak instruction-following capabilities get overwhelmed by complex memory tool sets, making effective long-term memory inaccessible to efficient models

Concrete Example: When an agent receives a long stream of information that mixes casual conversation, storytelling, and factual documents, a rule-based system might save everything (overflowing the context window) or miss subtle plot points. Mem-α learns to extract only the facts needed to answer future questions and to discard the noise.

Key Novelty

Reinforcement Learning for Active Memory Construction

Formulates memory management as a sequential decision-making problem where the agent decides how to update Core, Episodic, and Semantic memory chunks
Optimizes memory construction directly against downstream QA performance (RAG accuracy) rather than supervising the memory trace itself, allowing the agent to discover its own optimal storage strategies
Achieves strong length generalization: trained on sequences of at most 30k tokens, yet generalizes to sequences longer than 400k tokens

Architecture

The memory architecture and interaction flow. It displays the three memory components (Core, Semantic, Episodic) and the allowed operations for each.

Evaluation Highlights

Generalizes to sequences exceeding 400k tokens (13× the max training length of 30k) while maintaining high retrieval accuracy
Outperforms existing memory baselines (including MemGPT and Mem0) across diverse interaction patterns
Demonstrates that RL enables agents to learn fundamental memory principles (what to keep/discard) rather than just memorizing patterns

Breakthrough Assessment

8/10

Significant advance in making memory agents 'active' learners rather than passive rule-followers. The roughly 13× length generalization from training (30k tokens) to inference (over 400k tokens) is a particularly strong result for RL-based methods.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making for memory construction over a stream of conversation chunks

Inputs: A sequence of conversation chunks C = {c_1, ..., c_n}

Outputs: A sequence of memory write actions A = {a_1, ..., a_n} resulting in a final memory state M_n

Pipeline Flow

Input Processing: Chunk Stream → Action Generation
Memory Operations: Execute Actions → Update Memory Components
Evaluation (Reward Calculation): RAG → Answer Generation → Reward Computation
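
The three stages above can be read as one rollout loop. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `policy`, `retrieve`, and `answer` callables, the `(operation, component, text)` action format, and the `insert`/`delete` operation names are all hypothetical stand-ins for the memory agent, the retriever, and the answer generator.

```python
from typing import Callable, Dict, List, Tuple

Memory = Dict[str, List[str]]      # component name -> list of text entries
Action = Tuple[str, str, str]      # (operation, component, text); hypothetical format


def rollout(chunks: List[str],
            policy: Callable[[str, Memory], List[Action]],
            questions: List[Tuple[str, str]],
            retrieve: Callable[[str, Memory], List[str]],
            answer: Callable[[str, List[str]], str]) -> Tuple[Memory, float]:
    """Build memory chunk by chunk, then score the final memory by QA accuracy."""
    memory: Memory = {"core": [], "episodic": [], "semantic": []}
    for chunk in chunks:
        # Stage 1: the policy sees the current chunk and memory and emits actions.
        for op, component, text in policy(chunk, memory):
            # Stage 2: execute each action against the chosen memory component.
            if op == "insert":
                memory[component].append(text)
            elif op == "delete" and text in memory[component]:
                memory[component].remove(text)
    # Stage 3: answer the questions from retrieved memory entries only.
    correct = 0
    for question, gold in questions:
        context = retrieve(question, memory)
        if answer(question, context).strip().lower() == gold.strip().lower():
            correct += 1
    accuracy = correct / len(questions) if questions else 0.0
    return memory, accuracy
```

Because the questions are answered only from the final memory, anything the policy fails to store cannot contribute to the reward.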

System Modules

Memory Agent (Policy)

Decides which memory operations to perform based on current chunk and memory state

Model or implementation: Qwen-2.5-7B-Instruct (base model for training)

Memory System

Stores and organizes information across three components

Model or implementation: Structured Data Stores (Core, Episodic, Semantic)
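
One way to picture the three components is as a small data structure. The sketch below is only an illustration: it assumes Core is a single summary string, Episodic is a list of timestamped events kept in chronological order, and Semantic is a list of discrete facts. The class name, method names, and the whitespace-based length measure are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryState:
    core: str = ""                                                  # one persistent summary
    episodic: List[Tuple[str, str]] = field(default_factory=list)   # (timestamp, event)
    semantic: List[str] = field(default_factory=list)               # discrete facts

    def update_core(self, summary: str) -> None:
        """Overwrite the persistent summary."""
        self.core = summary

    def add_event(self, timestamp: str, event: str) -> None:
        """Append an event and keep the log sorted by timestamp (ISO strings assumed)."""
        self.episodic.append((timestamp, event))
        self.episodic.sort(key=lambda item: item[0])

    def add_fact(self, fact: str) -> None:
        """Store a fact unless an identical one is already present."""
        if fact not in self.semantic:
            self.semantic.append(fact)

    def length(self) -> int:
        """Total size in whitespace-separated words, a crude proxy for token count."""
        texts = [self.core] + [event for _, event in self.episodic] + self.semantic
        return sum(len(text.split()) for text in texts)
```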

Retriever

Retrieves relevant memory entries to answer evaluation questions

Model or implementation: BM25 (frozen)
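
To make the retriever's behavior concrete, here is a compact, dependency-free sketch of Okapi BM25 scoring over memory entries. It is illustrative only: the whitespace tokenizer, the `k1` and `b` values (common defaults), and the function names are assumptions, not settings taken from the paper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query: str, docs: List[str], k: int = 2) -> List[str]:
    """Return the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:k]]
```

BM25 matches exact tokens, so a query containing "move" gets no credit from an entry that says "moved". A write policy trained against this retriever is therefore rewarded for storing entries whose wording overlaps with likely questions.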

Novel Architectural Elements

Three-component learnable memory architecture (Core, Episodic, Semantic) where update policies are learned via RL rather than scripted
Decoupled training pipeline: Write policy is learned via RL, while Read/Retrieval is fixed (BM25) to isolate memory construction quality

Modeling

Base Model: Qwen-2.5-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
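
GRPO scores each sampled rollout relative to the other rollouts in its group, which removes the need for a separate value function. The sketch below shows that normalization and a PPO-style clipped term. It is a simplified illustration: the use of population standard deviation, the `eps` stabilizer, and the `clip_eps` value of 0.2 are assumptions, and any additional regularization terms are omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts: (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one rollout.

    `ratio` is the new-to-old policy probability ratio for the sampled actions.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

If every rollout in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.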

Objective Functions:

Purpose: Maximize expected reward of memory actions.
Formally: J(θ) = E[ (1/G) * Σ_{i=1..G} A_i * clip_ratio_i ], where A_i = (r_i - mean(r)) / std(r) is the reward of rollout i normalized within its group of G rollouts

Purpose: Reward correctness (QA accuracy).
Formally: r1 = l/m, the fraction of the m evaluation questions answered correctly (l correct)

Purpose: Reward format validity.
Formally: r2 = fraction of tool calls that execute successfully

Purpose: Reward compression/efficiency.
Formally: r3 = 1 - (memory_length / context_length)

Purpose: Reward semantic validity (verified by an external model).
Formally: r4 = fraction of semantically valid updates (checked by Qwen3-32B)
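
The four reward terms can be computed from simple per-rollout counts. The sketch below is a hedged illustration: the `RolloutStats` fields are hypothetical names, and the combination `r1 + r2 + beta * r3 + gamma * r4` is an assumption about how the tunable weights β and γ enter the total. The default weight values are placeholders, not values reported in this summary.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RolloutStats:
    correct_answers: int     # l: questions answered correctly
    total_questions: int     # m: questions asked
    successful_calls: int    # tool calls that executed without error
    total_calls: int
    memory_length: int       # size of the final memory, in tokens
    context_length: int      # size of the full input stream, in tokens
    valid_updates: int       # updates judged semantically valid by the external model
    total_updates: int


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def reward_terms(s: RolloutStats) -> Tuple[float, float, float, float]:
    r1 = _ratio(s.correct_answers, s.total_questions)        # correctness
    r2 = _ratio(s.successful_calls, s.total_calls)           # format validity
    r3 = 1.0 - _ratio(s.memory_length, s.context_length)     # compression
    r4 = _ratio(s.valid_updates, s.total_updates)            # semantic validity
    return r1, r2, r3, r4


def total_reward(s: RolloutStats, beta: float = 0.05, gamma: float = 0.1) -> float:
    """Weighted sum of the four terms; the weighting scheme is an assumption."""
    r1, r2, r3, r4 = reward_terms(s)
    return r1 + r2 + beta * r3 + gamma * r4
```

The compression term pulls against correctness: storing less raises r3 but risks lowering r1, so the weight on r3 controls how aggressively the agent prunes.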

Training Data:

4,139 total instances spanning diverse patterns (Conversation, Document, Pattern, Story)
Stratified sampling used to create a balanced subset of 562 instances for RL training
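
The summary does not spell out the sampling procedure, so the following is only one plausible sketch of drawing a balanced subset: give each category an equal quota, then fill any remaining slots at random. The function name, the `key` accessor, and the equal-quota rule are all assumptions.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, TypeVar

T = TypeVar("T")


def stratified_subset(instances: List[T],
                      key: Callable[[T], Hashable],
                      target_size: int,
                      seed: int = 0) -> List[T]:
    """Draw roughly equal numbers of instances from each category."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[key(inst)].append(inst)
    if not groups:
        return []
    quota = target_size // len(groups)      # equal share per category
    subset, leftovers = [], []
    for name in sorted(groups, key=str):    # fixed order for reproducibility
        members = groups[name][:]
        rng.shuffle(members)
        subset.extend(members[:quota])
        leftovers.extend(members[quota:])
    # Fill slots left by rounding or by categories smaller than the quota.
    rng.shuffle(leftovers)
    subset.extend(leftovers[: target_size - len(subset)])
    return subset
```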

Key Hyperparameters:

max_training_length: 30,000 tokens
reward_weights: Tunable parameters β and γ in reward function

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemGPT: Mem-α learns the update policy via RL instead of relying on prompt engineering; MemGPT is static.
vs. Memory-R1: Mem-α handles a complex 3-part memory structure (Core/Episodic/Semantic) capable of evolving knowledge, whereas Memory-R1 uses simpler text-only memory or simple lists.
vs. SELF-PARAM [not cited in paper]: SELF-PARAM internalizes memory into weights; Mem-α uses external explicit storage, which allows effectively unbounded capacity and direct editing of stored entries.
vs. MIRIX: Mem-α trains the agent to use tools, whereas MIRIX expects the model to use complex tools zero-shot (which fails for smaller models).

Limitations

Conflict resolution (handling contradictory information) was excluded from evaluation due to lack of realistic benchmarks.
Reliance on a fixed retriever (BM25) during training means the memory structure is optimized specifically for lexical overlap, potentially limiting semantic retrieval capabilities.
Computational overhead of RL training on long sequences necessitated using a small subset (562 instances) of the full dataset.
Reward signal is sparse (delayed until end of sequence), which is generally hard to optimize, though intermediate format rewards help.

Reproducibility

The paper does not explicitly state whether code is available. Dataset details (4,139 instances) and construction methods are described. The base model has open weights (Qwen-2.5), but the trained weights and exact training scripts are not linked.

📊 Experiments & Results

Evaluation Setup

Memory construction followed by RAG-based Question Answering

Benchmarks:

MemoryAgentBench (Long-context Question Answering requiring memory) [New]

Metrics:

Accuracy (QA correctness)
Memory Quality (Compression ratio)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

The RL training framework loop.

Main Takeaways

RL training allows agents to learn efficient memory management strategies that generalize far beyond their training horizon (30k -> 400k tokens).
The specialized 3-component memory architecture (Core, Episodic, Semantic) provides necessary structure for diverse information types.
Direct optimization for QA performance naturally encourages the agent to filter noise and retain only salient information.
The method works on smaller models (7B), enabling them to handle complex memory tasks that usually require frontier models.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Policy Optimization)
Retrieval-Augmented Generation (RAG)
LLM Tool Use / Function Calling

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs to stabilize training without a separate value function

BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query

Core Memory: A persistent text summary that stays in the agent's active context (RAM-like)

Episodic Memory: A chronologically organized collection of timestamped events (log-like)

Semantic Memory: A structured collection of discrete factual statements or knowledge (database-like)

LoCoMo: Long-term Conversational Memory, a benchmark of very long multi-session dialogues used to evaluate the long-term memory of LLM agents

SQuAD: Stanford Question Answering Dataset—a reading comprehension benchmark

Function Calling: The ability of an LLM to output structured text that triggers external code or APIs