AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation

📝 Paper Summary

Memory organization, Linear memory, Agentic reasoning

AtomMem replaces static memory workflows with a learnable policy that dynamically executes atomic Create, Read, Update, and Delete operations via reinforcement learning to adapt to task-specific information density.

Core Problem

Most agent memory systems rely on static, hand-crafted workflows (like mandatory updates or fixed forgetting schedules) that cannot adapt to the fluctuating information density of complex, long-horizon tasks.

Why it matters:

Rigid 'one-size-fits-all' rules lead to redundant operations when information is sparse, or to premature forgetting when early cues are critical for later reasoning
Continuous memory fusion strategies risk obscuring fine-grained details needed for precision-sensitive tasks
Existing methods like MemAgent allow content optimization but still enforce constrained workflows (e.g., mandatory updates), wasting cognitive resources

Concrete Example: In a long-context QA task where critical information appears early but is followed by noise, a static memory system with a fixed forgetting schedule might discard the early clue. Conversely, a system enforcing updates at every step will waste resources processing the noise, diluting the memory store.

Key Novelty

Deconstructed Atomic Memory Operations optimized via RL

Reframes memory management as a sequential decision-making problem rather than a fixed pipeline, utilizing atomic CRUD (Create, Read, Update, Delete) actions
Uses Group Relative Policy Optimization (GRPO) to train the agent to autonomously decide when to modify memory based on task context, rather than following heuristic rules

Architecture

The overall AtomMem framework showing the interaction between the agent, environment, and memory storage via atomic operations.

Evaluation Highlights

Outperforms prior static-workflow memory methods by approximately 2-5 percentage points across HotpotQA, 2WikiMultihopQA, and MuSiQue benchmarks
Demonstrates robust scalability in Needle-in-a-Haystack tasks, maintaining a significant performance lead even when context is extended to 800 documents (4x training size)
RL training improves performance by nearly 10 percentage points on average compared to the SFT initialization, verifying the benefit of dynamic memory policies

Breakthrough Assessment

8/10

Strong conceptual shift from static workflows to fully learnable atomic operations. The consistent performance gains and successful application of RL to memory management suggest a promising new direction for agentic memory.

⚙️ Technical Details

Problem Definition

Setting: Partially Observable Markov Decision Process (POMDP) where the agent manages internal storage to bridge information gaps

Inputs: Streaming environmental observations (chunks of text) and current memory state

Outputs: Joint action comprising task-specific execution and atomic memory operations (Create, Read, Update, Delete)

Pipeline Flow

Input Processing: Stream Chunk + Mandatory Scratchpad Retrieval
Optional Retrieval: Conditional Read from Vector DB
Policy Execution: LLM generates Task Action + Memory Operations
Memory Execution: Apply CRUD to Vector DB/Scratchpad
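
The four stages above can be sketched as a single step function. This is a minimal illustration under stated assumptions, not the authors' implementation: `MemoryOp`, `MemoryStore`, `step`, and `toy_policy` are hypothetical names, the scratchpad is modeled as a reserved entry, the Read query is passed in from the previous step, and keyword overlap stands in for the embedding-based vector database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

SCRATCHPAD_ID = 0  # assumption: the scratchpad is modeled as a reserved entry


@dataclass
class MemoryOp:
    """One atomic memory operation emitted by the policy (hypothetical schema)."""
    kind: str                       # "create" | "read" | "update" | "delete"
    entry_id: Optional[int] = None  # target entry for update/delete
    text: str = ""                  # new content (create/update) or query (read)


@dataclass
class MemoryStore:
    entries: Dict[int, str] = field(default_factory=lambda: {SCRATCHPAD_ID: ""})
    next_id: int = 1

    def apply(self, op: MemoryOp) -> List[str]:
        """Apply one operation; only "read" returns retrieved text."""
        if op.kind == "create":
            self.entries[self.next_id] = op.text
            self.next_id += 1
        elif op.kind == "update" and op.entry_id in self.entries:
            self.entries[op.entry_id] = op.text
        elif op.kind == "delete" and op.entry_id != SCRATCHPAD_ID:
            self.entries.pop(op.entry_id, None)
        elif op.kind == "read":
            # Keyword overlap is a stand-in for semantic retrieval.
            query = set(op.text.lower().split())
            return [t for i, t in self.entries.items()
                    if i != SCRATCHPAD_ID and query & set(t.lower().split())]
        return []


Policy = Callable[[str, str, List[str]], Tuple[str, List[MemoryOp]]]


def step(store: MemoryStore, chunk: str, policy: Policy,
         pending_read: Optional[str] = None) -> str:
    # 1. Input processing: the scratchpad is always part of the context.
    scratchpad = store.entries[SCRATCHPAD_ID]
    # 2. Optional retrieval: performed only if a Read query was requested.
    retrieved = store.apply(MemoryOp("read", text=pending_read)) if pending_read else []
    # 3. Policy execution: the model returns a task action plus memory operations.
    task_action, ops = policy(chunk, scratchpad, retrieved)
    # 4. Memory execution: apply each operation in order.
    for op in ops:
        store.apply(op)
    return task_action


def toy_policy(chunk: str, scratchpad: str, retrieved: List[str]):
    """Stand-in for the LLM: keep chunks mentioning "Paris", skip everything else."""
    if "Paris" in chunk:
        return "continue", [MemoryOp("create", text=chunk),
                            MemoryOp("update", SCRATCHPAD_ID, "found: Paris clue")]
    return "continue", []  # an irrelevant chunk triggers no memory operation
```

The empty operation list returned for an irrelevant chunk is the point of the design: unlike a workflow with mandatory updates, the policy may leave memory untouched.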

System Modules

Policy Model

Decides both task answers and memory operations based on current context

Model or implementation: Qwen3-8B

Memory Storage (Storage)

Stores persistent information chunks accessible via semantic retrieval

Model or implementation: Vector Database (Qwen3-embedding-0.6B)

Scratchpad (Storage)

Maintains global task state and is mandatorily retrieved at every step

Model or implementation: Text Buffer

Novel Architectural Elements

Decomposition of memory management into learnable atomic CRUD operations within the LLM's action space
Hybrid retrieval mechanism combining mandatory scratchpad access with policy-controlled selective retrieval (Read actions)
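
The hybrid retrieval mechanism can be illustrated with a short sketch. It assumes entries already carry embedding vectors (produced by Qwen3-embedding-0.6B in the paper's setup) and that similarity is cosine similarity; the summary only says "semantic retrieval", so the metric, the `build_context` name, and the entry layout are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_context(scratchpad: str,
                  store: List[Dict],
                  query_vec: Optional[Sequence[float]] = None,
                  top_k: int = 6) -> List[str]:
    """Assemble the context for one step.

    The scratchpad is always included. Vector-store entries are added only
    when the policy issued a Read (query_vec is not None).
    """
    context = [scratchpad]
    if query_vec is not None:
        ranked = sorted(store, key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
        context.extend(e["text"] for e in ranked[:top_k])
    return context
```

The default `top_k=6` mirrors the reported retrieval_top_k value.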

Modeling

Base Model: Qwen3-8B

Training Method: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)

Objective Functions:

Purpose: Maximize expected return (task success).

Formally: Advantage computed via Group Relative Policy Optimization (GRPO) based on terminal reward.

Purpose: Optimize output sequence likelihood.

Formally: Policy gradient objective weighted by advantage, distributed across all output tokens.
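
As a rough sketch of the two objectives, the standard GRPO formulation normalizes each rollout's terminal reward against the other rollouts sampled for the same prompt, and the resulting scalar advantage is then shared by every output token of that rollout. The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions here; the paper's exact variant may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize terminal rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_advantages(advantage: float, num_tokens: int) -> List[float]:
    """Broadcast a rollout-level advantage to every output token of that rollout."""
    return [advantage] * num_tokens
```

When all rollouts in a group receive the same reward, every advantage is zero and that group contributes no gradient signal.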

Training Data:

SFT: 4K prompt-completion pairs sampled from HotpotQA via rejection sampling using DeepSeek-V3
RL: Task-specific training on HotpotQA, 2WikiMultihopQA, and MuSiQue individually

Key Hyperparameters:

retrieval_top_k: 6
chunk_size: 4k tokens
embedding_model: Qwen3-embedding-0.6B
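
To make the chunk_size setting concrete, the sketch below splits a token sequence into the fixed-size chunks that the agent would observe one at a time. Reading "4k" as 4096 is an assumption, as is the `stream_chunks` name; a real pipeline would use the model's tokenizer rather than a plain list of strings.

```python
from typing import Iterator, List


def stream_chunks(tokens: List[str], chunk_size: int = 4096) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]
```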

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemAgent: AtomMem allows the agent to skip updates when information is irrelevant, whereas MemAgent enforces an update every step.
vs. MemoryBank: AtomMem learns the policy from data via RL, whereas MemoryBank uses hand-crafted heuristic rules.
vs. SCM: AtomMem covers the full CRUD action space, whereas SCM focuses on specific high-level operations like summarization or pruning.

Limitations

Dependent on the underlying LLM's capability to understand and output structured memory commands
RL training requires task-specific interactions and rewards, which may be costly compared to static heuristics
Performance drops significantly if either the scratchpad or vector storage is removed, indicating high dependency on the hybrid architecture

Reproducibility

Code: https://github.com/RUCBM/AtomMem

SFT data generation used DeepSeek-V3, and the embedding model is Qwen3-embedding-0.6B. Specific RL hyperparameters (learning rate, batch size) are reported in the paper's Appendix A and are not reproduced in this summary.

📊 Experiments & Results

Evaluation Setup

Multi-hop Question Answering with augmented long-context and multi-question settings

Benchmarks:

HotpotQA: multi-hop QA, augmented with irrelevant documents
2WikiMultihopQA: multi-hop QA, augmented
MuSiQue: multi-hop QA, augmented

Metrics:

Exact Match (EM)
Statistical methodology: Data points averaged over three repeated runs for numerical stability
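
For reference, Exact Match is usually computed after light answer normalization. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace); whether the paper uses exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import Iterable

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(g) for g in gold_answers))
```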

Key Results

Ablation studies demonstrate the necessity of specific memory components and the Update operation. The main comparison table values are not available in this summary; only deltas are listed.

Benchmark	Metric	Baseline	This Paper	Δ
Average across benchmarks	Performance drop	0	-5	-5
Average across benchmarks	Performance	High	Lower	Negative

Experiment Figures

Evolution of memory operation frequency during RL training.

Main Takeaways

RL training transforms the agent's behavior from passive reading (high Read usage) to active management (high Create/Update/Delete usage), correlating with higher performance.
The 'Update' operation is critical; removing it causes substantial performance drops, whereas removing 'Delete' has marginal impact in the tested scenarios.
The combination of Scratchpad and Vector Storage is essential; removing both causes catastrophic failure (>40 point drop), and they serve complementary roles that cannot be substituted by one another.
The system is robust to chunk size variations due to the base model's long-context capabilities, but sensitive to retrieval count (K), requiring sufficient context (K=6) for multi-hop reasoning.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (POMDP, Policy Optimization)
Large Language Models (Context windows, RAG)
Database operations (CRUD)

Key Terms

CRUD: Create, Read, Update, Delete—the four fundamental atomic operations for persistent storage management

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm used here to refine the memory policy by optimizing task-level success

SFT: Supervised Fine-Tuning—initial training phase using labeled examples to teach the model the API schema and basic behaviors

Scratchpad: A centralized memory entry that is mandatorily retrieved at every step to maintain global task state, complementing the selective vector storage

POMDP: Partially Observable Markov Decision Process—a mathematical framework for decision-making where the agent cannot directly observe the full state of the environment

Needle-in-a-Haystack: An evaluation setting where a small piece of critical information ('needle') is hidden within a large amount of irrelevant text ('haystack')

Vector Database: A storage system that indexes data via embedding vectors, used here to implement the 'Read' operation via semantic similarity