MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

📝 Paper Summary

MemPO enables agents to autonomously manage their own memory by incorporating memory-specific rewards into Group Relative Policy Optimization, ensuring retained information aligns with task objectives.

Core Problem

Existing agent memory methods rely on external storage or passive RAG, preventing the model from proactively learning which information is crucial for the specific task.

Why it matters:

Fixed context windows limit the number of interactions an agent can have before crashing or forgetting
External retrieval systems (RAG) often retrieve based on semantic similarity rather than task utility, fetching irrelevant data
Long contexts cause the 'lost in the middle' phenomenon and incur high token costs, hindering deployment in complex real-world scenarios

Concrete Example: In a multi-step research task, a standard ReAct agent keeps appending every search result to the context. Eventually, the context exceeds the limit or the model gets distracted by early irrelevant searches, failing to answer the final question. RAG might retrieve a superficially similar but useless fact. MemPO, however, summarizes only the critical findings into a <mem> block at each step, discarding the rest.

Key Novelty

Self-Memory Policy Optimization (MemPO)

Treats memory management as an intrinsic action (<mem> token generation) optimized via Reinforcement Learning rather than a fixed external module
Introduces a dual-reward system: a sparse trajectory-level reward for final answer correctness and a dense step-level reward for memory quality
Calculates memory quality by measuring how much the generated memory increases the model's conditional probability of generating the correct ground-truth answer
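
The memory-quality signal described in the last bullet can be illustrated with a small sketch. It assumes the answer probability is the product of per-token probabilities (the exponential of the summed token log-probs); whether the paper length-normalizes this value is not stated in this summary. The function names and the numbers are hypothetical.

```python
import math

def sequence_probability(token_logprobs):
    """Probability of a whole answer string: exp of the summed token log-probs."""
    return math.exp(sum(token_logprobs))

def memory_reward(answer_logprobs_given_memory, answer_logprobs_given_history):
    """Difference between P(answer | memory context) and P(answer | prior context)."""
    p_mem = sequence_probability(answer_logprobs_given_memory)
    p_hist = sequence_probability(answer_logprobs_given_history)
    return p_mem - p_hist

# Hypothetical per-token log-probs of a two-token ground-truth answer.
with_memory = [math.log(0.9), math.log(0.8)]     # product = 0.72
without_memory = [math.log(0.5), math.log(0.4)]  # product = 0.20
reward = memory_reward(with_memory, without_memory)
print(round(reward, 2))  # 0.52
```

A positive value means the new memory block made the ground-truth answer more likely than the context it replaced; a negative value means useful information was lost.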

Architecture

The MemPO interaction paradigm showing the cyclical process of memory update, reasoning, and tool use.

Evaluation Highlights

+25.98 absolute F1 points over the base model on long-horizon benchmarks
+7.1 absolute F1 points over the previous SOTA baseline (MEM1)
Reduces token usage by 67.58% compared to the base model and 73.12% compared to previous SOTA

Breakthrough Assessment

8/10

Significant efficiency gains and performance improvements on long-horizon tasks by successfully applying RL directly to the memory management process, a difficult credit assignment problem.

⚙️ Technical Details

Problem Definition

Setting: Long-horizon Question Answering where an agent interacts with an environment over T steps

Inputs: Natural language question q

Outputs: Final answer a_pred

Pipeline Flow

Agent Input Processing
Memory Generation (<mem>)
Reasoning & Tool Use (<think>, <tool_call>)
Environment Feedback (<information>)
Context Update (Replace history with <mem>)
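
The five steps above can be sketched as a loop. This is a minimal illustration, not the authors' implementation: `policy` and `tool` are hypothetical callables, the `<answer>` tag used to end the episode is an assumption (it does not appear in this summary), and the exact contents carried into the next step (question, new memory, latest feedback) are likewise assumed.

```python
import re

def extract(tag, text):
    """Return the content of the first <tag>...</tag> block, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def run_episode(question, policy, tool, max_rounds=16):
    """Run the loop; `policy` and `tool` are hypothetical callables (str -> str)."""
    context = question
    for _ in range(max_rounds):
        output = policy(context)            # emits <mem>, <think>, then a call or an answer
        answer = extract("answer", output)  # assumed termination tag
        if answer is not None:
            return answer
        memory = extract("mem", output) or ""
        call = extract("tool_call", output) or ""
        info = tool(call)                   # environment feedback
        # Context update: earlier history is dropped; only the question,
        # the new memory block, and the latest feedback are carried forward.
        context = (f"{question}\n<mem>{memory}</mem>\n"
                   f"<information>{info}</information>")
    return None

# Stubs standing in for the fine-tuned model and the search tool.
def stub_policy(context):
    if "<information>" in context:
        return "<mem>Paris is the capital</mem><think>done</think><answer>Paris</answer>"
    return "<mem></mem><think>need search</think><tool_call>capital of France</tool_call>"

def stub_tool(query):
    return f"result for: {query}"

print(run_episode("What is the capital of France?", stub_policy, stub_tool))  # Paris
```

Because the context is rebuilt at every round, its size is bounded by the memory block and one tool result rather than growing with the number of rounds.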

System Modules

Memory Generator

Summarize effective information from previous outputs into a concise block

Model or implementation: Qwen2.5-7B (fine-tuned)

Reasoning & Action Engine

Perform reasoning and generate tool calls based on current memory

Model or implementation: Qwen2.5-7B (fine-tuned)

Tool Executor

Execute tool calls and return results

Model or implementation: External Tools (Wiki Search / Web Search)

Novel Architectural Elements

Intrinsic memory action space (<mem>) integrated directly into the agent's autoregressive generation loop
Context replacement mechanism where the previous full history is discarded and replaced by the generated <mem> block for the next inference step

Modeling

Base Model: Qwen2.5-7B

Training Method: Group Relative Policy Optimization (GRPO) with specialized memory rewards
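
GRPO scores each rollout relative to the other rollouts sampled for the same question, so no separate value network is needed. A minimal sketch of that group normalization follows; the use of the population standard deviation and the small `eps` stabilizer are assumptions, and the reward values are hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical final-answer rewards for four rollouts of one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat the group average get a positive advantage and the rest a negative one; if every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal.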

Objective Functions:

Purpose: Maximize expected return of the policy using importance sampling and KL regularization.

Formally: J(θ) = E[min(ratio * A, clip(ratio) * A) - β * D_KL]

Purpose: Calculate token-level advantage by combining trajectory and memory rewards.

Formally: A_{i,k} = A^T + A^M(τ_i(s^{mem}_t)) if token k is in memory, else A^T

Purpose: Evaluate memory quality via conditional probability of ground truth answer.

Formally: R^M(τ_i(s^{mem}_t)) = P(s^{ans} | τ_i(s^{mem}_t)) - P(s^{ans} | τ_i(s_{<t}))
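
The token-level advantage rule and the clipped term of the objective can be sketched as follows. Several details are assumptions: the memory advantage A^M is passed in as a precomputed number (this summary does not say how it is normalized), the clip range `eps=0.2` is a common default rather than a reported value, and the KL penalty is omitted for brevity.

```python
def token_advantages(num_tokens, traj_adv, memory_spans):
    """Assign A^T to every token and add A^M to tokens inside a <mem> span.

    memory_spans: list of (start, end, mem_adv) tuples with `end` exclusive.
    """
    adv = [traj_adv] * num_tokens
    for start, end, mem_adv in memory_spans:
        for k in range(start, end):
            adv[k] = traj_adv + mem_adv
    return adv

def clipped_term(ratio, advantage, eps=0.2):
    """Per-token min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical 6-token rollout whose tokens 1 and 2 form the memory block.
adv = token_advantages(6, traj_adv=0.5, memory_spans=[(1, 3, 0.25)])
print(adv)  # [0.5, 0.75, 0.75, 0.5, 0.5, 0.5]
print(clipped_term(1.5, adv[1]))
```

Only the memory tokens receive the extra dense signal, which is how the step-level memory reward reaches the tokens responsible for it while the rest of the trajectory is credited by the final outcome alone.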

Training Data:

2-objective tasks synthesized from HotpotQA and NQ validation sets
Random subset from HotpotQA and NQ

Key Hyperparameters:

learning_rate: 1e-6
batch_size: 128
group_size_N: 16
max_interaction_rounds: 16

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReAct: Uses compressed blocks instead of full linear history
vs. A-MEM: Proactively summarizes/compresses based on task utility rather than passive retrieval similarity
vs. MEM1: Explicitly rewards memory content based on conditional probability of the answer, rather than just final outcome
vs. TiM [not cited in paper]: TiM (Think-in-Memory) also maintains a memory state, but MemPO uses RL to optimize the memory content specifically for answer probability rather than just maintaining state consistency

Limitations

Memory information content naturally differs across rollout steps, potentially introducing bias in group-based advantage calculation
Relies on ground truth answers for reward calculation, limiting applicability to open-ended tasks without clear answers
Generalization to diverse real-world settings needs further investigation beyond QA benchmarks

Reproducibility

Code is not provided. The method relies on synthetic data generated with GPT-4.1. The base model is Qwen2.5-7B.

📊 Experiments & Results

Evaluation Setup

Multi-turn Question Answering with tool use (Wiki Search and Web Search)

Benchmarks:

HotpotQA (Multi-objective) (Multi-hop reasoning QA) [New]
Natural Questions (NQ) (Open-domain QA)
TriviaQA (Complex compositional QA)
GAIA (General AI Assistants; deep research)
Frames (Multi-perspective reasoning)

Metrics:

F1 score
Exact Match (EM)
Total Tokens (TT)
Peak Tokens (PT)

Statistical methodology: Not explicitly reported in the paper
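
For reference, F1 and EM for short-answer QA are commonly computed over normalized tokens. The sketch below follows the widely used SQuAD-style convention (lowercasing, stripping punctuation and English articles); it is an assumption that the paper uses this exact normalization.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))
```

In the second call the prediction contains both ground-truth tokens plus two extra ones, so recall is 1.0 and precision is 0.5: F1 gives partial credit where EM would give none.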

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on multi-objective tasks (combining HotpotQA/NQ) with increasing difficulty (4 to 10 objectives).
HotpotQA (10-objective)	F1	39.6	46.7	+7.1
HotpotQA (10-objective)	Total Tokens (TT)	73431	24982	-48449
Performance on Deep Research Benchmarks (OOD generalization).
GAIA	Accuracy	32.00	32.00	0.00
Frames	Accuracy	36.80	55.60	+18.80
Ablation study isolating the impact of the memory-specific reward.
Multi-objective task	F1	32.0	44.0	+12.0

Experiment Figures

Analysis of conditional probability of the correct answer given the generated memory across interaction steps.

Distribution of memory conditional probabilities and their relationship to answer accuracy.

Main Takeaways

MemPO matches or exceeds both ReAct and other memory-augmented baselines across diverse benchmarks (HotpotQA, TriviaQA, GAIA).
The method sharply reduces token consumption (both total and peak) by compressing history into concise memory blocks; on the 10-objective HotpotQA task, total tokens fall from 73,431 to 24,982.
Ablation studies confirm that the memory-specific reward (based on answer probability) is crucial; without it, F1 on the multi-objective task drops from 44.0 to 32.0.
The approach transfers to Out-of-Distribution (OOD) deep-research tasks even when trained only on QA datasets: accuracy on Frames rises from 36.80 to 55.60, while accuracy on GAIA is unchanged at 32.00.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Group Relative Policy Optimization (GRPO)
Retrieval-Augmented Generation (RAG)
ReAct Agent paradigm

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a group of outputs generated from the same input, avoiding the need for a separate value network

ReAct: Reason+Act—a paradigm where agents generate reasoning traces and task-specific actions in an interleaved manner

RAG: Retrieval-Augmented Generation—enhancing model responses by retrieving relevant documents from an external source

F1 score: A metric measuring the overlap between the predicted answer and the ground truth, balancing precision and recall

EM: Exact Match—a strict metric requiring the predicted answer to be identical to the ground truth

Conditional Probability: The probability of an event occurring given that another event has occurred; here, the probability of the correct answer given the memory content

Credit Assignment: The problem in RL of determining which past action is responsible for a current reward

Trajectory: The sequence of states and actions taken by an agent from the start of a task to its completion

OOD: Out-of-Distribution—data that is different from what the model saw during training