UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

📝 Paper Summary

Self-evolving, Agentic reasoning, Linear memory

UMEM improves agent self-evolution by jointly training a memory optimizer to extract and manage insights using a neighborhood-based reward that ensures memories generalize to semantically similar tasks.

Core Problem

Existing self-evolving agents optimize memory management (storage/retrieval) but treat memory extraction as a static process, leading to the accumulation of instance-specific noise that fails to generalize.

Why it matters:

Accumulation of instance-specific noise causes progressive memory pollution, degrading performance over long interactions
Management misalignment occurs when extracted memories do not fit the management policy, rendering even optimal retrieval strategies ineffective
Prior works like MemRL and EvolveR fail to model cross-task generalization, resulting in agents that memorize shortcuts rather than principles

Concrete Example: In a reasoning task, a static extractor might memorize specific numbers from a math problem (a shortcut). When a similar problem with different numbers appears, the agent retrieves this specific, irrelevant detail, leading to failure, whereas UMEM extracts the underlying formula.

Key Novelty

Unified Memory Extraction and Management (UMEM)

Jointly optimizes the creation (extraction) and storage (management) of memories using a trainable 'Mem-Optimizer' rather than static prompts
Introduces Semantic Neighborhood Modeling: evaluates memory quality not just on the current query, but on a cluster of similar queries to enforce generalization
Uses a Marginal Utility Reward that calculates the net benefit (success gain + efficiency) of a memory update across the semantic neighborhood

Architecture

The UMEM framework workflow, illustrating the interaction between the frozen Agent Executor and the trainable Mem-Optimizer.

Evaluation Highlights

Achieves 82.84% Success Rate on ALFWorld using UMEM-Qwen3-4B with GPT-5.1 as executor, outperforming baselines
Ablation confirms the value of Semantic Neighborhood Modeling: removing it lowers AIME performance by 10.0 points (51.67 to 41.67) with GPT-5.1 as executor
Maintains monotonic performance growth during continuous evolution, avoiding the degradation seen in baselines like ReMem which optimize management in isolation

Breakthrough Assessment

8/10

Significant methodological shift from static to optimized memory extraction. Strong empirical gains on embodied and math-reasoning benchmarks (ALFWorld, AIME) and validation of the 'joint optimization' hypothesis.

⚙️ Technical Details

Problem Definition

Setting: Self-evolving agent system with frozen executor parameters and evolvable non-parametric memory bank

Inputs: Query q and current Memory Bank B_t

Outputs: Answer y_hat and updated Memory Bank B_{t+1}

Pipeline Flow

Execution Group: Retrieval → Frozen Executor → Trajectory Generation
Optimization Group: Mem-Optimizer → Memory Action → Marginal Utility Calculation → GRPO Update
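
The two groups above can be read as a single evolution step. The following is a minimal sketch under stated assumptions, not the authors' implementation: `retrieve`, `executor`, and `mem_optimizer` are hypothetical callables, and the `("add", key, insight)` / `("skip",)` action format is invented for illustration because the summary does not specify the action space. The marginal utility calculation and GRPO update are left out.

```python
from typing import Callable, Dict, List, Tuple

Bank = Dict[str, str]  # query -> distilled insight (assumed representation)


def evolve_step(
    query: str,
    bank: Bank,
    retrieve: Callable[[str, Bank], List[str]],
    executor: Callable[[str, List[str]], Tuple[str, str]],
    mem_optimizer: Callable[[str, str, Bank], tuple],
) -> Tuple[str, Bank]:
    """Run one query through the execution and optimization groups."""
    # Execution group: retrieval -> frozen executor -> trajectory
    memories = retrieve(query, bank)
    answer, trajectory = executor(query, memories)

    # Optimization group: the Mem-Optimizer proposes a memory action
    action = mem_optimizer(query, trajectory, bank)

    # Copy so the input bank B_t is untouched and B_{t+1} is returned
    new_bank = dict(bank)
    if action[0] == "add":  # hypothetical action name
        _, key, insight = action
        new_bank[key] = insight
    return answer, new_bank
```

Returning a fresh bank rather than mutating the input keeps B_t available for comparing executions with and without the update, which the marginal utility reward needs.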

System Modules

Memory Bank

Stores key-value pairs where keys are queries and values are distilled insights/memories

Model or implementation: Non-parametric External Memory
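
A key-value memory bank of this kind might look like the sketch below. It is illustrative only: the class name `MemoryBank` and its methods are hypothetical, and string similarity from Python's `difflib` stands in for the embedding-based similarity a real system would use.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


@dataclass
class MemoryBank:
    """Key-value store: query text -> distilled insight."""
    entries: Dict[str, str] = field(default_factory=dict)

    def write(self, query: str, insight: str) -> None:
        # Adding and overwriting share one code path in this sketch.
        self.entries[query] = insight

    def retrieve(self, query: str, k: int = 2) -> List[Tuple[str, str]]:
        """Return the k (key, insight) pairs whose keys best match `query`."""
        # Assumption: character-level similarity replaces encoder similarity.
        scored = sorted(
            self.entries.items(),
            key=lambda kv: SequenceMatcher(None, query, kv[0]).ratio(),
            reverse=True,
        )
        return scored[:k]
```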

Agent Executor

Performs the actual task (reasoning/interaction) conditioned on retrieved memories

Model or implementation: Frozen LLM (e.g., Qwen3-8B, GPT-5.1)

Mem-Optimizer

Extracts insights from execution trajectories and generates memory management actions

Model or implementation: Trainable LLM (Llama-3.2-1B or Qwen3-4B)

Novel Architectural Elements

Separation of frozen Executor (inference engine) and trainable Mem-Optimizer (evolution engine)
Integration of Semantic Neighborhood search into the reward calculation loop (not just for retrieval)

Modeling

Base Model: Llama-3.2-1B-Instruct and Qwen3-4B-Instruct (Policy Models)

Training Method: Group Relative Policy Optimization (GRPO)
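
The group-relative step at the heart of GRPO can be sketched as follows. This covers only the advantage computation; the clipped policy-ratio loss and KL penalty of the full algorithm are omitted. The function name and the use of population standard deviation are choices made for this sketch, and implementations differ on the latter.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is also used in practice
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient signal.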

Objective Functions:

Purpose: Maximize generalizable utility of memories.

Formally: Maximize joint objective r_final = r_fmt + r_g (format reward + marginal utility reward)

Purpose: Measure memory utility.

Formally: Sum of Success Gain (did it fix errors?) and Efficiency Regularization (did it reduce tokens while keeping correctness?) averaged over the semantic neighborhood
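
One way to read the marginal utility reward is sketched here. The summary gives no exact formula, so several parts are assumptions made for illustration: the `Outcome` record, the relative token-savings form of the efficiency term, and its weight `lam`. Only the overall structure (success gain plus an efficiency term, averaged over the neighborhood, then added to the format reward) comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    correct: bool  # did the executor solve the query?
    tokens: int    # tokens in the executor's trajectory


def marginal_utility(before: List[Outcome], after: List[Outcome], lam: float = 0.1) -> float:
    """Average (success gain + efficiency term) over a semantic neighborhood."""
    if len(before) != len(after) or not before:
        raise ValueError("need matching, non-empty outcome lists")
    total = 0.0
    for b, a in zip(before, after):
        # Success gain: +1 if the update fixed an error, -1 if it broke a success
        gain = float(a.correct) - float(b.correct)
        # Efficiency counts only when correctness is preserved (assumed form)
        saved = 0.0
        if b.correct and a.correct and b.tokens > 0:
            saved = (b.tokens - a.tokens) / b.tokens
        total += gain + lam * saved
    return total / len(before)


def final_reward(r_fmt: float, r_g: float) -> float:
    """r_final = r_fmt + r_g, as stated in the objective."""
    return r_fmt + r_g
```

Here `before` and `after` hold the executor's outcomes on each neighbor query without and with the candidate memory update, so a memory that helps one query but breaks its neighbors scores poorly.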

Adaptation: Full fine-tuning of the Mem-Optimizer policy

Trainable Parameters: Parameters of the Mem-Optimizer (phi)

Training Data:

Derived from MMLU dataset
~2,000 queries sampled
Semantic neighborhoods constructed via Top-N (N=3) retrieval from training set
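
Top-N neighborhood construction can be illustrated with cosine similarity over precomputed query embeddings. This is a sketch under assumptions: the embeddings are taken as given (the encoder is not modeled), the function names are hypothetical, and excluding the anchor query from its own neighborhood is a choice made here, since the summary does not say whether it is included.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_neighborhood(
    query_id: str,
    embeddings: Dict[str, Sequence[float]],
    n: int = 3,
) -> List[str]:
    """Return the ids of the n training queries closest to `query_id`."""
    anchor = embeddings[query_id]
    # Assumption: the anchor query is not its own neighbor
    others = [qid for qid in embeddings if qid != query_id]
    others.sort(key=lambda qid: cosine(anchor, embeddings[qid]), reverse=True)
    return others[:n]
```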

Key Hyperparameters:

semantic_neighborhood_size_N: 3
top_k_retrieval_K: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. ReMem: ReMem optimizes management only; UMEM jointly optimizes extraction and management, preventing the 'garbage in' problem
vs. Memp: Memp uses hand-crafted/static rules for extraction; UMEM learns the extraction policy via RL
vs. Generative Agents [not cited in paper]: Generative Agents use heuristic reflection for memory synthesis; UMEM uses GRPO-optimized extraction based on utility rewards

Limitations

Depends on the quality of the frozen executor; weaker executors provide lower-quality trajectories for distillation
Semantic Neighborhood Modeling increases computational cost during training due to multiple evaluations per step
Requires a predefined semantic encoder (e.g., BGE-M3) for neighborhood construction

Reproducibility

Code and models will be publicly released (not yet available). Training data is derived from public MMLU. The experiments rely on models described here as future or hypothetical (Qwen3, GPT-5.1), so reproduction depends on specific future releases or internal versions.

📊 Experiments & Results

Evaluation Setup

Streaming protocol (zero-reset) where agents must evolve memory continuously across a sequence of tasks

Benchmarks:

AIME (Math reasoning)
GPQA-Diamond (Scientific reasoning)
HLE (Multidisciplinary complex reasoning)
HotpotQA (Multi-hop QA)
ALFWorld (Embodied interaction/planning)

Metrics:

Exact Match (EM)
Cumulative Success Rate (CSR)
Progress Rate
Average Steps
Statistical methodology: Not explicitly reported in the paper
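
For the Cumulative Success Rate listed above, a curve over a streaming task sequence could be computed as in this sketch. It assumes CSR is the running fraction of tasks solved so far; the paper's exact definition is not given in this summary, and the function name is hypothetical.

```python
from typing import Iterable, List


def cumulative_success_rate(outcomes: Iterable[bool]) -> List[float]:
    """Running success fraction after each task in a zero-reset stream."""
    curve: List[float] = []
    successes = 0
    for t, ok in enumerate(outcomes, start=1):
        successes += int(ok)
        curve.append(successes / t)
    return curve
```

Because memory is never reset between tasks, a rising curve indicates that accumulated memories are helping on later tasks, while a falling one is consistent with memory pollution.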

Key Results

Ablation studies demonstrate the critical importance of Semantic Neighborhood Modeling for generalization.

Benchmark	Metric	Baseline	This Paper	Δ
AIME	Exact Match	41.67	51.67	+10.00
AIME	Exact Match	55.00	58.33	+3.33

Component analysis shows that optimizing memory extraction is more impactful than optimizing management alone.

Benchmark	Metric	Baseline	This Paper	Δ
Average across tasks	Performance Score	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main performance results on embodied tasks.

Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	Not reported in the paper	82.84	Not reported in the paper

Experiment Figures

Cumulative accuracy/success rate curves over a streaming task setup.

Long-horizon continual interaction on ALFWorld (10 epochs).

Main Takeaways

Joint optimization of memory extraction and management is superior to optimizing management alone (as in ReMem), preventing the accumulation of noise.
Semantic Neighborhood Modeling is essential; without it (N=1), agents overfit to specific instances, causing significant performance drops on reasoning tasks.
UMEM enables 'monotonic growth' in continuous learning settings, avoiding the degradation often seen in self-evolving agents over long horizons.
Performance gains scale with the strength of the frozen executor (GPT-5.1 benefits more than Qwen3-8B), suggesting better reasoning traces lead to better distilled memories.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO/GRPO concepts)
Retrieval-Augmented Generation (RAG)
Agentic workflows (planning, execution, reflection)

Key Terms

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that scores each sampled output relative to the other outputs in its group by normalizing rewards within the group, which stabilizes training without a learned value model

Semantic Neighborhood Modeling: Constructing a cluster of semantically similar queries to evaluate whether a memory update generalizes beyond the specific instance that generated it

Marginal Utility Reward: A reward function measuring the incremental benefit (success or efficiency) a memory update provides compared to a reference execution without that update

Mem-Optimizer: The trainable module in UMEM responsible for extracting insights from trajectories and deciding how to update the memory bank

Self-evolving Agent: An AI agent that improves its performance over time by updating its external memory or parameters based on experience

Online Memory Evolution: Updating the memory bank dynamically during the training process with the best-rated rollouts, forcing the agent to adapt to a changing memory state

CSR: Cumulative Success Rate, a metric tracking how many tasks have succeeded so far over a sequence of interactions, typically read as a running fraction of the tasks attempted

Instance-Specific Noise: Details in a memory that are unique to one specific example and do not help (or even hurt) when applied to similar but different problems