Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

📝 Paper Summary

Memory organization, Self-evolving, Agentic reasoning

Evo-Memory introduces a benchmark for streaming agent tasks and proposes ReMem, an agent that continually refines its memory via a dedicated reasoning action to reuse experience across tasks.

Core Problem

Current LLM memory systems are static: they recall past dialogue rather than learn strategies from experience, so agents repeat the same mistakes across sequential tasks.

Why it matters:

Agents in real-world environments (e.g., coding assistants) face continuous task streams but often fail to adapt strategies, solving similar problems from scratch every time
Existing benchmarks measure factual retention (conversational recall) but ignore whether the agent actually learns to solve problems better (experience reuse)
Standard RAG (Retrieval-Augmented Generation) retrieves context passively but lacks a mechanism to evolve or abstract reasoning strategies over time

Concrete Example: A long-term coding assistant might recall a user's previous library preference (context recall) but fail to remember the specific debugging strategy that fixed a recurring error in that library, forcing it to re-derive the solution in every session.

Key Novelty

ReMem: An Action-Think-Refine Agent Loop

Extends the standard ReAct (Reason + Act) loop by adding a 'Refine' action, allowing the agent to explicitly reason about and update its own memory state
Treats memory management not as a hard-coded background process but as a decision-making step within the agent's action space
Introduces a streaming evaluation framework (Evo-Memory) that restructures static datasets into sequential streams to measure improvement over time

Architecture

Conceptual framework of Evo-Memory and the ReMem agent architecture.

Evaluation Highlights

ReMem achieves 0.92 success rate on the BabyAI multi-turn navigation benchmark using Gemini-2.5 Flash, demonstrating strong procedural learning
In single-turn reasoning (AIME, GPQA), ReMem reaches 0.65 average exact match with Gemini-2.5 Flash, consistently improving over static baselines
Performance gains show strong correlation (Pearson r=0.717) with task similarity, confirming that the method effectively exploits structural similarities between tasks
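
The correlation above relates per-dataset gains to task similarity. As a reminder of what the statistic measures, here is a minimal sketch of a Pearson r computation. The helper name `pearson_r` and all data values are illustrative placeholders, not numbers or code from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder data (NOT from the paper): one point per dataset.
task_similarity = [0.20, 0.35, 0.50, 0.65, 0.80]
accuracy_gain = [0.01, 0.04, 0.03, 0.08, 0.10]

r = pearson_r(task_similarity, accuracy_gain)
print(f"Pearson r = {r:.3f}")
```

A value near 1 would mean that datasets with more similar tasks tend to show larger gains; a value near 0 would mean no linear relationship.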

Breakthrough Assessment

8/10

Addresses a critical gap in agentic memory (experience reuse vs. simple recall) with a novel unified framework. The shift from passive RAG to active memory refinement is a significant methodological step.

⚙️ Technical Details

Problem Definition

Setting: Streaming task sequence where an agent processes inputs x_t to produce outputs y_t while maintaining and evolving a memory state M_t

Inputs: Current task input x_t and evolving memory M_t

Outputs: Prediction y_t and updated memory M_{t+1}
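
The setting above can be read as a loop that threads memory through a task stream. The following is a minimal sketch under assumptions: memory is represented as a plain list of strings, and `run_stream` and `toy_step` are hypothetical names introduced for illustration, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Memory = List[str]  # assumed representation: a list of text entries


def run_stream(
    inputs: Iterable[str],
    step: Callable[[str, Memory], Tuple[str, Memory]],
    initial_memory: Optional[Memory] = None,
) -> Tuple[List[str], Memory]:
    """Process x_1..x_T in order, passing the evolving memory M_t to each step."""
    memory: Memory = list(initial_memory or [])
    outputs: List[str] = []
    for x_t in inputs:
        y_t, memory = step(x_t, memory)  # step returns (y_t, M_{t+1})
        outputs.append(y_t)
    return outputs, memory


def toy_step(x_t: str, memory: Memory) -> Tuple[str, Memory]:
    """Toy stand-in for an agent: reports whether the task was seen before."""
    y_t = "reused" if x_t in memory else "solved from scratch"
    return y_t, memory + [x_t]


outputs, final_memory = run_stream(["a", "b", "a"], toy_step)
print(outputs)  # ['solved from scratch', 'solved from scratch', 'reused']
```

The point of the sketch is the data flow: each step consumes M_t and returns M_{t+1}, so later tasks can benefit from earlier ones.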

Pipeline Flow

Input Processing & Retrieval
Thought Generation (Think)
Action Selection (Act or Refine)
Execution & Memory Update
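
The flow above can be sketched as a loop in which thinking and refining are internal steps and only 'Act' touches the environment. This is a hedged illustration, not the authors' code: `Step`, `AgentState`, `run_until_act`, and `scripted_policy` are hypothetical names, retrieval is assumed to have already populated `memory`, and the policy is a scripted stand-in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Step:
    kind: str                               # "think", "refine", or "act"
    text: str = ""                          # thought text or environment action
    new_memory: Optional[List[str]] = None  # replacement memory, used by "refine"


@dataclass
class AgentState:
    task: str
    memory: List[str]
    thoughts: List[str] = field(default_factory=list)


Policy = Callable[[AgentState], Step]


def run_until_act(state: AgentState, policy: Policy, max_iters: int = 8) -> str:
    """Apply think/refine steps in place and return the first environment action."""
    for _ in range(max_iters):
        step = policy(state)
        if step.kind == "think":
            state.thoughts.append(step.text)
        elif step.kind == "refine":
            if step.new_memory is not None:
                state.memory = list(step.new_memory)
        elif step.kind == "act":
            return step.text
        else:
            raise ValueError(f"unknown step kind: {step.kind!r}")
    raise RuntimeError("no environment action produced within max_iters")


def scripted_policy(steps: Iterable[Step]) -> Policy:
    """Toy policy that replays a fixed script; a real agent would call an LLM here."""
    iterator = iter(steps)
    return lambda state: next(iterator)


state = AgentState(
    task="open the red door",
    memory=["unrelated note", "keys open doors of the same colour"],
)
policy = scripted_policy([
    Step("think", "One memory entry is irrelevant to this task."),
    Step("refine", new_memory=["keys open doors of the same colour"]),
    Step("act", "pickup red key"),
])
action = run_until_act(state, policy)
print(action)        # pickup red key
print(state.memory)  # ['keys open doors of the same colour']
```

Modelling the refine step as a wholesale memory replacement keeps the sketch short; a real system might instead apply targeted add, merge, or delete operations.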

System Modules

Retriever

Retrieve relevant memory entries based on current input

Model or implementation: Not explicitly specified (generic retriever)

Agent Core (Think)

Analyze task and retrieved memory to formulate a plan

Model or implementation: LLM (Gemini-2.5 or Claude-3.5/3.7)

Action Selector

Choose between performing an environment action ('Act') or updating memory ('Refine')

Model or implementation: LLM (Gemini-2.5 or Claude-3.5/3.7)

Memory Refiner

Meta-reasoning to prune noise, reorganize M_t, and abstract strategies

Model or implementation: LLM (Gemini-2.5 or Claude-3.5/3.7)

Novel Architectural Elements

Inclusion of 'Refine' (memory management) as a discrete action within the agent's MDP action space, competing with 'Act' and 'Think'
Unified 'Search-Predict-Evolve' loop applied to both single-turn and multi-turn tasks

Modeling

Base Model: Gemini-2.5 (Flash, Flash-Lite, Pro) and Claude (3.5-Haiku, 3.7-Sonnet)

Compute: Inference-only evaluation. Specific compute resources not reported.

Comparison to Prior Work

vs. StreamBench: Evo-Memory focuses on 'experience reuse' (strategy) rather than just factual retention
vs. Mem0/SelfRAG: ReMem actively reasons about memory updates ('Refine' action) rather than using passive heuristic updates
vs. AWM: ReMem integrates memory refinement into the online decision loop rather than as an offline induction step

Limitations

Dependency on the underlying LLM's ability to meta-reason about its own memory (requires strong backbones)
Performance gains are highly correlated with task similarity; gains diminish in low-similarity task streams
Specific baseline performance numbers for comparison are visualized in charts but not fully tabulated in the text provided

Reproducibility

The authors state they 'will release all code' and configurations, but the text provides no URL. Evaluation uses public datasets (MMLU-Pro, AlfWorld, BabyAI).

📊 Experiments & Results

Evaluation Setup

Streaming evaluation where agents tackle sequences of tasks and must update memory online
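
One simple way to see improvement over a stream is a running success rate. The sketch below is an assumption about how such a curve could be computed, not a metric definition taken from the paper; `cumulative_success_rate` is a hypothetical helper and the outcomes are placeholder data.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_success_rate(outcomes: Sequence[bool]) -> List[float]:
    """Success rate over the first t tasks, for every t in 1..T."""
    running_totals = accumulate(int(o) for o in outcomes)
    return [total / (i + 1) for i, total in enumerate(running_totals)]


# Placeholder outcomes (NOT from the paper): early failures, later successes.
outcomes = [False, False, True, True, True, True]
curve = cumulative_success_rate(outcomes)
print([round(v, 2) for v in curve])  # [0.0, 0.0, 0.33, 0.5, 0.6, 0.67]
```

A curve that rises along the stream is consistent with the agent reusing earlier experience, although it can also reflect easier tasks appearing later.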

Benchmarks:

MMLU-Pro / GPQA-Diamond (Single-turn reasoning & QA)
AIME-24 / AIME-25 (Mathematical problem solving)
AlfWorld / BabyAI / ScienceWorld (Multi-turn goal-oriented embodied tasks)

Metrics:

Answer accuracy (Exact Match)
Success rate (Goal completion)
Step efficiency (Number of steps to goal)
Sequence robustness
Statistical methodology: Not explicitly reported in the paper
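
The first three metrics above are simple aggregates. A minimal sketch follows, assuming exact match compares strings after trimming whitespace and ignoring case (the paper's normalisation is not specified here) and using hypothetical helper names and placeholder data.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after light normalisation."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need non-empty sequences of equal length")
    hits = sum(
        p.strip().casefold() == r.strip().casefold()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the goal was completed."""
    if not successes:
        raise ValueError("need at least one episode")
    return sum(successes) / len(successes)


def mean_steps(step_counts: Sequence[int]) -> float:
    """Average number of environment steps per episode (lower is more efficient)."""
    if not step_counts:
        raise ValueError("need at least one episode")
    return sum(step_counts) / len(step_counts)


# Placeholder data (NOT from the paper).
print(exact_match(["B", " 42 ", "c"], ["b", "42", "d"]))
print(success_rate([True, False, True, True]))  # 0.75
print(mean_steps([10, 20]))                     # 15.0
```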

Experiment Figures

Correlation plot between ReMem's performance improvement and within-dataset task similarity.

Comparison of step efficiency across four environments.

Main Takeaways

ReMem consistently outperforms baselines in multi-turn settings (BabyAI, ScienceWorld), suggesting that active memory refinement is crucial for long-horizon procedural tasks
Simple retrieval baselines (ExpRAG) are surprisingly effective compared to complex memory architectures, but ReMem provides further gains through iterative refinement
Performance improvements are strongly linked to task similarity; tasks with recurring structures (like PDDL or AlfWorld) see larger benefits from evolving memory than diverse tasks (like GPQA)
Smaller models (e.g., Gemini Flash) show significant benefits from self-evolving memory, indicating it is a viable path for enhancing lighter LLMs

📚 Prerequisite Knowledge

Prerequisites

Understanding of agentic workflows (ReAct)
Retrieval-Augmented Generation (RAG)
Markov Decision Processes (MDP)

Key Terms

Test-time learning: The ability of a model to learn and adapt its behavior during the inference phase (deployment) without updating its permanent weights

ReAct: Reason+Act, a paradigm in which LLMs interleave reasoning traces with task actions

Experience Reuse: The ability to abstract and apply successful strategies from past tasks to new, similar problems, distinct from simply recalling facts

RAG: Retrieval-Augmented Generation—fetching relevant data from external storage to ground LLM generation

MDP: Markov Decision Process—a mathematical framework for modeling decision making where outcomes are partly random and partly under the control of a decision maker

ReMem: The authors' proposed agent framework that integrates reasoning, acting, and memory refinement into a single decision loop

ExpRAG: The authors' baseline method that retrieves and aggregates past experiences using in-context learning
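
To make the ExpRAG idea concrete, here is a minimal sketch of retrieving past experiences and placing them in the prompt as in-context examples. It rests on several assumptions: token-overlap (Jaccard) similarity stands in for whatever retriever the authors use, and `Experience`, `token_overlap`, `retrieve`, and `build_prompt` are hypothetical names with placeholder data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Experience:
    task: str
    solution: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase whitespace tokens (stand-in for an embedding score)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve(query: str, store: List[Experience], k: int = 2) -> List[Experience]:
    """Return the k stored experiences whose task text is most similar to the query."""
    ranked = sorted(store, key=lambda e: token_overlap(query, e.task), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: List[Experience]) -> str:
    """Format retrieved experiences as worked examples followed by the new task."""
    blocks = [f"Task: {e.task}\nSolution: {e.solution}" for e in examples]
    blocks.append(f"Task: {query}\nSolution:")
    return "\n\n".join(blocks)


# Placeholder experience store (NOT from the paper).
store = [
    Experience("go to the red door", "turn left, move forward, open door"),
    Experience("pick up the blue key", "move forward, pickup key"),
    Experience("go to the green door", "turn right, move forward, open door"),
]
query = "go to the blue door"
retrieved = retrieve(query, store, k=2)
prompt = build_prompt(query, retrieved)
print(prompt)
```

After the model answers, appending the new (task, solution) pair to `store` would complete one pass of a search, predict, evolve cycle.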