ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

📝 Paper Summary

Agent memory systems
Test-time scaling for agents

ReasoningBank distills generalizable strategies from both successes and failures to guide future agent actions, further enhancing performance through memory-aware test-time scaling that converts diverse exploration into better memory.

Core Problem

LLM agents deployed in continuous roles fail to learn from accumulated history, repeating errors and treating every task in isolation.

Why it matters:

Agents discard valuable insights from related problems, leading to stagnant performance over time rather than self-evolution
Existing memory systems mostly store raw trajectories or successful routines, ignoring critical lessons hidden in failure cases
Current approaches lack a mechanism to scale experience effectively at test time, missing the synergy between computation scaling and memory quality

Concrete Example: In a web shopping task, an agent might repeatedly fail to navigate a specific checkout flow because it does not remember its previous failures. Without ReasoningBank, the agent retries blindly; with ReasoningBank, the agent retrieves a 'preventative lesson' distilled from a past failure and avoids that error path.

Key Novelty

ReasoningBank & Memory-Aware Test-Time Scaling (MaTTS)

Extracts structured memory items (title, description, reasoning content) from both successful and failed trajectories using self-judged outcomes, rather than just storing raw logs
MaTTS (Memory-aware Test-Time Scaling) uses this memory to guide diverse exploration (parallel or sequential) during test time, generating richer contrastive signals that in turn improve the memory bank itself
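
The structured memory item described above can be pictured as a small record. The following is a minimal sketch, assuming only the three fields named in this summary (title, description, reasoning content). The class name `MemoryItem`, the `source` label, and the `render` helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """One distilled strategy (hypothetical schema, for illustration only)."""
    title: str               # short name of the strategy
    description: str         # one-sentence summary
    content: str             # the distilled reasoning or lesson
    source: str = "success"  # assumed label: "success" or "failure"

    def render(self) -> str:
        """Format the item as plain text for inclusion in an agent prompt."""
        return f"[{self.source}] {self.title}: {self.description}\n{self.content}"


# Example item distilled from a failed trajectory (invented text).
item = MemoryItem(
    title="Verify cart before checkout",
    description="Confirm the cart contents before starting checkout.",
    content="A past attempt failed after skipping the cart page; open it first.",
    source="failure",
)
print(item.render())
```

Keeping the item as short text rather than a raw log is what lets several items fit into the agent's context at once.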

Architecture

Overview of ReasoningBank showing the cycle of retrieving memory, executing the task, judging the outcome, and extracting reasoning from both success and failure.

Evaluation Highlights

+8.3 percentage points in success rate on WebArena using Gemini-2.5-flash, compared to memory-free agents
+34.2% relative improvement in success rate on WebArena-Shopping using MaTTS with parallel scaling (k=5) compared to non-scaling baselines
Reduces interaction steps by 1.6 on WebArena and 2.8 on SWE-Bench-Verified, demonstrating improved efficiency alongside effectiveness

Breakthrough Assessment

8/10

Strongly integrates memory with test-time scaling—a timely direction. The focus on learning from failures and distilling reasoning (not just actions) addresses key limitations in current agentic memory systems.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making in a streaming test-time learning setting where queries arrive sequentially without ground truth labels

Inputs: Sequence of task queries Q = {q1, ..., qN} and current observation o_t

Outputs: Action a_{t+1} (web navigation or bash command)

Pipeline Flow

Memory Retrieval: Fetch top-k memory items based on query
Agent Execution: Generate trajectory using retrieved memory as context
Memory Construction: Judge trajectory (Success/Failure) and distill new memory items
Memory Consolidation: Add new items to ReasoningBank
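
The four stages above form a closed loop over the query stream. A minimal sketch of that loop follows, assuming each stage is an arbitrary callable; the function name `run_stream`, the parameter names, and the toy lambdas are all hypothetical stand-ins, and memory items are simplified to plain strings.

```python
from typing import Callable, List


def run_stream(
    queries: List[str],
    retrieve: Callable[[str, List[str]], List[str]],
    act: Callable[[str, List[str]], str],
    judge: Callable[[str, str], bool],
    distill: Callable[[str, str, bool], List[str]],
) -> List[str]:
    """Process queries one at a time, growing the memory bank along the way."""
    bank: List[str] = []
    for query in queries:
        hints = retrieve(query, bank)                      # 1. memory retrieval
        trajectory = act(query, hints)                     # 2. agent execution
        succeeded = judge(query, trajectory)               # 3a. self-judged outcome
        new_items = distill(query, trajectory, succeeded)  # 3b. distillation
        bank.extend(new_items)                             # 4. consolidation
    return bank


# Toy stand-ins so the loop runs without an LLM or an environment.
bank = run_stream(
    queries=["buy socks", "buy shoes"],
    retrieve=lambda q, b: b[-1:],
    act=lambda q, hints: f"{q} with {len(hints)} hint(s)",
    judge=lambda q, t: "1 hint" in t,
    distill=lambda q, t, ok: [("strategy: " if ok else "pitfall: ") + t],
)
print(bank)
```

In this toy run the first query fails and leaves a pitfall in the bank, which the second query then retrieves. No ground-truth label is consulted at any point, matching the streaming setting described above.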

System Modules

Memory Retriever (Memory System)

Identify relevant past experiences to guide current decision making

Model or implementation: Embedding-based similarity search
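
Embedding-based retrieval of this kind is commonly implemented as a cosine-similarity ranking. The sketch below assumes that choice (the summary does not specify the similarity function or the embedding model) and uses hand-written three-dimensional vectors in place of real embeddings; `cosine` and `top_k` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          items: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Return the titles of the k items most similar to the query vector."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [title for title, _ in ranked[:k]]


# Toy memory: (title, embedding) pairs with made-up vectors.
memory = [
    ("Use the site search bar", [1.0, 0.0, 0.0]),
    ("Check pagination for more results", [0.7, 0.7, 0.0]),
    ("Run the test suite before patching", [0.0, 0.0, 1.0]),
]
nearest = top_k([1.0, 0.1, 0.0], memory, k=2)
print(nearest)
```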

Agent Policy (Execution)

Execute task interactions with the environment

Model or implementation: Gemini-2.5-flash or Claude-3.7-Sonnet (ReAct style)

Memory Distiller (Memory System)

Extract structured reasoning/strategies from completed trajectories

Model or implementation: LLM-based extraction (same backbone)

Novel Architectural Elements

Closed-loop integration of memory extraction from *both* success and failure into the test-time scaling process (MaTTS)
Dual-mode scaling: Parallel scaling with self-contrast and Sequential scaling with self-refinement, both feeding the memory bank
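
The two scaling modes differ mainly in how the k attempts relate to each other. The following sketch shows that control flow only, under the assumption that rollout, refinement, and distillation are supplied as callables; all function and parameter names are hypothetical, and the toy lambdas stand in for LLM calls.

```python
from collections import Counter
from typing import Callable, List


def parallel_scaling(query: str, k: int,
                     rollout: Callable[[str, int], str],
                     contrast: Callable[[List[str]], List[str]]) -> List[str]:
    """Sample k independent trajectories, then distill memory by comparing them."""
    trajectories = [rollout(query, seed) for seed in range(k)]
    return contrast(trajectories)


def sequential_scaling(query: str, k: int,
                       rollout: Callable[[str, int], str],
                       refine: Callable[[str], str],
                       distill: Callable[[List[str]], List[str]]) -> List[str]:
    """Run once, refine the same trajectory k-1 times, distill from all versions."""
    versions = [rollout(query, 0)]
    for _ in range(k - 1):
        versions.append(refine(versions[-1]))
    return distill(versions)


# Toy stand-ins: "self-contrast" keeps the most frequent trajectory,
# "self-refinement" appends a fix marker.
toy_rollout = lambda q, seed: f"{q}#path{seed % 2}"
par = parallel_scaling("q", 3, toy_rollout,
                       contrast=lambda ts: [Counter(ts).most_common(1)[0][0]])
seq = sequential_scaling("q", 3, toy_rollout,
                         refine=lambda t: t + "+fix",
                         distill=lambda vs: [vs[-1]])
print(par, seq)
```

In both modes the returned items would be added to the memory bank, which is what ties scaling back to memory quality.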

Modeling

Base Model: Gemini-2.5-flash and Claude-3.7-Sonnet

📊 Experiments & Results

Evaluation Setup

Streaming test-time learning on web navigation and coding tasks

Benchmarks:

WebArena (General web navigation)
Mind2Web (Web navigation generalization)
SWE-Bench-Verified (Repository-level software engineering)

Metrics:

Success Rate (SR)
Average Interaction Steps (Efficiency)
Statistical methodology: Not explicitly reported in the paper

Key Results

ReasoningBank improves success rates across different LLM backbones on WebArena compared to memory-free and baseline memory methods.
Benchmark	Metric	Baseline	This Paper	Δ
WebArena	Success Rate	39.8	48.1	+8.3
WebArena	Success Rate	46.2	48.1	+1.9
SWE-Bench-Verified	Success Rate	30.8	34.6	+3.8

Efficiency gains: ReasoningBank reduces the number of steps required to complete tasks.
Benchmark	Metric	Baseline	This Paper	Δ
WebArena	Avg Steps	11.5	9.9	-1.6

MaTTS scaling experiments show that memory enhances the effectiveness of test-time scaling.
Benchmark	Metric	Baseline	This Paper	Δ
WebArena-Shopping	Success Rate (BoN)	40.6	55.1	+14.5
WebArena-Shopping	Success Rate	52.4	55.1	+2.7
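
The Δ column reports absolute differences (percentage points for success rate, steps for efficiency), which is not the same as a relative improvement. As a small worked check using the WebArena rows, with helper names that are purely illustrative:

```python
def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in percentage points (or steps), rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Change relative to the baseline, in percent, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


# WebArena success rate, first row of the table: 39.8 -> 48.1
sr_abs = absolute_gain(39.8, 48.1)
sr_rel = relative_gain(39.8, 48.1)
# WebArena average steps: 11.5 -> 9.9
steps_abs = absolute_gain(11.5, 9.9)
print(sr_abs, sr_rel, steps_abs)
```

The same 8.3-point gain corresponds to roughly a 20.9% relative improvement over that baseline, so it matters which of the two a headline number refers to.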

Experiment Figures

Impact of scaling factor k on Success Rate for Parallel and Sequential scaling strategies using MaTTS vs baselines.

Comparison of different memory backbones (No Memory, Synapse, AWM, ReasoningBank) under Test-Time Scaling (k=3).

Main Takeaways

ReasoningBank consistently outperforms baselines (No Memory, Synapse, AWM) across WebArena, Mind2Web, and SWE-Bench-Verified, validating the benefit of distilling reasoning over storing raw trajectories.
Learning from failure is effective: extracting preventative lessons contributes to better generalization than success-only memory methods.
Synergy between Memory and Scaling: Better memory guides scaling to higher success (Best-of-N +2.7 percentage points over vanilla scaling), and scaling generates richer experiences that improve memory quality.
Parallel scaling eventually outperforms sequential scaling as compute budget (k) increases, likely due to greater diversity in exploration.

📚 Prerequisite Knowledge

Prerequisites

LLM Agents (ReAct framework)
Test-Time Scaling (TTS)
Retrieval-Augmented Generation (RAG)

Key Terms

MaTTS: Memory-aware Test-Time Scaling—scaling agent exploration (parallel or sequential) while leveraging and updating a memory bank of reasoning strategies

ReasoningBank: A structured memory module that stores distilled reasoning strategies and pitfalls from both successful and failed agent trajectories

LLM-as-a-judge: Using an LLM to evaluate the correctness of an agent's trajectory without ground truth labels

Best-of-N: A scaling strategy where N solutions are generated and the best one is selected (used here as a metric for parallel scaling performance)

Self-refinement: An iterative process where the model critiques and improves its own output within a single trajectory

Self-contrast: Comparing multiple trajectories for the same query to identify consistent patterns and filter out spurious solutions
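
Of the terms above, Best-of-N is the easiest to make concrete. The sketch below assumes a scoring function is available (in practice this might be an LLM judge or a verifier); the `best_of_n` helper and the step-count score are hypothetical and chosen only so the example runs on its own.

```python
from typing import Callable, List, Tuple


def best_of_n(candidates: List[str],
              score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the highest-scoring candidate together with its score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best = max(candidates, key=score)
    return best, score(best)


# Toy trajectories; the toy score prefers fewer steps.
trajectories = [
    "search -> open -> buy",
    "search -> buy",
    "home -> search -> open -> buy",
]
choice, value = best_of_n(trajectories, score=lambda t: -len(t.split(" -> ")))
print(choice, value)
```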