G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems

📝 Paper Summary

Tree/graph-based memory Multi-agent Self-evolving Agentic reasoning

G-Memory enables multi-agent systems to self-evolve by organizing lengthy interaction histories into a three-tier graph hierarchy (insight, query, interaction) for retrieving both abstract wisdom and procedural details.

Core Problem

Current multi-agent systems (MAS) lack self-evolution capabilities because their memory mechanisms are either overly simplistic (ignoring interaction nuances) or lack cross-trial persistence.

Why it matters:

MAS interactions generate up to 10x more tokens than single agents, overwhelming traditional retrieval contexts
Existing systems like MetaGPT only store final results, discarding the valuable collaborative process that explains *why* a solution worked
Without structured memory, agent teams repeat mistakes and fail to improve coordination strategies over time

Concrete Example: In an embodied task 'put a clean cloth in countertop', standard agents might fail by not cleaning the cloth first. G-Memory retrieves a past trajectory where an agent was corrected for putting a dirty egg in a microwave, successfully guiding the new team to clean the item first.

Key Novelty

Three-Tier Hierarchical Graph Memory for MAS

Organizes memory into three levels: fine-grained utterance logs (Interaction), task metadata and status (Query), and abstract lessons (Insight)
Uses bi-directional traversal: moving 'up' to find general strategies (insights) and 'down' to find specific procedural examples (interactions) based on the current task
Updates continuously: successful or failed executions trigger the generation of new insights and graph connections, allowing the collective intelligence to grow
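As a rough illustration of the three tiers, the sketch below models them as plain Python dataclasses. The class and field names are hypothetical, chosen for this summary rather than taken from the paper or its repository, and the example query text is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class InteractionGraph:
    """Lowest tier: the utterances exchanged while solving one query."""
    utterances: list = field(default_factory=list)  # (agent_name, text) pairs
    edges: list = field(default_factory=list)       # (i, j): utterance i informed j


@dataclass
class QueryNode:
    """Middle tier: one past task, its outcome, and a link down to its trace."""
    text: str
    succeeded: bool
    trace: InteractionGraph


@dataclass
class InsightNode:
    """Top tier: an abstract lesson and the query ids that support it."""
    text: str
    supporting_queries: set = field(default_factory=set)


@dataclass
class HierarchicalMemory:
    queries: dict = field(default_factory=dict)    # query id -> QueryNode
    query_edges: set = field(default_factory=set)  # (id, id) links between related tasks
    insights: list = field(default_factory=list)   # InsightNode objects


# Illustrative content loosely based on the egg/microwave example above.
memory = HierarchicalMemory()
trace = InteractionGraph(
    utterances=[
        ("actor", "put egg in microwave"),
        ("critic", "the egg is dirty, clean it first"),
    ],
    edges=[(0, 1)],
)
memory.queries[0] = QueryNode("heat a clean egg", succeeded=False, trace=trace)
memory.insights.append(InsightNode("clean an object before using it", {0}))
```

Keeping the trace behind the query node, and the query ids behind each insight, is what lets a retriever start at the middle tier and move in either direction.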

Architecture

The three-tier hierarchical memory architecture of G-Memory.

Evaluation Highlights

+20.89% success rate improvement on ALFWorld (embodied action) using MacNet + Qwen-2.5-14b compared to the original framework
+10.12% accuracy gain on HotpotQA (knowledge reasoning) using DyLAN + GPT-4o-mini compared to DyLAN with no memory
Consumes only 1.4M additional tokens for a 10.32% performance gain on PDDL, whereas MetaGPT-M consumes 2.2M tokens for only a 4.07% gain

Breakthrough Assessment

8/10

Significant performance gains across diverse domains (up to ~20%) and a principled structural solution to the 'long context' problem in MAS interactions. Addresses a critical gap in MAS self-evolution.

⚙️ Technical Details

Problem Definition

Setting: Task-solving Multi-Agent System (MAS) represented as a directed graph where agents communicate to solve a query Q

Inputs: New user query Q and the existing hierarchical memory graph G

Outputs: Final solution a(T) and an updated memory structure containing new insights, queries, and interaction logs

Pipeline Flow

Memory Retrieval: Coarse Retrieval (Query Graph) → Bi-directional Traversal (Insight/Interaction Graphs)
Memory Augmentation: Inject specific memory cues into agents
Execution: MAS solves the task
Memory Update: Trace execution → Update all three graph levels
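The four stages can be read as one loop per task. The sketch below wires them together with toy stand-ins: every function here is hypothetical, the "MAS" is a stub that succeeds only when it receives a hint, and the lesson text is hard-coded. It is meant only to show the ordering retrieve, augment, execute, update, and how a failure in one episode can inform the next.

```python
def retrieve(query, memory):
    # Stand-in retrieval: lessons from past queries that share any word.
    words = set(query.lower().split())
    return [m["lesson"] for m in memory if words & set(m["query"].lower().split())]


def augment(query, cues):
    # Inject the retrieved cues into the (single, hypothetical) agent's prompt.
    return {"solver": f"Task: {query}\nHints: {'; '.join(cues) or 'none'}"}


def execute(prompts):
    # Stub MAS: "succeeds" only when at least one hint was injected.
    ok = "Hints: none" not in prompts["solver"]
    trace = [("solver", prompts["solver"])]
    return ("done" if ok else "failed"), trace, ok


def update(memory, query, trace, ok):
    # Stand-in for LLM insight generation: a fixed lesson per outcome.
    lesson = "repeat what worked" if ok else "clean the object before placing it"
    memory.append({"query": query, "trace": trace, "ok": ok, "lesson": lesson})


def run_episode(query, memory):
    cues = retrieve(query, memory)
    prompts = augment(query, cues)
    answer, trace, ok = execute(prompts)
    update(memory, query, trace, ok)
    return answer


memory = []
first = run_episode("put a clean cloth in countertop", memory)
second = run_episode("put a clean egg in microwave", memory)
print(first, second)
```

The first episode runs with an empty memory and fails; the second retrieves the lesson written after that failure.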

System Modules

Coarse-grained Retriever (Retrieval)

Identifies relevant historical queries based on embedding similarity

Model or implementation: MiniLM (all-MiniLM-L6-v2)
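A minimal sketch of similarity-based coarse retrieval follows. To stay dependency-free it swaps the MiniLM encoder for a toy bag-of-words vector, so the scores are not what the real system would produce; the function names and the past queries are hypothetical. The default k=2 matches the retrieval setting reported under Reproducibility.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_queries(new_query, past_queries, k=2):
    """Return the k stored queries most similar to the new one."""
    q = embed(new_query)
    ranked = sorted(past_queries, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


past = [
    "heat an egg and put it in the microwave",
    "put a clean plate in the cabinet",
    "find the capital of France",
    "put a clean mug in the coffee machine",
]
hits = top_k_queries("put a clean cloth in countertop", past)
print(hits)
```

With these toy vectors the two "put a clean ..." tasks rank highest; a learned encoder would rank by meaning rather than shared words.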

Bi-directional Traverser (Retrieval)

Navigates the hierarchy to fetch insights (upward) and interaction details (downward)

Model or implementation: Algorithmic traversal
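One way to picture the traversal: starting from the query ids returned by coarse retrieval, collect every insight linked to one of them (upward) and the stored trace behind each of them (downward). The data layout and names below are assumptions made for illustration, not the paper's implementation.

```python
def traverse(query_ids, insight_links, traces):
    """Collect insights (upward) and interaction traces (downward).

    insight_links: insight text -> set of supporting query ids
    traces:        query id -> list of utterances
    """
    wanted = set(query_ids)
    # Upward: insights supported by at least one retrieved query.
    insights = [text for text, qs in insight_links.items() if qs & wanted]
    # Downward: the raw interaction trace behind each retrieved query.
    interactions = {qid: traces[qid] for qid in query_ids if qid in traces}
    return insights, interactions


insight_links = {
    "clean an object before placing it": {0, 2},
    "cite both hops when answering": {1},
}
traces = {
    0: ["actor: put egg in microwave", "critic: clean it first"],
    1: ["searcher: found page A", "reader: page A links to page B"],
}
insights, interactions = traverse([0], insight_links, traces)
print(insights, list(interactions))
```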

Graph Sparsifier (Retrieval)

Condenses lengthy interaction logs into core subgraphs relevant to the current task

Model or implementation: LLM-based (e.g., GPT-4o-mini or Qwen)
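The sparsifier itself is an LLM call, so the sketch below only shows the surrounding graph bookkeeping: given some relevance judgment, keep the matching utterances and the edges between them. The `keep` predicate is a hypothetical placeholder (here a keyword test) for whatever the LLM decides is relevant.

```python
def sparsify(utterances, edges, keep):
    """Return the sub-graph of utterances for which keep(text) is true.

    utterances: list of (agent, text) pairs
    edges:      list of (i, j) index pairs into utterances
    keep:       callable standing in for the LLM relevance judgment
    """
    kept = [i for i, (_, text) in enumerate(utterances) if keep(text)]
    remap = {old: new for new, old in enumerate(kept)}  # re-index survivors
    core_nodes = [utterances[i] for i in kept]
    core_edges = [(remap[a], remap[b]) for a, b in edges if a in remap and b in remap]
    return core_nodes, core_edges


utterances = [
    ("planner", "let us look around the kitchen"),
    ("actor", "take cloth from the drawer"),
    ("critic", "the cloth is dirty, clean it at the sink"),
    ("actor", "clean cloth at the sink, then put cloth on countertop"),
]
edges = [(0, 1), (1, 2), (2, 3)]
core_nodes, core_edges = sparsify(utterances, edges, keep=lambda text: "cloth" in text)
print(len(core_nodes), core_edges)
```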

Memory Filter (Φ)

Selects and formats memory content specific to each agent's role

Model or implementation: LLM-based function

Memory Updater

Updates the graph hierarchy with new insights and logs after task completion

Model or implementation: LLM-based summarization/extraction
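A sketch of what an update step might look like, under stated assumptions: memory is a plain dict, `summarize` is a hypothetical stand-in for the LLM that writes the insight, and the new query is linked to the past queries that were retrieved for it. How the paper actually attaches insights to queries may differ from this.

```python
def update_memory(memory, query, trace, succeeded, related_ids, summarize):
    """Record one finished task at all three levels and return its query id."""
    qid = len(memory["queries"])
    # Query level: task text and outcome.
    memory["queries"][qid] = {"text": query, "succeeded": succeeded}
    # Interaction level: the execution trace behind this query.
    memory["traces"][qid] = trace
    # Assumed linking rule: connect each retrieved past query to the new one.
    memory["query_edges"].update((rid, qid) for rid in related_ids)
    # Insight level: attach the new query to its lesson (created if unseen).
    lesson = summarize(query, trace, succeeded)
    memory["insights"].setdefault(lesson, set()).add(qid)
    return qid


def toy_summarize(query, trace, succeeded):
    # Hypothetical stand-in for LLM-based insight extraction.
    return "clean the object before placing it"


memory = {"queries": {}, "traces": {}, "query_edges": set(), "insights": {}}
a = update_memory(memory, "heat a clean egg",
                  [("critic", "egg is dirty")], False, [], toy_summarize)
b = update_memory(memory, "put a clean cloth in countertop",
                  [("actor", "clean cloth")], True, [a], toy_summarize)
print(memory["query_edges"], memory["insights"])
```

Both failed and successful runs are stored, consistent with the status field kept at the query level.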

Novel Architectural Elements

Three-tier graph hierarchy (Insight, Query, Interaction) specifically designed for MAS collaborative traces
Bi-directional retrieval mechanism combining abstract insight lookups with concrete trajectory condensation
Agent-specific memory projection (Role-based filtering) to prevent context flooding in multi-agent teams

Modeling

Base Model: Evaluated with GPT-4o-mini, Qwen-2.5-7b, and Qwen-2.5-14b

Compute: Not reported in the paper

Comparison to Prior Work

vs. MemoryBank: G-Memory uses a hierarchical graph to abstract insights rather than just retrieving raw chunks
vs. Generative Agents: G-Memory is tailored for multi-agent collaboration trajectories, not single-agent social simulation
vs. MetaGPT/ChatDev: G-Memory stores the *process* (interactions) and *insights*, not just the final output, enabling procedural learning
vs. MemGPT [not cited in paper]: MemGPT manages OS-level context for single agents; G-Memory manages collective memory for evolving teams

Limitations

Evaluated on a limited set of benchmarks (5 total) primarily focused on reasoning and embodied tasks
Requires an LLM call for graph sparsification and insight generation, adding latency
Hop expansion > 1 degrades performance, suggesting potential noise sensitivity in graph traversal

Reproducibility

Code: https://github.com/bingreeky/GMemory

The paper specifies hyper-parameters for retrieval (k=2) and graph expansion (1-hop), and uses MiniLM for embeddings. LLM backbones are standard (OpenAI API or Ollama).

📊 Experiments & Results

Evaluation Setup

Integration of G-Memory into three existing MAS frameworks (AutoGen, DyLAN, MacNet) across 5 benchmarks.

Benchmarks:

ALFWorld (Embodied action)
SciWorld (Embodied action / Scientific discovery)
PDDL (Strategic game planning)
HotpotQA (Multi-hop knowledge QA)
FEVER (Fact verification)

Metrics:

Success Rate (SR)
Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance gains using G-Memory with Qwen-2.5-14b backbone integrated into MacNet framework.
ALFWorld	Success Rate	58.21	79.10	+20.89
HotpotQA	Accuracy	30.00	37.55	+7.55
Performance gains using G-Memory with GPT-4o-mini backbone integrated into AutoGen.
ALFWorld	Success Rate	77.61	88.81	+11.20
PDDL	Success Rate	23.53	27.77	+4.24
Comparison against specialized memory baselines (MemoryBank) using DyLAN + GPT-4o-mini.
SciWorld	Success Rate	54.74	65.64	+10.90

Experiment Figures

Cost-Benefit analysis plotting Performance (%) vs Token Cost.

Sensitivity analysis on hop expansion (a), top-k queries (b), and component ablation (c).

Main Takeaways

G-Memory consistently outperforms no-memory and single-agent memory baselines (like Voyager, MemoryBank) across all tested MAS frameworks.
Generic memory baselines often degrade MAS performance (e.g., MemoryBank on PDDL) because they lack role-specific filtering and trajectory abstraction.
G-Memory is token-efficient: on PDDL, it achieved a 10.32% gain with 1.4M tokens, while MetaGPT-M used 2.2M tokens for only a 4.07% gain.
Ablation studies show that both High-level Insights and Fine-grained Interactions are necessary; removing interactions causes a larger drop than removing insights.
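The token-efficiency claim can be made concrete by dividing each gain by its token cost. The figures below are the PDDL numbers quoted above; the calculation assumes both token counts measure additional tokens on the same basis, which the summary states explicitly only for G-Memory.

```python
def gain_per_million_tokens(gain_pct, tokens_millions):
    """Performance gain (percentage points) per million tokens spent."""
    return gain_pct / tokens_millions


g_memory = gain_per_million_tokens(10.32, 1.4)
metagpt_m = gain_per_million_tokens(4.07, 2.2)
ratio = g_memory / metagpt_m

print(f"G-Memory:  {g_memory:.2f} points per 1M tokens")
print(f"MetaGPT-M: {metagpt_m:.2f} points per 1M tokens")
print(f"Ratio:     {ratio:.1f}x")
```

Under that assumption, G-Memory yields about 7.37 points per million tokens against 1.85 for MetaGPT-M, roughly a fourfold difference.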

📚 Prerequisite Knowledge

Prerequisites

Understanding of Multi-Agent Systems (MAS) workflows
Graph data structures (nodes, edges)
Retrieval-Augmented Generation (RAG) concepts

Key Terms

MAS: Multi-Agent Systems—systems where multiple AI agents collaborate or compete to solve complex tasks

SOP: Standard Operating Procedures—manually defined workflows that dictate how agents interact (e.g., in MetaGPT)

Interaction Graph: The lowest memory level, storing atomic utterances (dialogue logs) between agents during a task

Query Graph: The middle memory level, storing task metadata, status (success/fail), and links to specific interaction logs

Insight Graph: The highest memory level, storing abstract, generalized lessons distilled from past experiences

bi-directional traversal: The process of moving up the graph hierarchy to find general principles and down to find specific examples simultaneously

graph sparsifier: An LLM-based function that extracts only the essential sub-components of a conversation to reduce token usage

cross-trial memory: Memory that persists across different tasks or episodes, allowing the system to learn from history

inside-trial memory: Memory that exists only within the context of solving a single current task

hop expansion: Retrieving not just the directly similar nodes but also their neighbors in the graph to capture broader context
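Hop expansion can be sketched as a breadth-first search over the query graph that stops at a fixed depth. The adjacency dict and function name are illustrative assumptions; the default of one hop matches the setting reported under Reproducibility.

```python
from collections import deque


def expand(seeds, neighbors, hops=1):
    """Return the seed nodes plus every node within `hops` edges of a seed."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # do not look past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


# A small chain of related queries: 0 - 1 - 2 - 3
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
one_hop = expand([0], graph, hops=1)
two_hop = expand([0], graph, hops=2)
print(one_hop, two_hop)
```

Each extra hop pulls in queries that are less directly related to the seed, which is one plausible reading of why expansion beyond one hop is reported to hurt performance.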