MIRIX: Multi-Agent Memory System for LLM-Based Agents

📝 Paper Summary

Memory organization, Memory recall, Multi-modal memory

MIRIX is a multi-agent memory system that organizes user data into six distinct structured components (like episodic and semantic memory) to enable long-term, multimodal recall from high-resolution screen activity.

Core Problem

Existing LLM memory systems rely on flat, text-centric storage that fails to handle the scale, structure, and multimodal nature of real-world user data over time.

Why it matters:

Current assistants remain effectively stateless beyond the prompt window, preventing true personalization or evolution with the user.
Storing raw multimodal inputs (like constant screenshots) is prohibitively expensive without effective abstraction layers.
Flat vector databases lack the structural organization needed to distinguish between procedural instructions, specific events, and general facts.

Concrete Example: A standard RAG system stores all historical data in a single flat store. When asked about a specific visual event from weeks ago among 20,000 screenshots, it struggles to retrieve the correct image due to lack of context, whereas MIRIX routes this to specific memory types (Episodic/Resource) for accurate retrieval.

Key Novelty

Six-Component Multi-Agent Memory Architecture

Divides memory into six specialized structures (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) rather than a single vector store, mimicking human cognitive organization.
Assigns a dedicated 'Memory Manager' agent to each memory type, coordinated by a Meta Memory Manager, to handle the complexity of routing and updating diverse information.
Introduces a screenshot-based memory pipeline that continuously abstracts visual activity into structured text and low-redundancy logs, enabling recall over months of usage.

Architecture

The overall MIRIX architecture showing the six memory components and the multi-agent management system.

Evaluation Highlights

Achieves 35% higher accuracy than RAG baselines on the new ScreenshotVQA benchmark while reducing storage requirements by 99.9%.
Attains 85.38% accuracy on the LOCOMO long-context benchmark, outperforming the best existing method (77.38%) by 8.0 percentage points.
Outperforms long-context baselines on ScreenshotVQA by 410% while using 93.3% less storage.

Breakthrough Assessment

8/10

Strong structural innovation in memory design (six distinct types managed by agents), combined with massive efficiency gains in multimodal storage (99.9% reduction) and improved accuracy.

⚙️ Technical Details

Problem Definition

Setting: Long-term, open-domain question answering and personalization based on massive streams of multimodal user history (screenshots and text)

Inputs: Continuous stream of user screen activity (screenshots), text interactions, and specific user queries

Outputs: Answers to queries based on long-term history, or execution of tasks using retrieved procedural knowledge

Pipeline Flow

Input Processing (Screen Capture & Stream)
Memory Routing (Meta Memory Manager)
Memory Update (Specialized Managers)
Retrieval & Generation (Chat Agent)
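
The four stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the keyword-based `route` function merely stands in for the LLM-based Meta Memory Manager, and every name here (`MemoryStore`, `route`, `ROUTING_HINTS`) is hypothetical.

```python
# The six memory types named in the paper.
MEMORY_TYPES = ("core", "episodic", "semantic", "procedural", "resource", "knowledge_vault")

# Hypothetical keyword hints; the real system routes with an LLM-based agent.
ROUTING_HINTS = {
    "procedural": ("how to", "step"),
    "episodic": ("yesterday", "today", "last week"),
    "knowledge_vault": ("password", "api key"),
}


def route(text: str) -> list[str]:
    """Stand-in for the Meta Memory Manager: pick target memory types."""
    lowered = text.lower()
    targets = [name for name, hints in ROUTING_HINTS.items()
               if any(hint in lowered for hint in hints)]
    # Assumption: fall back to semantic memory when no hint matches.
    return targets or ["semantic"]


class MemoryStore:
    """Toy store holding one list of text entries per memory type."""

    def __init__(self) -> None:
        self.entries: dict[str, list[str]] = {name: [] for name in MEMORY_TYPES}

    def update(self, text: str) -> list[str]:
        """Route the text, append it to each target memory, return the targets."""
        targets = route(text)
        for name in targets:
            self.entries[name].append(text)
        return targets

    def retrieve(self, query: str) -> list[str]:
        """Return every entry sharing at least one word with the query."""
        words = set(query.lower().split())
        return [entry for items in self.entries.values() for entry in items
                if words & set(entry.lower().split())]


store = MemoryStore()
store.update("How to deploy: step 1 build, step 2 push")
store.update("Paris is the capital of France")
print(store.retrieve("capital of France"))
```

Keeping routing separate from storage mirrors the split between the Meta Memory Manager and the per-type managers, so the routing rule can be swapped without touching the stores.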

System Modules

Screen Capture Client

Captures screenshots every 1.5s, filters duplicates, and streams to backend

Model or implementation: React-Electron Application
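
The duplicate-filtering step might look like the sketch below. Two assumptions are worth flagging: the summary does not say how duplicates are detected, so exact byte hashing is used here purely for simplicity, and the actual client is a React-Electron application, so Python is only illustrative. The names `filter_duplicates` and `CAPTURE_INTERVAL_S` are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator

CAPTURE_INTERVAL_S = 1.5  # capture period stated in the summary


def filter_duplicates(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield only frames whose bytes differ from the previously kept frame."""
    last_digest = None
    for frame in frames:
        digest = hashlib.sha256(frame).hexdigest()
        if digest != last_digest:
            last_digest = digest
            yield frame


# Stand-in byte strings instead of real screenshot data.
frames = [b"desktop", b"desktop", b"browser", b"browser", b"desktop"]
kept = list(filter_duplicates(frames))
print(kept)
print(f"{len(frames) * CAPTURE_INTERVAL_S:.1f}s of capture, {len(kept)} frames kept")
```

Only consecutive repeats are dropped, so a screen the user returns to later is still streamed to the backend.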

Meta Memory Manager

Routes incoming information to the appropriate memory component manager

Model or implementation: Not explicitly specified (implied LLM-based agent)

Memory Managers (x6)

Manage updates and internal structure for their specific memory type (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault)

Model or implementation: LLM-based agents

Chat Agent

Interacts with user, performing Active Retrieval by generating topics before answering

Model or implementation: LLM-based agent
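
Active Retrieval, as described above, generates a topic first and only then searches memory. A minimal sketch of that two-step flow follows. The `generate_topic` stub stands in for an LLM call, and the stopword list, term-overlap scoring, and function names are all hypothetical choices made for illustration.

```python
STOPWORDS = {"what", "did", "i", "the", "a", "an", "about", "is", "was", "on", "in", "do", "my"}


def generate_topic(query: str) -> str:
    """Stub for the LLM call that proposes a search topic from the user query."""
    words = [w.strip("?.,!").lower() for w in query.split()]
    return " ".join(w for w in words if w and w not in STOPWORDS)


def active_retrieve(query: str, memories: dict[str, list[str]],
                    top_k: int = 2) -> list[tuple[str, str]]:
    """Generate a topic, then rank entries from all memory types by term overlap."""
    topic_terms = set(generate_topic(query).split())
    scored = []
    for memory_type, entries in memories.items():
        for entry in entries:
            overlap = len(topic_terms & set(entry.lower().split()))
            if overlap:
                scored.append((overlap, memory_type, entry))
    # Stable sort: ties keep the order in which memory types were scanned.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(memory_type, entry) for _, memory_type, entry in scored[:top_k]]


memories = {
    "episodic": ["2025-03-02 reviewed the quarterly budget spreadsheet"],
    "semantic": ["the budget owner is the finance team"],
    "procedural": ["to export a report open the file menu"],
}
print(generate_topic("What did I do about the budget?"))
print(active_retrieve("What did I do about the budget?", memories))
```

Searching with the generated topic rather than the raw question keeps filler words out of the match, which is the intuition behind requiring a topic before answering.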

Novel Architectural Elements

Taxonomy of six distinct memory types (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) each with unique data structures (hierarchical, tree, list)
Multi-agent governance structure where a Meta Manager coordinates six specific Memory Managers
Active Retrieval mechanism where the agent must generate a search topic before answering

Modeling

Base Model: Gemini (for image processing). The text LLM behind the agents is not explicitly named, though the paper implies high-capacity models.

Compute: Not reported in the paper

Comparison to Prior Work

vs. Mem0: MIRIX uses 6 structured types vs. Mem0's flattened fact list; MIRIX supports heavy multimodal input
vs. Letta: MIRIX adds Procedural, Semantic, and Knowledge Vault components; employs multi-agent management rather than a single OS-like manager
vs. Zep: MIRIX handles sequential events and full documents (Resource/Episodic) better than pure knowledge graph approaches
vs. RAG (Baseline): MIRIX abstracts images into text/logs, reducing storage by 99.9% compared to storing raw vectors for all images

Limitations

ScreenshotVQA benchmark is limited to data from only three PhD students.
Reliance on cloud-based API (Gemini) for image processing introduces privacy and latency dependencies.
The specific underlying LLM used for the text-based agents is not explicitly benchmarked against different model sizes.

Reproducibility

A packaged application is mentioned as available for installation, but no direct code repository URL is provided in the text. The paper mentions releasing the obtained file. The ScreenshotVQA benchmark data was collected from three PhD students, but the paper gives neither a public URL nor an explicit release status for it.

📊 Experiments & Results

Evaluation Setup

Evaluation on visual recall (ScreenshotVQA) and text-only long-context reasoning (LOCOMO)

Benchmarks:

ScreenshotVQA (Multimodal retrieval QA) [New]
LOCOMO (Long-form multi-turn conversation QA)

Metrics:

Accuracy
Storage Requirement
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the newly constructed ScreenshotVQA benchmark showing MIRIX's superiority over RAG and Long-Context baselines in both accuracy and storage efficiency.
ScreenshotVQA	Accuracy vs RAG	Not reported in the paper	Not reported in the paper	Not reported in the paper
ScreenshotVQA	Storage Reduction vs RAG	100%	0.1%	-99.9%
ScreenshotVQA	Accuracy vs Long-Context	Not reported in the paper	Not reported in the paper	Not reported in the paper
ScreenshotVQA	Storage Reduction vs Long-Context	100%	6.7%	-93.3%
Performance on the standard LOCOMO benchmark for text-based long-term memory.
LOCOMO	Overall Accuracy	77.38	85.38	+8.00
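
The figures in the table can be checked with a little arithmetic. The sketch below uses only the numbers reported above and separates the absolute LOCOMO gain (in percentage points) from the relative gain; the helper name `remaining_storage` is hypothetical.

```python
baseline_acc = 77.38  # best prior LOCOMO accuracy from the table
mirix_acc = 85.38     # MIRIX LOCOMO accuracy from the table

absolute_gain = mirix_acc - baseline_acc          # percentage points
relative_gain = absolute_gain / baseline_acc * 100  # percent of the baseline

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")


def remaining_storage(reduction_pct: float) -> float:
    """Storage still used, as a percentage of the baseline, after a reduction."""
    return 100.0 - reduction_pct


for baseline_name, reduction in (("RAG", 99.9), ("Long-Context", 93.3)):
    print(f"vs {baseline_name}: {remaining_storage(reduction):.1f}% of baseline storage")
```

The +8.00 in the Δ column is therefore an absolute difference; relative to the 77.38 baseline it is a gain of roughly 10%.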

Experiment Figures

A chat interface demonstration where the user asks about past activities.

Visualization of Semantic Memory structure.

Main Takeaways

MIRIX achieves state-of-the-art results on LOCOMO (85.38%), approaching the upper bound of long-context models despite using a retrieval-based approach.
The system handles massive multimodal streams (20,000 screenshots) where standard RAG fails due to storage/retrieval noise and long-context models fail due to token limits.
Streaming upload strategies using Gemini reduced latency from 50s (GPT-4) to <5s, enabling real-time memory construction.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Vector Databases
Knowledge of Multi-Agent Systems

Key Terms

RAG: Retrieval-Augmented Generation, in which AI systems answer questions by first searching for relevant documents

ScreenshotVQA: A new benchmark introduced in this paper comprising ~20,000 high-resolution screenshots to test multimodal memory recall

LOCOMO: A long-context benchmark requiring reasoning over long-form, multi-turn conversations

Episodic Memory: Memory component storing time-stamped, specific events and experiences

Semantic Memory: Memory component storing abstract facts, concepts, and entities independent of specific events

Procedural Memory: Memory component storing step-by-step instructions and workflows

Gemini API: A multimodal LLM API from Google used here for processing visual data

React-Electron: A framework for building cross-platform desktop applications using web technologies