Evaluation Setup
LOCOMO benchmark, evaluating long-term memory across four question categories (single-hop, temporal, multi-hop, and open-domain)
Benchmarks:
- LOCOMO (Long-term conversational coherence QA)
Metrics:
- LLM-as-a-Judge Score
- p95 Latency
- Token Cost
- Statistical methodology: Not explicitly reported in the paper
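As a reminder of what the p95 latency metric measures, here is a minimal sketch of a nearest-rank 95th percentile. The summary does not say how the paper computes its percentile, so the nearest-rank method, the function name `p95`, and the sample latencies are all assumptions made for illustration.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies.

    Assumption: nearest-rank is used here for simplicity; the paper's
    exact percentile method is not stated in this summary.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-query latencies in seconds (not taken from the paper).
latencies_s = [0.2, 0.3, 0.25, 1.4, 0.22, 0.31, 0.27, 0.29, 0.24, 0.26]
print(p95(latencies_s))
```

Because p95 tracks the slow tail rather than the average, a single slow outlier in a small sample (1.4 s here) dominates the result, which is why it is a common choice for judging production responsiveness.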
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LOCOMO | LLM-as-a-Judge Score (Relative Improvement) | 1.0 | 1.26 | +0.26 |
| LOCOMO | LLM-as-a-Judge Score | Not explicitly reported in the paper | Not explicitly reported in the paper | Not explicitly reported in the paper |
| System Profiling | p95 Latency Reduction | 100 | 9 | -91 |
| System Profiling | Token Cost Reduction | 100 | 10 | -90 |
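The Δ column in the table above can be read as a change relative to the baseline. The sketch below reproduces that arithmetic, assuming the values are normalized so that the baseline is 1.0 for the judge score and 100 for the profiling rows. That normalization is a reading of the table, not something the summary states, and `relative_change` is a hypothetical helper name.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline


# (metric, baseline, this paper), copied from the table; values are
# assumed to be normalized to the baseline.
rows = [
    ("LLM-as-a-Judge Score (relative)", 1.0, 1.26),
    ("p95 Latency", 100, 9),
    ("Token Cost", 100, 10),
]

for name, base, val in rows:
    print(f"{name}: {relative_change(base, val):+.0%}")
# LLM-as-a-Judge Score (relative): +26%
# p95 Latency: -91%
# Token Cost: -90%
```

Under this reading, the judge score rises by about 26% while p95 latency and token cost drop to roughly a tenth of the baseline.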
Main Takeaways
- Mem0 consistently outperforms baselines (RAG, Full-Context, Zep, LangChain) across single-hop, temporal, multi-hop, and open-domain questions.
- The graph-based extension (Mem0-graph) provides additional accuracy gains (~2%) by modeling entity relationships, useful for complex reasoning paths.
- The system is far more efficient than full-context methods (the results table reports about 91% lower p95 latency and about 90% lower token cost than the baseline), making it viable for production use where latency and cost are constraints.