Evaluation Setup
MemoRAG is evaluated on diverse long-context benchmarks covering QA, summarization, and coding.
Benchmarks:
- LongBench: multi-task benchmark (QA, summarization, code, few-shot)
- InfiniteBench: ultra-long context benchmark (up to 100k+ tokens)
- LEval: long-document evaluation
Metrics:
- F1 Score
- ROUGE-L
- Accuracy
- Statistical methodology: Not explicitly reported in the paper
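The three metrics listed above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and lowercasing only; the benchmarks' official scorers apply their own normalization, so this sketch will not reproduce reported numbers exactly. The function names are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure: precision and recall are LCS length over each length."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of exact matches, e.g. for multiple-choice answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

Scores from these functions lie in [0, 1]; the tables in this summary use a 0 to 100 scale, so multiply by 100 to compare.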
Key Results
MemoRAG outperforms baselines significantly on summarization tasks where global context is required.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LongBench (En.Sum) | ROUGE-L | 25.17 | 55.88 | +30.71 |
| InfiniteBench (En.Sum) | ROUGE-L | 13.82 | 45.08 | +31.26 |

MemoRAG also shows superiority in specific retrieval tasks (En.MC) compared to standard RAG and long-context models.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| InfiniteBench (En.MC) | Accuracy | 23.32 | 55.48 | +32.16 |
| InfiniteBench (En.MC) | Accuracy | 22.89 | 55.48 | +32.59 |

Comparison against advanced RAG methods shows MemoRAG's effectiveness.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LongBench (En.Sum) | ROUGE-L | 21.71 | 55.88 | +34.17 |
| LongBench (En.Sum) | ROUGE-L | 25.26 | 55.88 | +30.62 |
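The Δ column is the absolute difference in score points between MemoRAG and the baseline on the same benchmark and metric. As a quick consistency check, the short sketch below recomputes each Δ from the baseline and MemoRAG scores reported above. It uses `Decimal` so the two-decimal values compare exactly; the variable names are illustrative.

```python
from decimal import Decimal

# (benchmark, metric, baseline score, MemoRAG score), copied from the results above.
ROWS = [
    ("LongBench (En.Sum)", "ROUGE-L", "25.17", "55.88"),
    ("InfiniteBench (En.Sum)", "ROUGE-L", "13.82", "45.08"),
    ("InfiniteBench (En.MC)", "Accuracy", "23.32", "55.48"),
    ("InfiniteBench (En.MC)", "Accuracy", "22.89", "55.48"),
    ("LongBench (En.Sum)", "ROUGE-L", "21.71", "55.88"),
    ("LongBench (En.Sum)", "ROUGE-L", "25.26", "55.88"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute improvement in score points (not a relative percentage)."""
    return Decimal(ours) - Decimal(baseline)


DELTAS = [delta(baseline, ours) for _, _, baseline, ours in ROWS]

for (bench, metric, baseline, ours), d in zip(ROWS, DELTAS):
    print(f"{bench:<24} {metric:<8} {baseline:>6} -> {ours:>6}  ({d:+})")
```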
Main Takeaways
- Standard RAG and even GPT-4 struggle heavily with En.Sum (summarization) and En.MC (multiple choice) in InfiniteBench, likely because these tasks require global context awareness, which chunk-based retrieval lacks.
- MemoRAG is particularly dominant in summarization tasks (En.Sum), suggesting the global memory effectively captures high-level narrative arcs that simple retrieval misses.
- The method generalizes well to QA tasks (En.QA), maintaining competitive or superior performance compared to full-context models.
- Efficiency analysis shows MemoRAG is much faster than processing full contexts directly, in both time-to-first-token and decoding speed.
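The two efficiency quantities named in the last takeaway, time-to-first-token and decoding speed, can be measured for any generator that yields tokens as they are produced. The sketch below is a generic illustration and not the paper's benchmarking harness: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real model by sleeping.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(token_stream: Iterable[str]) -> dict:
    """Time a token stream: latency to the first token, then decode throughput."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()

    if first_token_at is None:
        # The stream produced nothing, so neither quantity is defined.
        return {"tokens": 0, "ttft_s": None, "decode_tokens_per_s": None}

    decode_time = end - first_token_at
    # Throughput covers tokens after the first, which is charged to TTFT.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {
        "tokens": count,
        "ttft_s": first_token_at - start,
        "decode_tokens_per_s": rate,
    }


def fake_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: a prefill pause, then evenly spaced tokens."""
    time.sleep(prefill_s)  # simulates reading the context before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


stats = measure_stream(fake_stream(n_tokens=4, prefill_s=0.05, per_token_s=0.01))
print(stats)
```

In a real comparison, the same function would wrap the streaming output of each system on identical inputs, and the measurement would be repeated several times to smooth out timing noise.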