Evaluation Setup
Benchmark consisting of Question Answering (QA), Event Summarization, and Dialogue Generation tasks
Benchmarks:
- LoCoMo QA (Long-term memory recall: Single-hop, Multi-hop, Temporal, Adversarial) [New]
- LoCoMo Summarization (Event Graph Summarization) [New]
- LoCoMo Generation (Multi-modal dialogue generation) [New]
Metrics:
- Accuracy (for QA)
- ROUGE / BERTScore (for Summarization)
- BLEU / Perplexity (for Generation)
- Statistical methodology: Not explicitly reported in the paper
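To make the metric names above concrete, here is a minimal sketch of two of them in plain Python. It rests on assumptions: the summary does not say how QA accuracy is scored, so exact match after light normalization is assumed here, and `rouge1_f1` is a simplified unigram-overlap F1 rather than the official ROUGE implementation (no stemming, no stopword handling). Both function names are hypothetical.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference.

    Assumption: matching is case-insensitive and ignores extra whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        return " ".join(text.lower().split())

    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1: F1 over overlapping unigrams (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For reported numbers, an established metric library is the safer choice; this sketch only illustrates what the scores measure.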
Key Results
Comparison of dataset statistics showing LoCoMo's significantly larger scale compared to prior benchmarks.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LoCoMo vs MSC | Average Turns | 53.3 | 304.9 | +251.6 |
| LoCoMo vs MSC | Average Tokens | 1225.9 | 9209.2 | +7983.3 |

Evaluation of model capabilities on the LoCoMo benchmark (quantitative findings derived from the Introduction summary).

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| LoCoMo QA | Performance Gap vs Human | 100 | 44 | -56 |
| LoCoMo QA | Performance Gap vs Human | 100 | 27 | -73 |
| LoCoMo QA (Adversarial) | Relative Performance | 100 | 17 | -83 |
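The Δ column for the dataset-statistics rows is the LoCoMo value minus the MSC value. The short sketch below recomputes it and also derives a scale ratio, which the table does not report. The helper name `delta_and_ratio` and the rounding choices are illustrative assumptions.

```python
# (metric, MSC value, LoCoMo value) taken from the dataset-statistics rows.
rows = [
    ("Average Turns", 53.3, 304.9),
    ("Average Tokens", 1225.9, 9209.2),
]


def delta_and_ratio(baseline, value):
    """Return (absolute difference, multiplicative ratio), rounded for display."""
    return round(value - baseline, 1), round(value / baseline, 2)


for name, msc, locomo in rows:
    delta, ratio = delta_and_ratio(msc, locomo)
    print(f"{name}: delta={delta:+.1f}, ratio={ratio:.2f}x")
```

The ratio view makes the scale difference easier to compare across rows whose units differ (turns versus tokens).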
Main Takeaways
- Long-context LLMs and RAG improve memory recall (by 22-66%) but still fail to match human consistency, particularly in temporal reasoning.
- Long-context models are highly brittle to adversarial questions (83% drop), often confusing speakers or hallucinating events when the context is very long.
- RAG offers a balanced compromise between short-context precision and long-context recall, especially when dialogues are structured as database assertions.
- Models struggle to capture the causal progression of events (Event Graph Summarization): when simply given the full context window, they lag significantly behind the base-model baselines.
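The RAG takeaway above, where the dialogue is stored as a database of assertions and only the most relevant ones are passed to the model, can be sketched as follows. This is a toy illustration under stated assumptions: bag-of-words cosine similarity stands in for whatever retriever the paper actually uses, and the assertions, speaker names, and function names are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, assertions, k=2):
    """Return the k assertions most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        assertions,
        key=lambda s: cosine(q, Counter(tokenize(s))),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical assertions extracted from a long conversation.
assertions = [
    "Alice adopted a dog in March 2023.",
    "Bob started a new job at a bakery.",
    "Alice ran a marathon in May 2023.",
]

top = retrieve("When did Alice adopt a dog?", assertions, k=1)
# Only the retrieved assertions, not the whole dialogue, would go into the prompt.
prompt_context = "\n".join(top)
```

Retrieving a few short assertions keeps the prompt small, which is the trade-off the takeaway describes: less context than a full long-context run, but more targeted than a truncated recent window.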