Evaluation Setup
Models are edited sequentially with up to 15,000 edits, and reliability, generalization, and locality are evaluated at intervals.
Benchmarks:
- ZSRE (Question Answering / Fact Editing)
- CounterFact (Counterfactual edits)
- MQuAKE (Multi-hop Reasoning)
- Hallucination (Hallucination Correction)
Metrics:
- Edit Success Rate (Reliability)
- Paraphrase Accuracy (Generalization)
- Neighborhood/Locality Accuracy
- Portability (Reasoning)
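The first three metrics can be sketched as below, following definitions commonly used in the model-editing literature: reliability as accuracy on the edited prompts, generalization as accuracy on paraphrases of them, and locality as agreement between the edited and unedited model on unrelated prompts. This is a minimal sketch under stated assumptions, not the paper's evaluation code. The names `accuracy` and `edit_metrics`, the exact-match comparison, and the treatment of a model as a plain prompt-to-string callable are all illustrative assumptions. Portability is omitted because it needs multi-hop questions.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: a "model" is any callable mapping a prompt to an answer string.
Model = Callable[[str], str]


def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their target (whitespace-trimmed)."""
    if not predictions:
        return 0.0
    hits = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return hits / len(predictions)


def edit_metrics(
    edited_model: Model,
    base_model: Model,
    edit_prompts: Sequence[str],
    paraphrase_prompts: Sequence[str],
    targets: Sequence[str],
    unrelated_prompts: Sequence[str],
) -> Dict[str, float]:
    """Compute reliability, generalization, and locality for a batch of edits.

    Assumes paraphrase_prompts[i] rephrases edit_prompts[i] and shares targets[i].
    """
    # Reliability: does the edited model produce the new target on the edit prompt?
    reliability = accuracy([edited_model(p) for p in edit_prompts], targets)
    # Generalization: does the edit carry over to a rephrased prompt?
    generalization = accuracy([edited_model(p) for p in paraphrase_prompts], targets)
    # Locality: on unrelated prompts, the edited model should agree with the
    # unedited model, whatever the unedited model originally answered.
    locality = accuracy(
        [edited_model(p) for p in unrelated_prompts],
        [base_model(p) for p in unrelated_prompts],
    )
    return {
        "reliability": reliability,
        "generalization": generalization,
        "locality": locality,
    }
```

In a sequential-editing run, a function like this would be called at each evaluation interval on the edits applied so far.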
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| ZSRE | Score (Composite) | 59.8 | 78.4 | +18.6 |
| CounterFact | Score (Composite) | 44.1 | 59.4 | +15.3 |
| Hallucination | Generalization | 39.4 | 90.2 | +50.8 |
| MQuAKE-3k | Multi-hop Accuracy | 27.6 | 38.6 | +11.0 |

- Results on ZSRE (10k edits) show MEMOIR maintains high performance where others fail.
- Results on CounterFact (10k edits) demonstrate superior retention.
- Hallucination correction performance shows strong generalization.
- Multi-hop reasoning (MQuAKE) results.
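
The Δ column in the Key Results table is the absolute difference between the "This Paper" and "Baseline" scores. The short check below recomputes it from the table values. The variable and function names are illustrative, and the numbers are copied from the table, not independently measured.

```python
# (baseline, this paper, reported delta), copied from the Key Results table.
REPORTED = {
    "ZSRE": (59.8, 78.4, 18.6),
    "CounterFact": (44.1, 59.4, 15.3),
    "Hallucination": (39.4, 90.2, 50.8),
    "MQuAKE-3k": (27.6, 38.6, 11.0),
}


def recompute_deltas(rows):
    """Return {benchmark: this_paper - baseline}, rounded to one decimal place."""
    return {name: round(ours - base, 1) for name, (base, ours, _) in rows.items()}


deltas = recompute_deltas(REPORTED)
for name, (_, _, reported) in REPORTED.items():
    # Each recomputed difference should agree with the Δ reported in the table.
    assert deltas[name] == reported, name
```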
Main Takeaways
- MEMOIR consistently outperforms baselines (ROME, MEMIT, GRACE, MALMEN) across multiple architectures (LLaMA-3, Mistral, etc.) and benchmarks.
- The method scales to large numbers of edits (up to 15k) with minimal degradation, unlike ROME and MEMIT, which collapse.
- The 'Informed Retention' mechanism (mask matching) drastically improves generalization to paraphrases compared to rigid lookup methods like GRACE.
- Locality is well-preserved because the residual memory is deactivated for irrelevant prompts.
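
The last two takeaways can be illustrated with a toy sketch of mask matching: a paraphrase whose mask overlaps enough with a stored edit mask reuses that edit, while an unrelated prompt matches nothing and the residual memory contributes zero. This is an illustration under stated assumptions, not the authors' implementation. The overlap measure, the `threshold` value, and the names `best_match`, `residual_output`, and `W_mem` are all hypothetical.

```python
import numpy as np


def best_match(query_mask: np.ndarray, stored_masks: np.ndarray):
    """Return (index, overlap ratio) of the stored mask closest to the query.

    Overlap ratio (an assumed measure) is the fraction of the query's active
    positions that are also active in the stored mask.
    """
    active = max(int(query_mask.sum()), 1)  # avoid division by zero
    overlaps = (stored_masks & query_mask).sum(axis=1) / active
    idx = int(np.argmax(overlaps))
    return idx, float(overlaps[idx])


def residual_output(activation, query_mask, stored_masks, W_mem, threshold=0.6):
    """Residual-memory contribution for one prompt.

    activation:   (d,) float vector entering the memory layer
    query_mask:   (d,) bool mask derived from the prompt
    stored_masks: (n_edits, d) bool masks recorded at edit time
    W_mem:        (out_dim, d) residual memory weights
    threshold:    assumed cutoff below which the memory is deactivated
    """
    if stored_masks.shape[0] == 0:
        return np.zeros(W_mem.shape[0])
    idx, ratio = best_match(query_mask, stored_masks)
    if ratio < threshold:
        # Irrelevant prompt: memory is switched off, so locality is preserved.
        return np.zeros(W_mem.shape[0])
    # Relevant prompt (e.g. a paraphrase): reuse the matched edit's mask.
    return W_mem @ (activation * stored_masks[idx])


stored = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0]], dtype=bool)
W = np.ones((2, 6))
act = np.arange(1.0, 7.0)

paraphrase = np.array([1, 1, 1, 0, 0, 0], dtype=bool)  # overlaps edit 0 on 2 of 3
unrelated = np.array([0, 0, 1, 0, 0, 1], dtype=bool)   # overlaps no stored edit

out_para = residual_output(act, paraphrase, stored, W)
out_unrel = residual_output(act, unrelated, stored, W)
```

An exact-key lookup would miss the paraphrase here because its mask differs from the stored one; the overlap threshold is what lets a partial match still retrieve the edit.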