MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs

📝 Paper Summary

Knowledge internalization, Memory organization

MEMOIR introduces a sparse residual memory layer that distributes updates across distinct parameter subsets for each edit, using activation-based hashing to retrieve relevant knowledge while minimizing interference.

Core Problem

Existing parametric model editing methods suffer from catastrophic forgetting during long sequences of edits because new updates overwrite parameters storing previous knowledge.

Why it matters:

LLMs frequently require updates to correct outdated information or hallucinations without expensive full retraining
Current methods fail to scale to thousands of sequential edits, degrading performance on both new and old data
Balancing reliability (learning the edit), generalization (handling rephrasings), and locality (not damaging other knowledge) remains unsolved for long edit streams

Concrete Example: When editing a model to update the answer for 'Where was the last Summer Olympics held?', standard methods might overwrite the weights used for a previous edit like 'Who won the 2022 World Cup?', causing the model to forget the earlier fact.

Key Novelty

Sparse Residual Memory with TopHash Retrieval

Introduces a residual memory layer (a parallel fully-connected layer) initialized to zero, preserving the original pre-trained weights
Uses 'TopHash': a mechanism that generates sample-dependent sparse masks based on activation magnitudes, ensuring only a small subset of memory parameters is updated per edit
Retrieves edits at inference by comparing the sparse mask of a new query to the stored edit masks; if the overlap with a stored mask is high enough, the residual memory is activated, otherwise it is skipped

Architecture

The workflow of MEMOIR. It shows the original FFN layer (frozen) and the parallel Residual Memory layer (trainable). It illustrates the 'TopHash' process generating a sparse mask from input activations, which is used to select specific columns in the Residual Memory for updating/retrieval.

Evaluation Highlights

Achieves state-of-the-art editing performance on LLaMA-3-8B, maintaining high reliability and locality even after 15,000 sequential edits
Outperforms existing methods (such as GRACE and MALMEN) on multi-hop reasoning and hallucination correction benchmarks
Significantly reduces catastrophic forgetting compared to ROME and MEMIT, which degrade rapidly as the number of edits increases

Breakthrough Assessment

8/10

Strong contribution to the specific sub-field of lifelong model editing. The use of sparse, activation-based addressing substantially reduces interference between edits in the reported experiments, addressing a major bottleneck in current techniques when the number of edits is large.

⚙️ Technical Details

Problem Definition

Setting: Lifelong model editing where a model receives a stream of edit pairs (x, y) and must update its parameters to output y for x while preserving performance on past edits and irrelevant data.

Inputs: Input prompt x (e.g., a question)

Outputs: Predicted text y (e.g., the corrected answer)
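The setting above can be written compactly. The following is a sketch in assumed notation (the symbols f_t, Edit, and x_i' are introduced here for illustration and are not taken from the paper); the three conditions correspond to the reliability, generalization, and locality goals named in the Core Problem section.

```latex
\begin{aligned}
&\text{Edit stream: } \{(x_t, y_t)\}_{t=1}^{T}, \qquad f_t = \mathrm{Edit}(f_{t-1}, x_t, y_t), \quad f_0 = \text{pre-trained model} \\
&\text{Reliability: } f_t(x_i) = y_i \quad \text{for all } i \le t \\
&\text{Generalization: } f_t(x_i') = y_i \quad \text{for rephrasings } x_i' \text{ of } x_i \\
&\text{Locality: } f_t(x) = f_0(x) \quad \text{for prompts } x \text{ unrelated to any edit}
\end{aligned}
```

In practice these equalities are measured as accuracies over held-out sets rather than enforced exactly.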

Pipeline Flow

Input Processing (Mask Generation)
Retrieval & Routing (Mask Matching)
Knowledge Injection (Residual Update)

System Modules

TopHash Generator

Generate a sparse binary mask based on input activations to determine which memory parameters to use

Model or implementation: Algorithmic (Top-k selection + Permutation)
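A minimal sketch of what "Top-k selection + Permutation" could look like, assuming the mask is built by taking the k largest-magnitude activations and mapping their indices through one fixed permutation. How the paper applies the permutation is not stated in this summary, so the helper names `make_permutation` and `tophash_mask` and the exact order of operations are illustrative assumptions.

```python
import random


def make_permutation(dim, seed=0):
    """Fixed random permutation of feature indices (assumed: drawn once, reused for all inputs)."""
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    return perm


def tophash_mask(activations, k, perm):
    """Return a binary mask with exactly k ones.

    Step 1: pick the k indices with the largest absolute activation.
    Step 2: map each picked index through the fixed permutation (assumed detail).
    """
    order = sorted(range(len(activations)),
                   key=lambda i: abs(activations[i]),
                   reverse=True)
    mask = [0] * len(activations)
    for i in order[:k]:
        mask[perm[i]] = 1
    return mask


# Toy activations; the three largest magnitudes sit at indices 1, 3, and 6.
acts = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -0.4]
perm = make_permutation(len(acts), seed=0)
mask = tophash_mask(acts, k=3, perm=perm)
identity_mask = tophash_mask(acts, k=3, perm=list(range(len(acts))))
```

Because the permutation is fixed, the same input always yields the same mask, and inputs with similar top activations yield overlapping masks.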

Edit Retriever

Determine if the input corresponds to a known edit or a rephrasing of one

Model or implementation: Nearest Neighbor Search (Hamming Distance)
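The retrieval step can be illustrated with a brute-force nearest-neighbor search over stored masks. This is a sketch under assumptions: the function names and the rule "activate if the nearest mask is within `tau` bit flips" are illustrative, and the paper's actual threshold value and matching criterion are not reported in this summary.

```python
def hamming(mask_a, mask_b):
    """Number of positions where two equal-length binary masks differ."""
    return sum(a != b for a, b in zip(mask_a, mask_b))


def retrieve(query_mask, stored_masks, tau):
    """Return the index of the nearest stored edit mask if it is within tau, else None."""
    if not stored_masks:
        return None
    best = min(range(len(stored_masks)),
               key=lambda i: hamming(query_mask, stored_masks[i]))
    if hamming(query_mask, stored_masks[best]) <= tau:
        return best
    return None


# Two stored edits, each with a 3-hot mask over 8 features.
stored = [
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0],
]
rephrase = [1, 1, 0, 0, 0, 0, 1, 0]   # shares two active indices with edit 0
unrelated = [0, 0, 1, 1, 0, 0, 0, 1]  # shares only one index with each edit

hit = retrieve(rephrase, stored, tau=2)
miss = retrieve(unrelated, stored, tau=2)
```

Here the rephrased query is 2 bits from the first stored mask and is routed to that edit, while the unrelated query is 4 bits from both and falls back to the unedited model.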

Residual Memory Layer

Store and output the edited knowledge

Model or implementation: Linear Layer (W_m)

Novel Architectural Elements

Integration of a sparse residual memory layer (W_m) parallel to the FFN projection layer
Dynamic activation routing mechanism that enables/disables the memory module based on mask similarity (Informed Retention)
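As a rough illustration of how the two elements fit together, the sketch below adds a masked residual term to a frozen projection only when the retriever reports a match. The formulation `W0 a + Wm (mask * a)` is an assumption based on the description of the mask selecting columns of the residual memory; names and shapes are illustrative.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_forward(W0, Wm, a, mask, matched):
    """Frozen projection plus an optional sparse residual.

    matched=False: the residual memory is skipped, so the output equals the original layer.
    matched=True:  only the input features selected by the mask reach Wm,
                   which means only the matching columns of Wm contribute.
    """
    base = matvec(W0, a)
    if not matched:
        return base
    masked_a = [ai * mi for ai, mi in zip(a, mask)]
    residual = matvec(Wm, masked_a)
    return [b + r for b, r in zip(base, residual)]


W0 = [[1, 0, 0], [0, 1, 0]]          # frozen pre-trained weights (2 x 3)
Wm_init = [[0, 0, 0], [0, 0, 0]]     # residual memory starts at zero
Wm_edited = [[0, 0, 5], [0, 2, 0]]   # hypothetical values after some edits
a = [1, 2, 3]
mask = [0, 1, 0]                     # only feature 1 is active

out_zero_init = layer_forward(W0, Wm_init, a, mask, matched=True)
out_matched = layer_forward(W0, Wm_edited, a, mask, matched=True)
out_skipped = layer_forward(W0, Wm_edited, a, mask, matched=False)
```

With a zero-initialized memory the layer reproduces the original output, and for unmatched prompts the edited memory has no effect, which is the mechanism the summary credits for preserved locality.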

Modeling

Base Model: Evaluated on LLaMA-3-8B, Mistral-7B, LLaMA-2-7B, and GPT-J-6B

Training Method: Lifelong Model Editing (Sequential Gradient Updates)

Objective Functions:

Purpose: Minimize the difference between model output and target edit label on the specific subset of parameters defined by the mask.

Formally: Standard language modeling loss (Cross-Entropy) constrained to masked parameters.
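One way to make "constrained to masked parameters" concrete is shown below. The notation is assumed for illustration: W_0 is the frozen projection, a(x) the input activation of the edited layer, m(x) the binary TopHash mask, and h(x) the layer output.

```latex
\begin{aligned}
h(x) &= W_0\, a(x) + W_m \big( m(x) \odot a(x) \big) \\
\mathcal{L}(x, y) &= -\log p\big(y \mid x;\, W_0, W_m\big) \\
\frac{\partial \mathcal{L}}{\partial W_m} &= \frac{\partial \mathcal{L}}{\partial h}\, \big( m(x) \odot a(x) \big)^{\top}
\end{aligned}
```

Under this formulation, column j of the gradient is zero whenever m_j(x) = 0, so a gradient step on the cross-entropy loss changes only the columns of W_m selected by the mask.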

Adaptation: Residual Memory (W_m) updates only; original weights frozen

Trainable Parameters: Parameters of the residual memory layer W_m corresponding to the active mask indices
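A small sketch of an update restricted to the active mask indices, assuming the layer output is `W0 a + Wm (mask * a)` so that the gradient with respect to `Wm[i][j]` is `grad_out[i] * mask[j] * a[j]`. The upstream gradient `grad_out` would come from backpropagating the cross-entropy loss; here it is supplied by hand, and the function name and plain SGD rule are illustrative assumptions.

```python
def masked_sgd_step(Wm, grad_out, a, mask, lr):
    """One SGD step on the residual memory, touching only masked columns.

    Wm:       residual memory as a list of rows (modified in place and returned)
    grad_out: dL/dh, gradient of the loss w.r.t. the layer output (assumed given)
    a:        input activations to the layer
    mask:     binary mask; columns with mask[j] == 0 are left unchanged
    """
    for i, g in enumerate(grad_out):
        for j, (aj, mj) in enumerate(zip(a, mask)):
            if mj:
                Wm[i][j] -= lr * g * aj
    return Wm


Wm = [[0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]
grad_out = [1.0, -2.0]
a = [0.5, 1.0, -1.0, 2.0]
mask = [0, 1, 0, 1]

updated = masked_sgd_step(Wm, grad_out, a, mask, lr=0.5)
```

Columns 0 and 2 stay at zero after the step, so a later edit whose mask selects those columns would not overwrite what this edit stored.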

Training Data:

ZSRE (Question Answering)
CounterFact (Fact Editing)
MQuAKE (Multi-hop Reasoning)
Hallucination Correction dataset

Key Hyperparameters:

active_indices_k: Not explicitly reported in the paper
threshold_tau: Not explicitly reported in the paper

Comparison to Prior Work

vs. ROME/MEMIT: MEMOIR uses a residual module and sparsity to prevent overwriting, whereas ROME/MEMIT modify original weights and degrade with many sequential edits.
vs. GRACE: MEMOIR uses activation-based hashing (TopHash) for retrieval, which generalizes better to rephrasings than GRACE's rigid codebook matching.
vs. Non-parametric methods (SERAC): MEMOIR integrates knowledge into parameters (residual) rather than keeping a separate external retriever/classifier loop, aiming for better generalization.

Limitations

Reliance on a fixed threshold for mask matching may require tuning for different datasets
Memory overhead of storing the residual matrix W_m (same shape as the FFN projection matrix it runs parallel to)
Inference latency increases slightly due to mask computation and retrieval step
Performance depends on the quality of the embeddings/activations for the hashing step

Reproducibility

Code: https://github.com/qym7/MEMOIR

The paper uses standard datasets (ZSRE, CounterFact). The specific values of k (sparsity) and tau (threshold) are not given in the main text but may be available in the codebase.

📊 Experiments & Results

Evaluation Setup

Sequential editing of models with up to 15,000 edits, evaluating reliability, generalization, and locality at intervals.

Benchmarks:

ZSRE (Question Answering / Fact Editing)
CounterFact (Counterfactual edits)
MQuAKE (Multi-hop Reasoning)
Hallucination (Hallucination Correction)

Metrics:

Edit Success Rate (Reliability)
Paraphrase Accuracy (Generalization)
Neighborhood/Locality Accuracy
Portability (Reasoning)
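The first three metrics can be sketched as exact-match accuracies over different prompt sets. This is an illustrative simplification: the benchmarks may use token-level or other scoring, and the function names and toy models below are hypothetical.

```python
def accuracy(preds, targets):
    """Fraction of predictions that exactly match their targets."""
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def editing_metrics(model_before, model_after, edits, paraphrases, unrelated):
    """Compute reliability, generalization, and locality.

    edits:       list of (prompt, target) pairs that were edited
    paraphrases: list of (rephrased prompt, target) pairs
    unrelated:   list of prompts that no edit should affect
    """
    reliability = accuracy([model_after(x) for x, _ in edits],
                           [y for _, y in edits])
    generalization = accuracy([model_after(x) for x, _ in paraphrases],
                              [y for _, y in paraphrases])
    # Locality compares the edited model with the original model, not with ground truth.
    locality = accuracy([model_after(x) for x in unrelated],
                        [model_before(x) for x in unrelated])
    return {"reliability": reliability,
            "generalization": generalization,
            "locality": locality}


# Toy lookup-table "models" standing in for an LLM before and after editing.
before_table = {"Who leads Acme?": "Ann", "Capital of France?": "Paris"}
after_table = {"Who leads Acme?": "Bob",
               "Who is Acme's leader?": "Ann",
               "Capital of France?": "Paris"}

scores = editing_metrics(
    model_before=lambda x: before_table.get(x, ""),
    model_after=lambda x: after_table.get(x, ""),
    edits=[("Who leads Acme?", "Bob")],
    paraphrases=[("Who is Acme's leader?", "Bob")],
    unrelated=["Capital of France?"],
)
```

In this toy case the edit is learned and unrelated knowledge is intact, but the rephrased prompt still returns the old answer, so generalization is zero while reliability and locality are perfect.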

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ZSRE (10k edits)	Score (Composite)	59.8	78.4	+18.6
CounterFact (10k edits)	Score (Composite)	44.1	59.4	+15.3
Hallucination	Generalization	39.4	90.2	+50.8
MQuAKE-3k	Multi-hop Accuracy	27.6	38.6	+11.0

ZSRE: MEMOIR maintains high performance where others fail.
CounterFact: results demonstrate superior retention.
Hallucination: correction performance shows strong generalization.
MQuAKE-3k: multi-hop reasoning results.

Experiment Figures

Performance curves (Accuracy vs. Number of Edits) for MEMOIR compared to ROME, MEMIT, and FT-L (Fine-tuning) on LLaMA-3.

Main Takeaways

MEMOIR consistently outperforms baselines (ROME, MEMIT, GRACE, MALMEN) across multiple architectures (LLaMA-3, Mistral, etc.) and benchmarks.
The method scales exceptionally well to large numbers of edits (up to 15k) with minimal degradation, unlike ROME/MEMIT which collapse.
The 'Informed Retention' mechanism (mask matching) drastically improves generalization to paraphrases compared to rigid lookup methods like GRACE.
Locality is well-preserved because the residual memory is deactivated for irrelevant prompts.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer Feed-Forward Networks (FFN) as key-value memories
Familiarity with catastrophic forgetting in continual learning
Basic knowledge of parameter-efficient fine-tuning (PEFT) concepts

Key Terms

residual memory: A dedicated parameter module added to the model (specifically parallel to an FFN layer) to store new knowledge without modifying original weights

catastrophic forgetting: The tendency of a neural network to abruptly lose previously learned information when it is trained on new information

LSH: Locality-Sensitive Hashing—a method to hash similar input items into the same 'buckets' with high probability

TopHash: The paper's proposed method for generating sparse masks based on the top-k magnitude activations, acting as a semantic fingerprint for inputs

Hamming distance: The number of positions at which two equal-length binary strings (here, masks) differ; used to decide whether a query is close enough to a stored edit to count as the same fact or a rephrasing of it

OOD: Out-Of-Distribution—data that is different from the training or edit distribution, used here to test generalization

FFN: Feed-Forward Network—the dense layers within a Transformer block, often hypothesized to store factual knowledge