MEGa: Memorization and knowledge injection in gated LLMs

📝 Paper Summary

Memory internalization, Continual Learning (CL)

MEGa enables LLMs to continually learn and recall specific episodic memories by storing each memory in a dedicated, query-gated LoRA adapter rather than overwriting shared weights.

Core Problem

Standard fine-tuning for knowledge injection causes catastrophic forgetting of previous memories, while RAG relies on external buffers rather than internalizing knowledge like biological long-term memory.

Why it matters:

Current methods struggle to sequentially add new memories without degrading general language capabilities or forgetting old data
RAG models retrieval from an external store rather than the biological process of long-term memory formation via synaptic changes
Hebbian learning rules in classical RNNs fail to store highly correlated, semantic-rich data at scale

Concrete Example: When a model sequentially learns stories about fictional characters (e.g., a trip to the Alps), standard fine-tuning (Full or LoRA) forgets earlier stories as it learns new ones. MEGa retains access to the 'Alps' story even after learning subsequent, unrelated events.

Key Novelty

Memory Embedded in Gated LLMs (MEGa)

Assigns a unique, trainable LoRA adapter to each new memory (document) during the fine-tuning phase
Freezes the adapter after training and stores a 'context key' (embedding) derived from the document
At inference, a gating mechanism compares the user query to all context keys and activates only the relevant adapters via a weighted sum

Architecture

Schematic of MEGa's fine-tuning and inference process. It shows how distinct LoRA adapters are created for each memory and how a query-based gating mechanism selects them during inference.

Evaluation Highlights

Achieves recall cosine similarity above 0.9 on the Fictional Character dataset after 50 sequential tasks, while the LoRA and full fine-tuning baselines drop to about 0.1
Maintains near-perfect Question Answering accuracy (~100%) on Wikipedia events, significantly outperforming regularization baselines (EWC, L2) which degrade to ~20-40%
Preserves general language ability (MMLU ~66%), comparable to the frozen base model, whereas full fine-tuning degrades to ~38%

Breakthrough Assessment

7/10

Strong empirical results on mitigating catastrophic forgetting for sequential memory injection. The architecture is novel for this specific use case, though the scalability of storing one adapter per memory is a potential limitation.

⚙️ Technical Details

Problem Definition

Setting: Continual learning where a model sequentially processes datasets D_1...D_n (memories), then answers queries requiring recall of specific D_i

Inputs: Natural language query q (related to a specific past memory or general knowledge)

Outputs: Generated text response (either exact story reconstruction or answer to a question)

Pipeline Flow

Query Embedding Extraction
Gating Mechanism (Similarity Calculation)
Adapter Activation (Weighted Sum)
Token Generation

System Modules

Base LLM

Provide pretrained language capabilities and initial embeddings

Model or implementation: Llama-3.1-8B-Instruct

Gating Mechanism

Calculate relevance scores between query and stored memory keys

Model or implementation: Cosine similarity + Softmax
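
The gating step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes `beta` multiplies the cosine similarity before the softmax (an inverse temperature), and the 3-dimensional embeddings are made up. The function names `cosine` and `gating_weights` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def gating_weights(query_emb, context_keys, beta=1.0):
    """Softmax over beta-scaled similarities; the weights sum to 1.

    Assumption: beta acts as an inverse temperature on the scores.
    """
    scores = [beta * cosine(query_emb, key) for key in context_keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy context keys, one per stored memory (illustrative values only).
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

sharp = gating_weights(query, keys, beta=10.0)  # concentrates on memory 0
flat = gating_weights(query, keys, beta=0.1)    # spreads weight across memories
```

A larger `beta` concentrates the weight on the best-matching key, while a smaller one mixes several memories. How sharp the gating is in practice also depends on the scale of the similarity scores, which this sketch does not try to match.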

MEGa Adapters

Inject memory-specific knowledge into the forward pass

Model or implementation: Collection of LoRA adapters (Rank=128)
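
As a rough sketch of how gated adapters could enter the forward pass of one linear layer, the snippet below computes y = W x + sum_i g_i B_i (A_i x). It is an assumption-laden toy: the usual LoRA scaling factor is omitted, the matrices are tiny, and `gated_lora_forward` is a hypothetical name rather than anything from the paper.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def gated_lora_forward(W, adapters, gates, x):
    """Frozen base output plus the gate-weighted sum of low-rank updates.

    adapters: list of (A, B) pairs, A is rank x d_in, B is d_out x rank.
    gates:    one non-negative weight per adapter.
    """
    y = matvec(W, x)
    for (A, B), g in zip(adapters, gates):
        if g == 0.0:
            continue  # skip adapters the gate has switched off
        delta = matvec(B, matvec(A, x))
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y

# Toy sizes: d_in = d_out = 2, rank = 1 (the summary reports rank 128).
W = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = ([[1.0, 1.0]], [[1.0], [0.0]])
adapter_b = ([[1.0, -1.0]], [[0.0], [1.0]])
x = [2.0, 1.0]

only_a = gated_lora_forward(W, [adapter_a, adapter_b], [1.0, 0.0], x)
mixed = gated_lora_forward(W, [adapter_a, adapter_b], [0.5, 0.5], x)
```

With a gate of zero on every adapter the layer reduces to the frozen base model, which is one way to read why general ability is preserved on unrelated queries.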

Novel Architectural Elements

One-LoRA-per-memory storage strategy where new adapters are created for each new data sample
Inference-time global gating where adapter influence is determined by semantic similarity between the query and the memory's original content key

Modeling

Base Model: Llama-3.1-8B-Instruct

Training Method: Continual Fine-tuning with Gated LoRA Adapters

Objective Functions:

Purpose: Minimize prediction error on the memory text.

Formally: Standard causal language modeling loss (Next Token Prediction) minimized over the specific LoRA adapter parameters for the current memory.
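
For readers who want the objective spelled out, here is a minimal sketch of a next-token cross-entropy loss over one memory text. The logits and token ids are toy values; in the setting described above, only the current memory's LoRA parameters would receive gradients, which this gradient-free sketch does not show.

```python
import math

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits[t] is the score vector over the vocabulary at position t;
    token_ids is the memory text as a list of vocabulary indices.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        top = max(row)
        # log of the softmax normalizer, computed stably
        log_z = top + math.log(sum(math.exp(v - top) for v in row))
        total += log_z - row[token_ids[t + 1]]
    return total / (len(token_ids) - 1)

# Toy vocabulary of 3 tokens and a 3-token "memory".
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

uniform_loss = next_token_loss(uniform_logits, tokens)
confident_loss = next_token_loss(confident_logits, tokens)
```

A model that is uniform over the vocabulary scores log(3) here, and the loss falls as probability mass moves onto the correct next tokens.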

Adaptation: LoRA (rank=128) applied to MLP layers

Trainable Parameters: LoRA matrices A and B for each new memory (Base model frozen)

Training Data:

Fictional Character Dataset: 50 synthetic stories (avg 41.93 words) generated by GPT-4.5
Wikipedia 2024 Events: 1,000 articles sampled from 2024 events (avg 41.55 words)

Key Hyperparameters:

rank: 128
gating_beta: 1 (main experiments), 0.1 (compositional)
learning_rate: Not reported in the paper
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Continual LoRA: MEGa keeps adapters separate and gates them dynamically, preventing interference
vs. RAG: MEGa internalizes knowledge into weights rather than using context window (aiming for biological plausibility)
vs. MELO: MEGa uses global semantic similarity for gating rather than local activation levels
vs. O-LoRA [not cited in paper]: O-LoRA learns orthogonal subspaces for tasks sequentially; MEGa explicitly allocates new parameters per memory instance

Limitations

Scalability concerns: Storing a distinct LoRA adapter for every memory means parameter storage grows linearly with the number of memories
Inference cost: Computing the weighted sum of adapters requires iterating over all stored memories (though the gating weights may be sparse)
Limited to offline setting: The current setup assumes memories arrive as distinct samples to be trained on one by one
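
The linear storage growth can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes that rank-128 LoRA is applied to all three MLP projections (gate, up, down) in each of the 32 layers of Llama-3.1-8B; the summary only says "MLP layers", so the set of adapted projections is an assumption, as are the helper names.

```python
RANK = 128
HIDDEN = 4096              # Llama-3.1-8B hidden size
INTERMEDIATE = 14336       # Llama-3.1-8B MLP intermediate size
NUM_LAYERS = 32
PROJECTIONS_PER_LAYER = 3  # assumption: gate, up, and down are all adapted

def lora_params(d_in, d_out, rank):
    """A is rank x d_in and B is d_out x rank, so rank * (d_in + d_out)."""
    return rank * (d_in + d_out)

per_projection = lora_params(HIDDEN, INTERMEDIATE, RANK)
per_memory = per_projection * PROJECTIONS_PER_LAYER * NUM_LAYERS

def total_adapter_params(num_memories):
    """Adapter parameters stored after num_memories sequential memories."""
    return per_memory * num_memories

for n in (1, 50, 1000):
    print(n, total_adapter_params(n))
```

Under these assumptions each memory costs the same fixed number of parameters, so the total scales directly with the number of stored memories. A different choice of adapted projections changes the constant but not the linear trend.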

Reproducibility

No replication artifacts (code, weights, data) are provided or linked in the paper. The datasets are described as GPT-4.5-generated stories and crawled Wikipedia articles, but neither the seeds nor the exact dataset files are linked.

📊 Experiments & Results

Evaluation Setup

Continual learning on sequences of 50 (Fictional) or 1000 (Wikipedia) text samples

Benchmarks:

Fictional Character Dataset (Synthetic episodic memory recall and QA) [New]
Wikipedia 2024 Events (Real-world event knowledge injection) [New]
MMLU (General knowledge retention)

Metrics:

Recall Cosine Similarity (vector similarity between generated story and ground truth)
QA Accuracy (judged by GPT-o3-mini)
Log Probability (of ground truth answer)
MMLU Macro Accuracy
Statistical methodology: Experiments repeated on 20 dataset partitions; mean and standard deviation reported.

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MEGa significantly outperforms continual learning baselines on the Fictional Character dataset, maintaining high recall and QA accuracy where others fail.
Fictional Character Dataset	Recall Cosine Similarity	0.10	0.95	+0.85
Fictional Character Dataset	QA Accuracy (GPT Judge)	0.10	0.98	+0.88
On the Wikipedia dataset, MEGa maintains high performance while baselines degrade, though RAG remains the ceiling.
Wikipedia 2024 Events	QA Accuracy (GPT Judge)	0.38	0.98	+0.60
Wikipedia 2024 Events	Log Prob	-0.20	-0.05	+0.15
MEGa preserves general capabilities (MMLU) better than full fine-tuning approaches.
MMLU	Macro Accuracy	0.38	0.66	+0.28

Experiment Figures

Performance curves (Recall Cosine, QA Accuracy, Log Prob, MMLU) over 50 sequential tasks on the Fictional Character dataset.

Performance on Wikipedia 2024 Events dataset, comparing MEGa against RAG and various fine-tuning baselines.

Main Takeaways

MEGa effectively eliminates catastrophic forgetting for up to 50 sequential tasks in the fictional dataset, matching Batch learning performance.
The proposed 'iRAG' (Internal RAG) strategy, which recalls the story before answering, improves QA accuracy compared to answering directly from weights.
Regularization methods (EWC, L2) fail to prevent forgetting in this sequential knowledge injection setting, often performing no better than standard fine-tuning.
MEGa enables compositional reasoning (combining two memories) better than baselines, though tuning the gating parameter (beta) is required for optimal mixing.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Low-Rank Adaptation (LoRA)
Familiarity with Continual Learning and Catastrophic Forgetting
Basic knowledge of Transformer architecture and Mixture of Experts (MoE) concepts

Key Terms

LoRA: Low-Rank Adaptation—a technique to fine-tune large models by training small rank-decomposition matrices while freezing the main weights

Catastrophic Forgetting: A phenomenon where a neural network abruptly forgets previously learned information upon learning new information

Gated Network: A neural network architecture where specific paths or modules are activated or suppressed based on the input

MMLU: Massive Multitask Language Understanding—a benchmark evaluating models on tasks ranging from elementary math to professional law

iRAG: Internal RAG—a proposed method where the model first self-generates a relevant memory (recall) and then uses that generation as context to answer a question
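
The two-step recall-then-answer flow can be sketched as below. The prompt wording, the `generate` callable, and the stub model are all hypothetical stand-ins for illustration; the paper's actual prompts are not given in this summary.

```python
def irag_answer(generate, question):
    """Recall a memory first, then answer with that recall as context.

    generate: any callable mapping a prompt string to generated text
    (hypothetical interface; a real system would call the gated LLM).
    """
    recall_prompt = f"Recall the memory relevant to this question: {question}"
    recalled = generate(recall_prompt)
    answer_prompt = f"Context: {recalled}\nQuestion: {question}\nAnswer:"
    return generate(answer_prompt)

def fake_generate(prompt):
    """Stub model with one made-up memory, used only to run the sketch."""
    if prompt.startswith("Recall"):
        return "Ana hiked in the Alps in June."
    return "June" if "Alps in June" in prompt else "unknown"

answer = irag_answer(fake_generate, "When did Ana hike in the Alps?")
```

The point of the design is that the second call sees the recalled story in its context window, even though that story came from the model's own weights rather than an external retriever.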

EWC: Elastic Weight Consolidation—a regularization technique that penalizes changes to important parameters to prevent forgetting
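
As a small worked example of the standard EWC penalty, the sketch below computes (lambda / 2) * sum_i F_i (theta_i - theta*_i)^2, where F_i is a per-parameter importance (typically a diagonal Fisher estimate) and theta* holds the weights after the previous task. The numbers are toy values, and how the paper configured its EWC baseline is not stated in this summary.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Quadratic penalty for moving important parameters away from anchors."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

params = [1.0, 2.0]   # current weights
anchor = [0.0, 2.0]   # weights after the previous task
fisher = [4.0, 1.0]   # importance of each weight (toy values)

penalty = ewc_penalty(params, anchor, fisher, lam=0.5)
# Setting every importance to 1 gives a plain L2 pull toward the old weights.
l2_style = ewc_penalty(params, anchor, [1.0, 1.0], lam=0.5)
```

This penalty is added to the new task's loss, so it slows drift in shared weights rather than isolating memories in separate parameters as MEGa does.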

Context Key: A stored vector representation of a memory (fine-tuning data) used to calculate similarity with user queries during inference