MemLLM: Finetuning LLMs to Use An Explicit Read-Write Memory

📝 Paper Summary

Memory recall, Memory organization

MemLLM fine-tunes a language model to interact with an external, structured database via explicit read/write API calls, enabling interpretable knowledge storage and editing without retraining parameters.

Core Problem

Standard LLMs rely on implicit parametric memory, making it difficult to update facts, memorize rare events, and interpret stored knowledge, while RAG methods often lack structure for precise editing.

Why it matters:

Parametric knowledge degrades over time and requires expensive retraining or unreliable editing methods to update.
Unstructured RAG storage complicates atomic fact editing (changing one fact might require modifying many documents to prevent contradictions).
Lack of interpretability in parametric memory makes preventing hallucinations and verifying stored facts challenging.

Concrete Example: When a fact changes (e.g., a country gets a new prime minister), a standard LLM may hallucinate or return outdated information. Parametric editing methods such as ROME struggle with sequential updates. MemLLM instead issues a `MEM_WRITE` command that updates the corresponding triple in the database.

Key Novelty

Explicit Read-Write Memory via API Fine-tuning

Treats memory access as tool use: fine-tunes the LLM to generate text-based API calls (`MEM_WRITE`, `MEM_READ`) interleaved with normal generation.
Uses a structured database (triples of subject, relation, object) rather than raw text or vector pools, making the memory human-readable and editable.
Teaches the model to 'write' to memory while processing context and 'read' from memory before generating entities.
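
To make the "memory access as tool use" idea concrete, the sketch below pulls API calls out of generated text. The call syntax (`MEM_WRITE(subject>>relation>>object)` and `MEM_READ(subject>>relation>>)`) and the helper `parse_calls` are assumptions made for illustration; this summary does not specify the exact token format the model emits.

```python
import re
from typing import List, NamedTuple, Optional


class MemCall(NamedTuple):
    op: str                  # "MEM_WRITE" or "MEM_READ"
    subject: Optional[str]
    relation: Optional[str]
    obj: Optional[str]       # None marks an empty slot (e.g. what a read asks for)


# Hypothetical syntax: MEM_WRITE(subj>>rel>>obj) / MEM_READ(subj>>rel>>)
CALL_RE = re.compile(r"(MEM_WRITE|MEM_READ)\(([^()]*)\)")


def parse_calls(text: str) -> List[MemCall]:
    """Return every well-formed memory call found in the generated text."""
    calls = []
    for op, body in CALL_RE.findall(text):
        # Empty fields become None so a read query can leave a slot open.
        parts = [p.strip() or None for p in body.split(">>")]
        if len(parts) != 3:
            continue  # skip malformed calls instead of guessing
        calls.append(MemCall(op, *parts))
    return calls
```

A controller built this way would scan the model output after each generation step and dispatch any parsed call to the memory before continuing.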

Architecture

The schema of the structured memory and how entities and relations are linked.

Evaluation Highlights

Outperforms standard LLMs on language modeling perplexity (20.53 vs 21.65 for Llama-2-7b-chat) on Re-DocRED, with significant gains on named entities.
Achieves superior knowledge editing performance (Sustainability Score: 24.3 vs 19.5 for ROME) when handling sequential edits.
Uses memory effectively, improving named entity prediction accuracy by 13.5 points (46.2 to 59.7) over the no-memory baseline.

Breakthrough Assessment

7/10

Strong conceptual advance in making LLM memory interpretable and editable via structured APIs. Although the scope is limited to relation triples, it offers a distinct alternative to vector-only RAG and parametric editing.

⚙️ Technical Details

Problem Definition

Setting: Language modeling with integrated memory access

Inputs: Context text segment $S_{<i}$ and focus sentence $s_i$

Outputs: Next token prediction or API command sequence (MEM_WRITE or MEM_READ)

Pipeline Flow

Input Processing: Detect if memory interaction is needed
Write Phase: Extract relations from text → Generate MEM_WRITE API call → Update Triple Memory
Read Phase: Generate MEM_READ API call with queries → Retrieve candidates via vector similarity → Filter candidates → Inject into context
Generation: Produce next tokens using retrieved context
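
The flow above can be sketched as a single write-then-read pass. This is a toy under stated assumptions: `run_step` is a hypothetical helper, the `extract` callable stands in for the fine-tuned model's write step, exact-match lookup replaces vector retrieval, and the dictionary keeps only one object per (subject, relation) pair. The initial "is memory interaction needed" decision is not modeled.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def run_step(
    sentence: str,
    memory: Dict[Tuple[str, str], str],
    extract: Callable[[str], List[Triple]],
    queries: List[Tuple[str, str]],
) -> str:
    """One write-then-read pass over a focus sentence (toy control flow)."""
    # Write phase: store every triple the extractor finds in the sentence.
    for subj, rel, obj in extract(sentence):
        memory[(subj, rel)] = obj
    # Read phase: exact-match lookup stands in for similarity retrieval.
    hits = [(s, r, memory[(s, r)]) for s, r in queries if (s, r) in memory]
    # Injection: prepend retrieved triples to the text used for generation.
    injected = "; ".join(f"{s} | {r} | {o}" for s, r, o in hits)
    return f"[memory: {injected}] {sentence}" if hits else sentence
```

The returned string is what a generation step would condition on; when nothing is retrieved, the sentence passes through unchanged.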

System Modules

LLM Controller

Decides when to read/write and generates natural language or API commands

Model or implementation: Llama-2-7b-chat / Llama-3-8B-Instruct

Memory Store

Stores facts as structured triples linked to vector embeddings

Model or implementation: Custom SQL-like structure + Contriever embeddings
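
A minimal sketch of a SQL-backed triple store, using Python's built-in `sqlite3`. The table layout, the function names, and the primary key on (subject, relation) are assumptions for illustration: the key forces one object per pair, which keeps edits simple but would not fit multi-valued relations. The embedding link described above is omitted here.

```python
import sqlite3
from typing import Optional


def open_memory() -> sqlite3.Connection:
    """Create an in-memory database holding one table of triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples ("
        " subject TEXT NOT NULL,"
        " relation TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " PRIMARY KEY (subject, relation))"
    )
    return conn


def mem_write(conn: sqlite3.Connection, subject: str, relation: str, obj: str) -> None:
    # INSERT OR REPLACE overwrites the stored object: this models a fact edit.
    conn.execute(
        "INSERT OR REPLACE INTO triples VALUES (?, ?, ?)", (subject, relation, obj)
    )


def mem_read(conn: sqlite3.Connection, subject: str, relation: str) -> Optional[str]:
    """Return the stored object, or None if the fact is unknown."""
    row = conn.execute(
        "SELECT object FROM triples WHERE subject = ? AND relation = ?",
        (subject, relation),
    ).fetchone()
    return row[0] if row else None
```

Writing the same (subject, relation) pair twice leaves a single row holding the newer object, so an edit is one statement and the stored fact stays directly inspectable.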

Retriever

Finds relevant triples based on vector similarity to query subjects/relations

Model or implementation: Contriever (frozen)
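
The retrieval step can be illustrated with cosine similarity plus a threshold filter. In this sketch a character-trigram count vector is a toy stand-in for Contriever embeddings, only the subject is matched, and `embed`, `cosine`, `retrieve`, and the 0.3 threshold are all hypothetical choices rather than values from the paper.

```python
import math
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]


def embed(text: str) -> Counter:
    """Toy stand-in for a dense encoder: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query_subject: str, triples: List[Triple], threshold: float = 0.3) -> List[Triple]:
    """Rank stored triples by subject similarity and drop weak matches."""
    q = embed(query_subject)
    scored = [(cosine(q, embed(s)), (s, r, o)) for s, r, o in triples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for score, triple in scored if score >= threshold]
```

Similarity matching lets a query such as "Obama" reach a triple stored under "Barack Obama", which exact string lookup would miss.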

Novel Architectural Elements

Integration of structured SQL-like triple storage directly driven by LLM-generated API tokens
Dual-capability fine-tuning where the same model acts as both Information Extractor (Write) and RAG-user (Read)

Modeling

Base Model: Llama-2-7b-chat (primary experiments)

Training Method: Supervised Fine-Tuning (SFT) on custom API-interleaved dataset

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Minimize negative log-likelihood of next tokens, applied to API calls and post-retrieval text.
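
Written out, the objective described above is a masked next-token loss. The notation is an assumption introduced here: $x_t$ is the token at position $t$, $\theta$ denotes the trainable parameters, and $\mathcal{T}$ is the set of positions belonging to API calls and post-retrieval text.

```latex
\mathcal{L}(\theta) = - \sum_{t \in \mathcal{T}} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Positions outside $\mathcal{T}$ still serve as conditioning context but contribute no loss term.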

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA adapter weights only (rank 8, alpha 16, as listed under Key Hyperparameters)

Training Data:

Derived from Re-DocRED dataset
Write-data: Sentences paired with extracted triples
Read-data: Text augmented with MEM_READ calls inserted before target entities

Key Hyperparameters:

learning_rate: 3e-4
batch_size: 128 (micro-batch 4)
epochs: 3 (Write model), 3 (Read model)
lora_r: 8
lora_alpha: 16
max_seq_length: 2048
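
Two quantities follow directly from the values listed above: the number of gradient-accumulation steps and the LoRA scaling factor. The `TrainConfig` class below is a hypothetical container, and the accumulation figure assumes a single device with no data parallelism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 3e-4
    batch_size: int = 128
    micro_batch_size: int = 4
    epochs: int = 3
    lora_r: int = 8
    lora_alpha: int = 16
    max_seq_length: int = 2048

    @property
    def grad_accum_steps(self) -> int:
        # Micro-batches accumulated per optimizer step (single-device assumption).
        return self.batch_size // self.micro_batch_size

    @property
    def lora_scale(self) -> float:
        # LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r
```

With these defaults, `TrainConfig().grad_accum_steps` is 32 and `TrainConfig().lora_scale` is 2.0.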

Compute: Single NVIDIA A100 (40GB or 80GB) used for training

Comparison to Prior Work

vs. ROME/MEMIT: MemLLM uses external structured memory, avoiding weight modification degradation during sequential edits.
vs. RAG: MemLLM writes to its own memory during processing and uses structured triples rather than unstructured document chunks.
vs. ChatDB [not cited in paper]: ChatDB uses chain-of-memory for database manipulation tasks, whereas MemLLM integrates memory for general language modeling and knowledge persistence.

Limitations

Relies on the quality of the Contriever embeddings for retrieval similarity.
Memory write performance depends on the model's ability to extract relations correctly (Information Extraction).
Currently restricted to triples, which may not capture nuance as well as unstructured text.
Inference latency increases due to generation of API tokens and retrieval steps.

Reproducibility

Code: https://github.com/amodaresi/MemLLM

Code and training-data construction scripts are available at https://github.com/amodaresi/MemLLM. The Re-DocRED dataset is public, the base models are standard Llama-2 and Llama-3 variants, and Contriever provides the embeddings.

📊 Experiments & Results

Evaluation Setup

Language Modeling on Re-DocRED and Knowledge Editing tasks

Benchmarks:

Re-DocRED (Language Modeling / Relation Extraction)
CounterFact / ZsRE (Adapted) (Sequential Knowledge Editing)

Metrics:

Perplexity (PPL)
Accuracy (for named entities)
Editing Score
Sustainability Score
Statistical methodology: Not explicitly reported in the paper
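
Perplexity, the first metric listed above, is the exponential of the mean negative log-likelihood per token. The sketch below computes it from per-token natural-log probabilities; the function name and input format are assumptions, and the paper's evaluation code may aggregate differently (for example, over named-entity tokens only).

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the average negative log-probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every token has a perplexity of 4, so lower values mean the model is less "surprised" by the text.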

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Re-DocRED	Perplexity (Lower is better)	21.65	20.53	-1.12
Re-DocRED	Named Entity Accuracy	46.2	59.7	+13.5
Sequential Editing (Custom)	Sustainability Score (Ability to retain previous edits)	19.5	24.3	+4.8
Sequential Editing (Custom)	Editing Score	26.3	93.4	+67.1

Experiment Figures

Performance of knowledge editing methods (Editing Score and Sustainability) as the number of sequential edits increases.

Main Takeaways

Explicit memory prevents the 'catastrophic forgetting' observed in parametric model editing (like ROME/MEMIT) when performing many sequential edits.
Fine-tuning for API usage allows the model to autonomously decide when to store and retrieve information without external heuristic controllers.
Structured memory (triples) offers a balance between editability and retrievability, outperforming unstructured RAG in specific entity-centric tasks.

📚 Prerequisite Knowledge

Prerequisites

Language Model Fine-tuning
Knowledge Graphs (Triples)
Retrieval-Augmented Generation (RAG)
Vector Similarity Search

Key Terms

Contriever: A dense retrieval model used to generate vector embeddings for entities and relations to enable similarity search

Re-DocRED: A relation extraction dataset used here for training the model to recognize and utilize entity-relation triples

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

ROME: Rank-One Model Editing—a parametric knowledge editing method that modifies specific model weights to update facts

GRACE: A memory-based editing method that uses a codebook to store edits without modifying weights

Sustainability Score: A metric measuring how well a model maintains performance on previous edits when new edits are applied sequentially

Triple: A structured data format consisting of (Subject, Relation, Object), used as the atomic unit of memory in this system

Parametric memory: Knowledge stored implicitly in the neural network weights of the LLM itself

Explicit memory: Knowledge stored in a separate, accessible module (like a database) that the model interacts with