Beyond Memorization: The Challenge of Random Memory Access in Language Models

📝 Paper Summary

Memory organization, Knowledge internalization

Language models struggle to randomly access specific segments of memorized text despite having memorized the full content, but explicit recitation of the content prior to answering mitigates this failure.

Core Problem

While Language Models (LMs) can effectively memorize and reproduce long passages sequentially, they fail to access specific information located in the middle of these memorized passages when prompted with a unique identifier.

Why it matters:

Treating LMs as knowledge bases requires reliable retrieval of specific facts, not just full rote memorization
The current inability to perform random access limits the utility of LMs in grounded question answering, where precise extraction is needed
Understanding this limitation sheds light on the fundamental mechanisms of how transformers store and index information in their parameters

Concrete Example: A model memorizes a passage about Chopin associated with ID '#3022'. If asked to recite the whole passage given '#3022', it succeeds. However, if asked 'According to Document #3022, in what year did Chopin become a French citizen?' (which requires extracting a fact from the middle of the passage), the model fails to output the correct year, behaving as if it had never memorized the text.

Key Novelty

Distinction between Sequential and Random Parametric Memory Access

Formalizes the difference between 'Sequential Access' (generating from the start token) and 'Random Access' (generating from an arbitrary mid-point) in LMs trained on key-value pairs
Demonstrates that simply permuting sentence order during training helps the model unlearn the rigid sequential dependency, improving random access
Proposes 'Recitation' at inference time: forcing the model to output the full memorized passage before answering a specific question allows it to bypass the random access bottleneck

Architecture

Conceptual illustration of Sequential vs. Random Memory Access in LMs

Evaluation Highlights

In selective recitation, models achieve near-perfect sequential access (97.3 BLEU when context is provided) but drop to ~47 BLEU when relying on parametric memory with random access.
In grounded QA, reciting the memorized passage before answering improves Exact Match (EM) scores from ~2.2% to ~28.6% on SQuAD-v1 passages.
Permuting sentences during training improves random access accuracy (Exact Match) from 0.0 to 52.5 on synthetic recitation tasks.

Breakthrough Assessment

7/10

Provides a fundamental insight into the limitations of transformer memory (sequential bias). While the solution (recitation) increases compute, the diagnosis of the 'random access' failure mode is a significant contribution to understanding LM internals.

⚙️ Technical Details

Problem Definition

Setting: Language Model acting as a Key-Value Memory Store

Inputs: A unique identifier k_i (key) and a query (e.g., 'recite sentence j' or a natural language question)

Outputs: Target content p_i (value) or specific span within p_i

Pipeline Flow

Prompt Construction (ID + Query)
Language Model (Fine-tuned as Memory Bank)
Generation (Recitation or Direct Answer)

System Modules

Prompt Construction

Formats the input query with the unique identifier (e.g., 'According to Document #123...')

Model or implementation: Deterministic formatting
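
A minimal sketch of what deterministic prompt formatting could look like for the three query types described in this summary. Only the 'According to Document #...' prefix comes from the source; the two recitation templates and all function names are assumptions made for illustration, not the paper's actual templates (those are in its Appendix).

```python
def full_recitation_prompt(doc_id: str) -> str:
    """Sequential access: ask for the whole passage stored under doc_id."""
    # Hypothetical template; the paper's exact wording may differ.
    return f"Recite Document #{doc_id}:"


def selective_recitation_prompt(doc_id: str, sentence_index: int) -> str:
    """Random access: ask for a single sentence j of the memorized passage."""
    # Hypothetical template; the paper's exact wording may differ.
    return f"Recite sentence {sentence_index} of Document #{doc_id}:"


def grounded_qa_prompt(doc_id: str, question: str) -> str:
    """Random access via a natural-language question tied to doc_id."""
    # Lowercase the first letter so the question reads naturally after the prefix.
    body = question[0].lower() + question[1:] if question else question
    return f"According to Document #{doc_id}, {body}"


if __name__ == "__main__":
    print(full_recitation_prompt("3022"))
    print(selective_recitation_prompt("3022", 4))
    print(grounded_qa_prompt("3022", "In what year did Chopin become a French citizen?"))
```

Because the formatting is a pure function of the ID and the query, the same prompt can be reproduced exactly at training and inference time.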

Language Model

Stores passage content in parameters and attempts to retrieve it based on ID

Model or implementation: GPT2-large (774M parameters)

Novel Architectural Elements

Intervention pipeline: 'Recitation' step where the model output is forced to include the full passage before the answer, effectively moving memory from parameters to context window
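
The recitation step can be sketched as follows. This is an illustration under stated assumptions: `generate` is a hypothetical callable mapping a prompt to the model's continuation, the prompt templates are invented for the example, and the two-call structure is one possible way to realize "recite, then answer" rather than the paper's implementation.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a continuation.
Generate = Callable[[str], str]


def answer_directly(generate: Generate, doc_id: str, question: str) -> str:
    """Baseline: answer from parametric memory with no recitation."""
    prompt = f"According to Document #{doc_id}, {question}\nAnswer:"
    return generate(prompt).strip()


def answer_with_recitation(generate: Generate, doc_id: str, question: str) -> str:
    """Recite the memorized passage first, then answer with it in context."""
    # Step 1: sequential access, which the model handles well.
    passage = generate(f"Recite Document #{doc_id}:").strip()
    # Step 2: the recited passage is now part of the prompt, so answering
    # becomes reading comprehension over the context window.
    prompt = f"Document #{doc_id}: {passage}\nQuestion: {question}\nAnswer:"
    return generate(prompt).strip()
```

The cost of this design is visible in the code: the second variant generates the whole passage before the answer, which is why recitation increases inference cost and latency.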

Modeling

Base Model: GPT2-large (774M parameters)

Training Method: Full fine-tuning on synthetic Key-Value pairs

Objective Functions:

Purpose: Minimize negative log-likelihood of the target sequence given the prompt.

Formally: Standard language modeling loss.
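
Written out, the standard language modeling loss over a single training instance takes the form below. The notation is assumed here for illustration: $x$ is the prompt (the identifier plus any query), $y = (y_1, \dots, y_{|y|})$ is the target sequence (a passage or an answer), and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log P_\theta\left(y_t \mid x,\, y_{<t}\right)
```

Each target token is conditioned on the prompt and on all earlier target tokens, which is one way to see why recall that starts at the first token is the access pattern training most directly supports.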

Adaptation: Full fine-tuning

Training Data:

Corpus split into T training and V validation passages
Mixed training strategy: Includes Write instances (ID -> Passage) for both T and V, and Read instances (ID -> Answer) for T only
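
The mixed training strategy above can be sketched as a small data-building routine. The `Passage` fields, the prompt templates, and the function name are hypothetical; only the split logic (Write instances for both T and V, Read instances for T only) is taken from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str
    question: str  # one question about the passage (hypothetical field)
    answer: str


def build_training_set(train: list[Passage], val: list[Passage]) -> list[tuple[str, str]]:
    """Return (prompt, target) pairs following the mixed strategy."""
    examples = []
    # Write instances (ID -> Passage): every passage from T and V is memorized.
    for p in train + val:
        examples.append((f"Document #{p.doc_id}:", p.text))
    # Read instances (ID -> Answer): only T passages receive this supervision,
    # so V passages probe access to content that was memorized but never queried.
    for p in train:
        examples.append((f"According to Document #{p.doc_id}, {p.question}", p.answer))
    return examples
```

Evaluation then asks Read-style questions about V passages, whose content the model has seen only through Write instances.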

Key Hyperparameters:

learning_rate: 3e-5
epochs: 100
batch_size: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: This paper focuses on *parametric* memory (no external index during inference) to test internal storage mechanisms [not cited in paper]
vs. Mallick et al. (2023): Validates their finding of positional dependency but extends it to 'random access' capability via ID-based retrieval
vs. Zhu and Li (2023): Adopts their mixed training strategy but focuses on access patterns (sequential vs. random) rather than memorization capacity

Limitations

The study is limited to GPT-2 (decoder-only) models; it is unclear whether the findings hold for much larger or instruction-tuned models (e.g., GPT-4, Llama-3)
Recitation increases inference cost and latency significantly as the model must generate the full passage
Experiments primarily use synthetic or semi-synthetic setups (SQuAD paragraphs treated as isolated memories)
Permutation method improves random access but may degrade coherence for tasks requiring sequential reasoning

Reproducibility

Code: https://github.com/sail-sg/lm-random-memory-access

The experiments start from the pre-trained GPT2-large checkpoint. Detailed prompt templates are provided in the Appendix.

📊 Experiments & Results

Evaluation Setup

Synthetic memory tasks where the model must memorize Key-Value pairs (ID -> Passage) and retrieve information.

Benchmarks:

Full Recitation (Verbatim reproduction of memorized text) [New]
Selective Recitation (Extracting specific sentence j from memorized passage i) [New]
Grounded QA (SQuAD-v1 based) (Answering questions using memorized passages) [New]

Metrics:

BLEU score
Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
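
For reference, Exact Match and token-level F1 are often computed along the following lines in SQuAD-style evaluation. The normalization shown (lowercasing, stripping punctuation and English articles) is the common convention and is an assumption here; the summary does not state which normalization the paper applies.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Corpus-level scores are the mean of these per-example values, typically reported as percentages.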

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Synthetic Dataset (Title IDs)	BLEU	0	95.6	+95.6
Synthetic Dataset (Title IDs)	Exact Match (EM)	97.0	34.5	-62.5
Grounded Question Answering results demonstrating the failure of direct random access and the success of recitation.
SQuAD-v1 (Memorized)	Exact Match (EM)	1.3	2.2	+0.9
SQuAD-v1 (Memorized)	Exact Match (EM)	2.2	28.6	+26.4
Synthetic Recitation	Exact Match (EM)	0.0	52.5	+52.5

Experiment Figures

Impact of corpus size on Sequential Memory Access performance (Exact Match)

Performance of Selective Recitation by sentence index

Main Takeaways

Language models exhibit a strong sequential bias; they can recite a whole passage from the start but struggle to access the middle directly.
Providing a unique ID for a memorized passage does not grant the model random access capabilities akin to a database key.
Recitation (generating the passage into context) bypasses the parametric random access bottleneck, converting the problem from memory retrieval to context processing.
Permuting training data helps break sequential dependencies, allowing the model to learn position-independent access to some degree.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer decoder architecture (autoregressive generation)
Familiarity with fine-tuning LMs for knowledge injection
Basic concepts of memory access (sequential vs. random) in computer systems

Key Terms

Sequential Memory Access: Starting recall from the beginning of a memorized sequence and progressing through content in consecutive order (like reciting a poem from the start)

Random Memory Access: Initiating recall from any chosen location within memorized content without needing to generate the preceding tokens first

Parametric Memory: Knowledge stored within the model's weights (learned during pre-training or fine-tuning) rather than provided in the input context

SQuAD: Stanford Question Answering Dataset—a benchmark dataset for reading comprehension where answers are spans of text from provided passages

BLEU: Bilingual Evaluation Understudy—a metric for evaluating text quality by counting matching n-grams between candidate and reference text

Exact Match (EM): A metric measuring the percentage of predictions that match the ground truth answer exactly word-for-word

Recitation: A prompting strategy where the model is asked to reproduce the full memorized passage before generating the answer to a specific question

Permutation: A data augmentation technique during training where sentences in a passage are shuffled to break sequential dependencies and encourage random access learning
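
A minimal sketch of sentence permutation as a data augmentation step. It assumes the passage is already split into sentences and that each passage is written several times under the same ID in different orders; the number of copies, the seed, the prompt template, and the function names are illustrative choices, not values from the paper.

```python
import random


def permute_sentences(sentences: list[str], rng: random.Random) -> list[str]:
    """Return a shuffled copy of the passage's sentences."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def permuted_write_instances(
    doc_id: str, sentences: list[str], num_copies: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Build several ID -> Passage targets, each with its own sentence order."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    prompt = f"Document #{doc_id}:"  # hypothetical template
    return [
        (prompt, " ".join(permute_sentences(sentences, rng)))
        for _ in range(num_copies)
    ]
```

Since every sentence can appear at any position across the copies, no single sentence is always preceded by the same context, which is the dependency the permutation is meant to break.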