University of California, Santa Barbara; Microsoft Research
NeurIPS (2023)
Memory, Benchmark
📝 Paper Summary
Memory recall, Long-context modeling
LongMem augments a frozen LLM with a decoupled residual side-network that retrieves and fuses long-term context from a cached memory bank to solve memory staleness.
Core Problem
Existing methods for extending LLM context length, like Memorizing Transformer, use a coupled memory design where cached representations become stale as the model parameters update during training.
Why it matters:
Standard LLMs are limited by fixed-size input contexts, preventing the use of rich long-term history or knowledge.
Scaling input length via dense attention is computationally prohibitive due to quadratic complexity.
Memory staleness in coupled architectures limits the effectiveness of memory augmentation because old cached keys/values drift distributionally from current model states.
Concrete Example: In a coupled design like MemTRM, the keys stored in memory at earlier training steps were produced by an older version of the model. Once the weights have been updated, the current model's queries no longer match the distribution of those cached keys, and this staleness degrades both retrieval accuracy and fusion quality.
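The staleness effect can be illustrated with a deliberately exaggerated toy example. The sketch below is not from the paper: the two-dimensional vectors, the `w_old` and `w_new` matrices, and the helper names are all hypothetical, and a 90-degree rotation stands in for the much more gradual drift that real training causes.

```python
def matvec(w, x):
    """Multiply a small matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def retrieve(query, keys):
    """Return the index of the cached key with the highest dot-product score."""
    scores = [dot(query, k) for k in keys]
    return max(range(len(keys)), key=scores.__getitem__)

# Two past segments, encoded into keys by the OLD model weights and then cached.
segments = [[1.0, 0.0], [0.0, 1.0]]
w_old = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical encoder at caching time
w_new = [[0.0, -1.0], [1.0, 0.0]]   # hypothetical encoder after updates (rotated)
cached_keys = [matvec(w_old, x) for x in segments]

# Coupled design: the query comes from the updated weights, the keys are stale.
stale_hit = retrieve(matvec(w_new, segments[0]), cached_keys)

# Decoupled design: the encoder is frozen, so queries and keys stay compatible.
frozen_hit = retrieve(matvec(w_old, segments[0]), cached_keys)

print(stale_hit, frozen_hit)
```

Querying for segment 0 returns segment 1 when the encoder has drifted, and segment 0 when it has stayed frozen.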
Key Novelty
Decoupled Side-Network for Memory Retrieval (SideNet)
Keeps the backbone LLM frozen as a dedicated memory encoder, ensuring cached memory representations (keys/values) remain stable and compatible throughout training.
Introduces a lightweight, trainable residual SideNet that retrieves relevant past contexts from the memory bank and fuses them with current inputs.
Uses cross-network residual connections to transfer pretrained knowledge from the frozen backbone to the SideNet, enabling efficient adaptation without catastrophic forgetting.
Architecture
The LongMem architecture consisting of the frozen backbone LLM, the cached memory bank, and the residual SideNet.
Evaluation Highlights
Achieves state-of-the-art 40.5% identification accuracy on the challenging ChapterBreak benchmark, significantly surpassing existing x-former baselines.
Reduces long-context language modeling perplexity by 1.38 to 1.62 points across the length splits of the Gutenberg-2022 corpus, relative to baselines.
Demonstrates strong in-context learning with 2k demonstration examples in memory, outperforming MemTRM and standard LLMs on NLU tasks.
Breakthrough Assessment
8/10
Effective architectural solution to the memory staleness problem. The decoupled design lets a frozen LLM be adapted efficiently to contexts far longer than its input window (a 65k-token memory in the experiments), with strong empirical gains on long-context tasks.
⚙️ Technical Details
Problem Definition
Setting: Autoregressive language modeling with access to long-term past context via an external memory bank.
Inputs: Current fixed-size input segment {x_i} and a memory bank containing key-value pairs of previous segments.
Outputs: Predicted next token probability P(x_i | x_1...x_{i-1}, Memory).
Pipeline Flow
Frozen Backbone LLM (encodes current input & populates memory)
Memory Bank (stores past key-value pairs)
SideNet (retrieves memory & fuses with current input)
Output Head (generates token probabilities)
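The ordering of these stages can be sketched as a segment loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `backbone_encode`, `sidenet_fuse`, and `run_pipeline` are hypothetical stand-ins, the "hidden states" are toy numbers, and the output head is omitted. The point it shows is that each segment reads a memory built from previous segments only, and its own key-value pairs are cached afterwards.

```python
from collections import deque

def backbone_encode(segment):
    """Stand-in for the frozen backbone: returns toy hidden states and key-value pairs."""
    hidden = [float(len(tok)) for tok in segment]
    kv_pairs = list(zip(segment, hidden))
    return hidden, kv_pairs

def sidenet_fuse(hidden, memory):
    """Stand-in for the SideNet: only reports how much context it could see."""
    return {"local_tokens": len(hidden), "memory_tokens": len(memory)}

def run_pipeline(segments, memory_size):
    memory = deque(maxlen=memory_size)   # FIFO: oldest pairs are evicted first
    outputs = []
    for segment in segments:
        hidden, kv_pairs = backbone_encode(segment)
        outputs.append(sidenet_fuse(hidden, memory))  # sees previous segments only
        memory.extend(kv_pairs)                       # current segment cached afterwards
    return outputs

trace = run_pipeline([["a", "b"], ["c", "d", "e"], ["f"]], memory_size=4)
for step in trace:
    print(step)
```

With a memory capped at 4 pairs, the third segment sees a full memory because the oldest pair from the first segment has already been evicted.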
System Modules
Frozen Backbone LLM
Encodes input text into hidden states and key-value pairs. Acts as a stable memory encoder.
Model or implementation: Pretrained Transformer (e.g., GPT-2 style)
Cache Memory Bank
Stores key-value pairs from the backbone's self-attention layers for previous input segments.
Model or implementation: FIFO Queue / Vector Database
SideNet (Retrieval & Fusion)
Retrieves relevant memory chunks and fuses them with current context to predict tokens.
Model or implementation: Transformer Decoder (L' = L/2 layers, i.e., half the depth of the L-layer backbone)
Token-to-Chunk Retriever (Retrieval & Fusion)
Retrieves top-K relevant key-value chunks from memory based on current token query.
Model or implementation: Dense Retrieval (Dot product)
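The memory bank and the token-to-chunk retriever can be sketched together. This is a rough illustration, assuming a FIFO queue of fixed-size chunks scored by dot product. The `ChunkMemory` class, its method names, and the use of a mean-pooled key as each chunk's index vector are assumptions made for the example, not details confirmed by this summary.

```python
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

class ChunkMemory:
    """FIFO bank of key-value chunks (hypothetical helper for illustration)."""

    def __init__(self, max_chunks):
        self.chunks = deque(maxlen=max_chunks)  # oldest chunk is dropped when full

    def add_segment(self, keys, values, chunk_size):
        # Split a segment's key-value pairs into consecutive chunks.
        for start in range(0, len(keys), chunk_size):
            k = keys[start:start + chunk_size]
            v = values[start:start + chunk_size]
            # Assumption: each chunk is indexed by the mean of its keys.
            self.chunks.append({"index_key": mean_vector(k), "keys": k, "values": v})

    def retrieve(self, query, top_k):
        # Rank chunks by dot product between the token query and the chunk index key.
        ranked = sorted(self.chunks, key=lambda c: dot(query, c["index_key"]), reverse=True)
        return ranked[:top_k]

memory = ChunkMemory(max_chunks=3)
keys = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8], [-1.0, 0.0], [-0.8, -0.2]]
values = ["v0", "v1", "v2", "v3", "v4", "v5"]
memory.add_segment(keys, values, chunk_size=2)

hits = [c["values"] for c in memory.retrieve([1.0, 0.0], top_k=2)]
print(hits)
```

Retrieving whole chunks returns neighboring tokens together, so a single query token pulls in a contiguous span of past context rather than isolated key-value pairs.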
Novel Architectural Elements
Decoupled memory design: Frozen backbone encoder vs. Trainable SideNet retriever/reader.
Cross-network residual connections: The difference between the backbone's hidden states at layers 2l and 2l-2 is added to the output of SideNet layer l.
Token-to-Chunk retrieval strategy for attention memory.
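The cross-network residual connection can be sketched on plain vectors. The example assumes a layer reduction factor of 2 and assumes the SideNet starts from the backbone's embedding-layer output. `sidenet_forward` and the toy states are hypothetical, and the SideNet layers are replaced by identity functions so that only the residual path is visible.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def sidenet_forward(backbone_states, side_layers):
    """backbone_states[j] is the frozen backbone's hidden state after layer j
    (index 0 = embeddings); side_layers is a list of callables, one per SideNet layer."""
    if len(backbone_states) != 2 * len(side_layers) + 1:
        raise ValueError("expected 2 backbone layers per SideNet layer")
    h_side = backbone_states[0]  # assumption: SideNet input is the embedding output
    for l, layer in enumerate(side_layers, start=1):
        # Difference between backbone layers 2l and 2l-2, added to SideNet layer l.
        residual = sub(backbone_states[2 * l], backbone_states[2 * l - 2])
        h_side = add(layer(h_side), residual)
    return h_side

identity = lambda h: h
backbone_states = [[0.0, 1.0], [1.0, 1.0], [2.0, 3.0], [2.5, 3.0], [4.0, 5.0]]
result = sidenet_forward(backbone_states, [identity, identity])
print(result)
```

With identity SideNet layers the residuals telescope, so the output equals the backbone's final hidden state. This suggests why the connection helps knowledge transfer: the SideNet only has to learn a correction on top of what the frozen backbone already computes.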
Modeling
Base Model: GPT-2 architecture (backbone), SideNet (Transformer decoder)
Training Method: Memory-augmented adaptation training (Standard Language Modeling objective)
Objective Functions:
Purpose: Maximize likelihood of next token given local and memory context.
Formally: max Σ_i log P(x_i | x_1...x_{i-1}, Memory)
Adaptation: SideNet weights are updated; Backbone LLM is frozen.
Trainable Parameters: SideNet parameters (initialized from backbone)
Key Hyperparameters:
layer_reduction_factor: 2 (SideNet has L' = L/2 layers)
memory_size: 65k tokens
chunk_size: Empirically adjusted (e.g., small for in-context learning labels)
Compute: Not explicitly reported in the paper
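The objective and its link to the perplexity metric can be worked through numerically. The sketch below uses made-up per-token probabilities purely for illustration; they are not results from the paper, and the function names are hypothetical.

```python
import math

def sequence_log_likelihood(token_probs):
    """Sum of log P(x_i | x_1...x_{i-1}, Memory) over all positions."""
    return sum(math.log(p) for p in token_probs)

def perplexity(token_probs):
    """Exponential of the average negative log-likelihood per token."""
    return math.exp(-sequence_log_likelihood(token_probs) / len(token_probs))

# Hypothetical probabilities assigned to the correct next token at four positions.
probs_local_only = [0.25, 0.25, 0.25, 0.25]
probs_with_memory = [0.5, 0.5, 0.5, 0.5]

ppl_local = perplexity(probs_local_only)
ppl_memory = perplexity(probs_with_memory)
print(round(ppl_local, 6), round(ppl_memory, 6))
```

Maximizing the summed log-likelihood is equivalent to minimizing perplexity, which is why a model that assigns higher probability to the correct tokens when memory is available shows a lower perplexity.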
Comparison to Prior Work
vs. MemTRM: LongMem decouples the memory encoder (frozen backbone) from the fusion network (SideNet), solving memory staleness.
vs. Sparse Attention: LongMem applies dense attention over retrieved memory chunks rather than fixed sparse patterns, which lets it handle arbitrarily long contexts.
Limitations
Requires a pretrained backbone LLM; cannot easily be applied to models where access to intermediate layers is restricted.
Inference cost is increased due to the additional SideNet forward pass and memory retrieval overhead.
The memory bank size is limited by available system memory/VRAM (though scalable to 65k tokens in experiments).
Statistical methodology: Not explicitly reported in the paper
Key Results
Benchmark | Metric | Baseline | This Paper | Δ
ChapterBreak | Identification Accuracy (%) | 26.3 | 40.5 | +14.2
Gutenberg-2022 | Perplexity | Not reported in the paper | Not reported in the paper | Not reported in the paper
Main Takeaways
LongMem consistently improves perplexity on long-text corpora (Gutenberg) compared to baselines.
The decoupled memory design effectively mitigates memory staleness, leading to large gains in tasks requiring retrieval of distant information (ChapterBreak).
Large memory banks (65k tokens) allow for caching thousands of demonstration examples, significantly boosting in-context learning performance.
Key Terms
SideNet: A lightweight residual network trained to retrieve and fuse memory while the main LLM backbone remains frozen.
Memory Staleness: The issue where cached memory representations become incompatible with the current model state due to parameter updates during training.
Token-to-Chunk Retrieval: A retrieval strategy in which each token query retrieves blocks (chunks) of consecutive tokens rather than individual tokens, which speeds up retrieval and keeps the retrieved context contiguous.
Cross-Network Residual Connections: Connections adding hidden state differences from the frozen backbone to the SideNet to facilitate knowledge transfer.
ChapterBreak: A long-context modeling benchmark that tests a model's ability to identify the correct next chapter segment given a long prefix.
MemTRM: Memorizing Transformer, a baseline model that extends context via a kNN lookup into a database of past key-value pairs.