TRIME: Training with In-batch Memories

📝 Paper Summary

Memory recall, Modularized RAG pipeline

TRIME aligns language model representations with memory units during training by treating in-batch segments as accessible memory, enabling effective use of local, long-term, and external memories at inference.

Core Problem

Existing memory-augmented language models typically introduce memories only at test time (e.g., kNN-LM) or use separately trained encoders, resulting in suboptimal alignment between the LM and the memory representations.

Why it matters:

Current approaches miss the opportunity to optimize how the model interacts with memory during the training phase
Separate training leads to a disconnect where the query representation and memory keys are not aligned for the retrieval task
Standard attention scales quadratically with sequence length, which limits how much long-range context a model can use efficiently without an explicit memory structure

Concrete Example: In kNN-LM, the model is trained normally, and a datastore is only added during inference. If a rare word appears in the context, the model's internal representation might not be sharp enough to retrieve the correct instance from the external memory because it was never trained to perform that retrieval.

Key Novelty

TRIME (Training with In-batch Memories)

Utilizes a contrastive loss that aligns the current context's representation with both the target token embedding and positive memory examples from the same batch
Constructs training memories on-the-fly using specific batching strategies (consecutive segments for long-term memory; BM25-similar segments for external memory)
Allows back-propagation through the memory representations, ensuring the query and key representations are jointly optimized

Architecture

Illustration of the TRIME training objective and forward pass.

Evaluation Highlights

Reduces perplexity from 18.70 to 15.37 on WikiText-103 (247M parameter model) by leveraging external memory
Outperforms kNN-LM on WikiText-103 (perplexity 16.23 → 15.41) and kNN-MT on machine translation, showing better utilization of large datastores
Enables effective use of 15k-25k token contexts, outperforming specialized long-context architectures like Transformer-XL on WikiText-103

Breakthrough Assessment

8/10

Simple yet highly effective training paradigm that unifies local, long-term, and external memory augmentation. It consistently outperforms strong baselines like kNN-LM and Transformer-XL without architectural changes.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling where next-token probability is conditioned on context and a memory set

Inputs: Context sequence c_t = x_1, ..., x_{t-1}

Outputs: Probability distribution over vocabulary V for next token x_t

Pipeline Flow

Batch Construction (Standard, Consecutive, or Lexical-Overlap)
Forward Pass (Transformer Encoder)
Memory Construction (In-batch collection of (context, target) pairs)
Loss Computation (Hybrid of standard LM loss and memory-augmented contrastive loss)
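
The memory-construction step of the pipeline can be sketched as below. This is a minimal illustration, not the authors' implementation: the names `in_batch_memory` and `positives` are hypothetical, and the accessibility rule (earlier positions in the same segment plus all positions in other segments of the batch) is an assumption made for the example.

```python
from typing import List, Tuple

# A batch is a list of segments; each segment is a list of token ids.
Batch = List[List[int]]
Position = Tuple[int, int]  # (segment index, index of the target token)


def in_batch_memory(batch: Batch, query: Position) -> List[Position]:
    """Return the positions usable as memory for `query`.

    Assumption (illustrative): earlier targets in the query's own segment
    and every target in the other segments of the batch are accessible.
    """
    q_seg, q_pos = query
    memory = []
    for s, segment in enumerate(batch):
        for t in range(len(segment)):
            if s == q_seg and t >= q_pos:
                continue  # skip the query's own target and its future
            memory.append((s, t))
    return memory


def positives(batch: Batch, query: Position) -> List[Position]:
    """Memory entries whose target token equals the query's target token."""
    q_seg, q_pos = query
    target = batch[q_seg][q_pos]
    return [(s, t) for (s, t) in in_batch_memory(batch, query)
            if batch[s][t] == target]


batch = [[5, 7, 5, 9], [7, 5, 8, 8]]
print(positives(batch, (0, 2)))  # [(0, 0), (1, 1)]
```

The positives are the entries that the contrastive part of the loss pulls toward the query; every other memory entry acts as a negative.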

System Modules

Batching Strategy

Selects segments for the current batch to simulate specific memory types (e.g., consecutive segments for long-term memory)

Model or implementation: Heuristic (Consecutive or BM25)
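
A minimal sketch of the consecutive-segment heuristic, assuming a document is already tokenized into integer ids. The function names and the toy sizes are hypothetical; the only idea taken from the text is that neighbouring segments of the same document land in the same batch, so earlier segments can serve as long-term memory for later ones.

```python
from typing import List


def split_into_segments(tokens: List[int], segment_length: int) -> List[List[int]]:
    """Cut one document into non-overlapping segments of at most `segment_length` tokens."""
    return [tokens[i:i + segment_length]
            for i in range(0, len(tokens), segment_length)]


def consecutive_batches(segments: List[List[int]], batch_size: int) -> List[List[List[int]]]:
    """Group neighbouring segments so each batch covers one contiguous stretch of text."""
    return [segments[i:i + batch_size]
            for i in range(0, len(segments), batch_size)]


document = list(range(10))
segments = split_into_segments(document, segment_length=3)
batches = consecutive_batches(segments, batch_size=2)
print(batches)  # [[[0, 1, 2], [3, 4, 5]], [[6, 7, 8], [9]]]
```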

Transformer Encoder

Encodes context into hidden representations

Model or implementation: Standard Transformer (e.g., 247M parameters)

Memory Scorer

Computes similarity between current context and in-batch memory keys

Model or implementation: Scaled Dot-Product
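
For reference, scaled dot-product similarity divides the dot product by the square root of the vector dimension. The sketch below uses plain Python lists; the function name is hypothetical, and a real implementation would operate on batched tensors.

```python
import math
from typing import Sequence


def scaled_dot_product(query: Sequence[float], key: Sequence[float]) -> float:
    """sim(q, k) = (q . k) / sqrt(d), where d is the vector dimension."""
    if len(query) != len(key):
        raise ValueError("query and key must have the same dimension")
    d = len(query)
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(d)


# (1*1 + 3*1) / sqrt(4) = 2.0
print(scaled_dot_product([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0, 0.0]))
```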

Novel Architectural Elements

In-batch memory construction enabling backpropagation to memory representations (keys)
Unified training objective combining standard cross-entropy with memory-based retrieval probability

Modeling

Base Model: Transformer (247M parameters for WikiText-103)

Training Method: Joint training with TRIME objective

Objective Functions:

Purpose: Minimize negative log-likelihood of the next token using both vocabulary projection and memory retrieval.

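Formally: $P(w \mid c) \propto \exp(E_w^\top f(c)) + \sum_{(c_j, x_j) \in M:\, x_j = w} \exp(\mathrm{sim}(g(c), g(c_j)))$

Here $E_w$ is the output embedding of token $w$, $f(c)$ is the model's output representation of context $c$, $M$ is the memory set of (context, target) pairs, $g(\cdot)$ is the representation used for memory queries and keys, and $\mathrm{sim}$ is the scaled dot-product used by the Memory Scorer.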
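
The objective above can be made concrete with a toy computation. This is a sketch under stated assumptions, not the released code: the function names are hypothetical, vectors are plain lists, every memory target is assumed to be in the vocabulary, and exponentials are computed directly, whereas a practical implementation would work in log space for numerical stability.

```python
import math
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def trime_distribution(
    f_c: Sequence[float],
    g_c: Sequence[float],
    embeddings: Dict[str, Sequence[float]],
    memory: List[Tuple[Sequence[float], str]],
) -> Dict[str, float]:
    """Normalize exp(E_w . f(c)) plus the memory terms whose target is w."""
    d = len(g_c)
    # Vocabulary term: one score per token from its output embedding.
    scores = {w: math.exp(dot(e_w, f_c)) for w, e_w in embeddings.items()}
    # Memory term: each entry (g(c_j), x_j) adds mass to its target token x_j.
    for g_cj, x_j in memory:
        scores[x_j] += math.exp(dot(g_c, g_cj) / math.sqrt(d))
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


def trime_loss(f_c, g_c, embeddings, memory, target: str) -> float:
    """Negative log-likelihood of the target under the combined distribution."""
    return -math.log(trime_distribution(f_c, g_c, embeddings, memory)[target])


embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
f_c, g_c = [0.5, 0.5], [1.0, 1.0]
memory = [([1.0, 1.0], "cat")]  # one memory entry whose target is "cat"

without_memory = trime_distribution(f_c, g_c, embeddings, [])
with_memory = trime_distribution(f_c, g_c, embeddings, memory)
print(without_memory["cat"], with_memory["cat"])
```

In this toy case the vocabulary term alone is indifferent between the two tokens, and the single matching memory entry shifts probability toward "cat", which is the behaviour the training objective rewards when the memory key is close to the query.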

Key Hyperparameters:

segment_length_L: 3072 (WikiText-103 large), 150 (WikiText-103 small), 512 (EnWik8)
batch_size_B: Not explicitly reported in the paper
probability_p: 0.9 (probability of excluding local memory from training memory when training for external memory)
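
The role of `probability_p` can be illustrated with a small sampling sketch. The function and argument names are hypothetical, and the rule shown (drop the local entries with probability p, otherwise keep both sets) is one reading of the description above, presumably intended to push the model to rely on the other in-batch segments.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def build_training_memory(
    local: Sequence[T],
    in_batch: Sequence[T],
    p_drop_local: float = 0.9,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """With probability `p_drop_local`, train without the local memory entries."""
    rng = rng or random.Random()
    if rng.random() < p_drop_local:
        return list(in_batch)
    return list(local) + list(in_batch)


rng = random.Random(0)
draws = [build_training_memory(["loc"], ["ext"], rng=rng) for _ in range(1000)]
dropped = sum(1 for m in draws if "loc" not in m)
print(dropped / len(draws))  # fraction of steps without local memory, near 0.9
```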

Compute: Single NVIDIA RTX 3090 GPU for inference speed tests; Training compute not explicitly detailed

Comparison to Prior Work

vs. kNN-LM: TRIME trains with a memory objective, whereas kNN-LM only interpolates with retrieval at inference
vs. Transformer-XL: TRIME requires no architectural changes, whereas Transformer-XL adds segment-level recurrence
vs. Continuous Cache: TRIME aligns representations during training, whereas the continuous cache is applied ad hoc at test time
vs. LaMemo: TRIME builds memories in-batch for efficient training, whereas LaMemo maintains a separate memory queue
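
To make the contrast with kNN-LM concrete, the inference-only interpolation it relies on can be sketched as follows. The function names, the toy distances, and the value of `lam` are hypothetical; the form shown (a softmax over negative neighbour distances, mixed linearly with the LM distribution) is the usual description of kNN-LM, and no training signal flows through it.

```python
import math
from typing import Dict, Iterable, List, Tuple


def knn_distribution(neighbors: List[Tuple[float, str]],
                     vocab: Iterable[str]) -> Dict[str, float]:
    """Aggregate exp(-distance) over retrieved neighbours that share a target token."""
    weights = {w: 0.0 for w in vocab}
    for distance, token in neighbors:
        weights[token] += math.exp(-distance)
    z = sum(weights.values())  # assumes at least one neighbour was retrieved
    return {w: v / z for w, v in weights.items()}


def interpolate(p_lm: Dict[str, float], p_knn: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w), applied only at test time."""
    return {w: lam * p_knn[w] + (1.0 - lam) * p_lm[w] for w in p_lm}


p_lm = {"cat": 0.5, "dog": 0.5}
neighbors = [(0.0, "cat"), (0.0, "cat"), (0.0, "dog")]
p_knn = knn_distribution(neighbors, p_lm)
p = interpolate(p_lm, p_knn, lam=0.25)
print(p)
```

TRIME instead places the vocabulary and memory terms inside one normalized distribution and trains through both, so no separate interpolation weight is applied on top of a model that never saw the memory.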

Limitations

Inference efficiency drops significantly (10x slower) when using large external memory due to nearest neighbor search
Evaluated primarily on Transformer-based models up to 247M parameters; scalability to multi-billion parameter models is untested
Machine translation evaluation limited to small IWSLT'14 dataset

Reproducibility

Code: https://github.com/princeton-nlp/TRIME

Code and pre-trained models are available at the repository linked above. The implementation uses the Fairseq library, and large external memories require FAISS for approximate nearest-neighbor search.

📊 Experiments & Results

Evaluation Setup

Language modeling (perplexity/BPC) and Machine Translation (BLEU)

Benchmarks:

WikiText-103 (Word-level Language Modeling)
EnWik8 (Character-level Language Modeling)
BooksCorpus (Domain Adaptation, Language Modeling)
IWSLT'14 De-En (Machine Translation)

Metrics:

Perplexity (PPL)
Bits Per Character (BPC)
BLEU score
Statistical methodology: Not explicitly reported in the paper
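
As a reminder of how the two language-modeling metrics relate to token-level likelihoods, here is a short sketch. The function names are hypothetical, and the inputs are assumed to be natural-log probabilities that the model assigned to each reference token or character.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def bits_per_character(char_log_probs: Sequence[float]) -> float:
    """Average negative log-likelihood per character, converted from nats to bits."""
    return -sum(char_log_probs) / (len(char_log_probs) * math.log(2))


# A model that assigns probability 1/4 to every symbol:
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))          # approximately 4.0
print(bits_per_character(uniform))  # approximately 2.0
```

Lower is better for both metrics, so the negative Δ values in the results table indicate improvements.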

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
WikiText-103 (247M) results showing TRIME outperforms vanilla models and kNN-LM baselines.
WikiText-103	Perplexity	18.70	17.76	-0.94
WikiText-103	Perplexity	18.26	17.76	-0.50
WikiText-103	Perplexity	16.23	15.41	-0.82
Long-context capability results on EnWik8.
EnWik8	Bits Per Character	1.06	1.05	-0.01
EnWik8	Bits Per Character	1.16	1.12	-0.04
Machine Translation results showing generality to generation tasks.
IWSLT'14 De-En	BLEU	33.15	33.73	+0.58

Experiment Figures

Perplexity/BPC vs. Long-term memory size for TRIME vs. Continuous Cache.

Main Takeaways

Explicitly training with in-batch memories consistently improves perplexity across multiple benchmarks compared to inference-only memory methods
Batching consecutive segments allows standard Transformers to utilize very long contexts (15k+ tokens) effectively, rivaling specialized architectures like Transformer-XL
Lexical-overlap batching (BM25) successfully simulates external memory retrieval during training, improving performance when using large datastores at test time
The approach generalizes to domain adaptation (BooksCorpus) and other generation tasks (Machine Translation)

📚 Prerequisite Knowledge

Prerequisites

Autoregressive Language Modeling
Transformer Architecture
Contrastive Learning / Noise Contrastive Estimation
k-Nearest Neighbor (kNN) search

Key Terms

TRIME: Training with In-batch Memories—the proposed method that uses in-batch examples as dynamic memory during training

kNN-LM: k-Nearest Neighbor Language Model—a baseline that linearly interpolates LM predictions with retrieval from a datastore at test time

local memory: Tokens appearing in the immediate recent past (current segment)

long-term memory: Tokens from previous segments of the same document, usually inaccessible to standard attention due to length limits

external memory: A large collection of context-target pairs from the entire training corpus or a domain-specific corpus

BM25: Best Matching 25—a ranking function used to estimate the relevance of documents to a given search query

FAISS: Facebook AI Similarity Search—a library for efficient similarity search and clustering of dense vectors

perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance

contrastive loss: A loss function that pulls positive pairs (matching context-target) together and pushes negative pairs apart in vector space

continuous cache: A mechanism storing hidden states of recent history to assist prediction via dot-product similarity
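
Since BM25 drives the lexical-overlap batching, a compact sketch of the scoring function may help. This is a generic Okapi BM25 variant with a non-negative IDF, not code from the TRIME repository: the function name is hypothetical, `k1=1.5` and `b=0.75` are common defaults rather than values from the paper, and repeated query terms are counted once for simplicity.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document in `docs` against `query` (assumes non-empty docs)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [["the", "cat", "sat"],
        ["dogs", "bark", "loudly"],
        ["a", "cat", "and", "a", "dog"]]
scores = bm25_scores(["cat", "sat"], docs)
print(scores)
```

In a batching setup along these lines, segments that score highly against each other would be packed into the same batch so that they can act as stand-ins for retrieved external memories during training.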