An Evolved Universal Transformer Memory

📝 Paper Summary

Memory organization, KV Cache Management, Efficient Transformers

Neural Attention Memory Models (NAMMs) use a small neural network trained via evolution to dynamically select and evict KV cache tokens based on their attention history, improving both efficiency and performance.

Core Problem

Transformers struggle with long contexts due to quadratic attention costs, and existing heuristics for pruning the KV cache (like H2O) are hand-designed, lossy, and often degrade performance.

Why it matters:

Long-context tasks are resource-hungry, making foundation models expensive to train and serve
Hand-designed rules for token eviction inevitably trade off performance for efficiency, failing to distinguish truly useful information from noise
Current methods cannot adaptively shape memory based on task needs or transfer learned memory strategies across different architectures

Concrete Example: In tasks like PassageRetrieval where specific details matter, heuristic methods like H2O evict tokens based on simple accumulation of attention scores. This often discards critical information that appears rarely but is vital for the answer, leading to lower retrieval accuracy compared to the full context.

Key Novelty

Neural Attention Memory Models (NAMMs) optimized via Evolution

Replaces hand-crafted eviction heuristics with a learned neural network that scores tokens for retention based on the frequency patterns of how they were attended to (captured via spectrograms)
Uses evolutionary strategies (CMA-ES) instead of gradients, bypassing the non-differentiable nature of binary keep/drop decisions
Constructs features purely from attention matrices (not token embeddings), allowing the memory model to transfer zero-shot to completely different transformer architectures and modalities (e.g., vision, RL)

Architecture

The complete pipeline: Attention Matrix extraction, STFT processing, and the BAM network structure.

Evaluation Highlights

Outperforms full-context Llama-3-8B by 11% in normalized performance on LongBench tasks (1.11 vs. 1.00) while reducing cache size, suggesting that it removes noise rather than merely tolerating information loss
Achieves higher performance than hand-designed baselines (H2O, L2) while maintaining smaller average cache sizes across 36 tasks
Demonstrates zero-shot transfer from Llama-3-8B (language) to Stable Diffusion (vision) and Decision Transformer (RL), improving efficiency without retraining the memory model

Breakthrough Assessment

8/10

Strong conceptual novelty in using evolution for non-differentiable memory management and achieving zero-shot transfer across modalities. It effectively turns memory management into a learned, transferable skill.

⚙️ Technical Details

Problem Definition

Setting: KV cache compression for auto-regressive transformers processing long sequences

Inputs: Current attention matrix A (specifically columns corresponding to cached keys/values)

Outputs: Binary retention mask for each token in the KV cache (keep vs. evict)

Pipeline Flow

Feature Extraction (Attention Matrix → STFT Spectrogram)
Feature Compression (Spectrogram → EMA Vector)
Scoring (NAMM Network predicts importance scores)
Eviction (Remove tokens with negative scores)
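As a rough illustration of the compression step (Spectrogram → EMA Vector), the sketch below folds the time axis of a spectrogram into a single feature vector with an exponential moving average. The function name `ema_over_time`, the decay `gamma`, and the exact recurrence are assumptions made for illustration; the paper's reduction may differ in detail.

```python
import numpy as np

def ema_over_time(spectrogram: np.ndarray, gamma: float = 0.9) -> np.ndarray:
    """Reduce a (time_frames, freq_bins) spectrogram to one (freq_bins,) vector.

    Each step discounts the running value by gamma (an assumed decay rate),
    so recent frames weigh more than old ones.
    """
    ema = np.zeros(spectrogram.shape[1])
    for frame in spectrogram:  # frames are assumed ordered oldest first
        ema = gamma * ema + frame
    return ema
```

The output has a fixed size regardless of how long a token has been in the cache, which is what lets a small fixed-width network score tokens of any age.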

System Modules

Feature Extractor

Convert raw attention columns into frequency-domain features

Model or implementation: STFT (Short-Time Fourier Transform) with Hann window
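To make the feature extractor concrete, here is a minimal NumPy sketch of a magnitude STFT with a Hann window applied to one token's attention history. The function name `attention_spectrogram` and the values of `window_size` and `hop` are hypothetical placeholders, not the settings used in the paper.

```python
import numpy as np

def attention_spectrogram(attn_column, window_size=32, hop=16):
    """Magnitude STFT of one token's attention history.

    attn_column: 1-D sequence; entry t is the attention this cached token
    received from query t.
    Returns an array of shape (num_frames, window_size // 2 + 1).
    If the history is shorter than window_size, no frames are produced.
    """
    x = np.asarray(attn_column, dtype=float)
    window = np.hanning(window_size)  # Hann window
    frames = []
    for start in range(0, len(x) - window_size + 1, hop):
        segment = x[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(segment)))  # keep magnitudes only
    return np.array(frames)
```

Because the input is only a column of attention values, nothing here depends on the embedding size or vocabulary of the base model.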

NAMM Network (BAM)

Compute scalar importance scores for each token based on its attention history and relation to other tokens

Model or implementation: Backward Attention Memory (BAM) - small self-attention layer + linear head
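The following sketch shows one way counter-causal masking could be wired into a single-head self-attention layer followed by a linear head. All names (`bam_scores`, `w_q`, `w_k`, `w_v`, `w_out`) are hypothetical, and letting each token also attend to itself is an assumption made here so that the newest token always has something to attend to.

```python
import numpy as np

def bam_scores(features, w_q, w_k, w_v, w_out):
    """Score cached tokens with counter-causally masked self-attention.

    features: (n_tokens, d) per-token feature vectors, ordered oldest to newest.
    w_q, w_k, w_v: (d, d) projection matrices; w_out: (d,) linear head.
    Returns an (n_tokens,) array of scalar scores.
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    logits = q @ k.T / np.sqrt(k.shape[1])
    n = features.shape[0]
    # Counter-causal mask: token i may attend to token j only if j >= i,
    # i.e. to itself and to newer tokens, never to older ones.
    allowed = np.triu(np.ones((n, n), dtype=bool))
    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights @ v) @ w_out
```

With this mask an older token's score can react to newer tokens that carry similar information, while a newer token's score is unaffected by anything older than it.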

Eviction Mechanism

Filter the KV cache based on scores

Model or implementation: Thresholding Logic
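A minimal sketch of the thresholding step, assuming the rule stated in the pipeline flow (tokens with negative scores are removed). The function name `evict` and the array layout are illustrative assumptions.

```python
import numpy as np

def evict(keys, values, scores):
    """Drop cached entries whose score is negative.

    keys, values: (n_tokens, d) arrays for one attention head.
    scores: (n_tokens,) importance scores from the memory model.
    Returns the filtered keys, the filtered values, and the boolean keep mask.
    """
    keep = scores >= 0  # a score of exactly zero is kept in this sketch
    return keys[keep], values[keep], keep
```

Since the threshold is fixed at zero rather than a target budget, the resulting cache size is not set in advance and can vary with the scores the network produces.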

Novel Architectural Elements

Feature extraction via STFT on attention matrices to create universal, embedding-agnostic inputs
Backward Attention Memory (BAM) architecture using counter-causal masking to model token competition and redundancy

Modeling

Base Model: Llama-3-8B-Instruct (extended to 32k context via NTK-aware interpolation)

Training Method: Evolutionary Optimization (CMA-ES)

Objective Functions:

Purpose: Maximize downstream task performance relative to the full-context baseline.

Formally: Maximize normalized score (e.g., Accuracy_NAMM / Accuracy_FullContext) across a batch of prompts.
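One plausible reading of this objective is sketched below: divide each task's score under the NAMM by the full-context score and average the ratios. The function name `normalized_score` and the choice to average per-task ratios are assumptions; the paper may aggregate differently.

```python
def normalized_score(namm_scores, full_context_scores):
    """Mean ratio of NAMM performance to full-context performance.

    Both arguments are equal-length sequences of per-task metrics.
    Full-context scores are assumed to be non-zero.
    """
    ratios = [n / f for n, f in zip(namm_scores, full_context_scores)]
    return sum(ratios) / len(ratios)
```

A value above 1.0 means the compressed cache does better than the full context on average, and a value below 1.0 means it does worse.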

Adaptation: None (Base LLM is frozen; only NAMM parameters are evolved)

Trainable Parameters: ~4000 parameters (single shared NAMM across all layers)

Training Data:

Subset of LongBench tasks: PassageRetrieval-en, DuReader, NarrativeQA
Incremental training: Start with 1 task, add others in phases

Key Hyperparameters:

population_size: Not reported in the paper
generations: 300 (Phase 1) + 250 (Phase 2) + 120 (Phase 3)
update_frequency (n_up): Not explicitly reported in the paper

Compute: Inference-only training loop (no backprop); requires running LLM inference for each population member
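The shape of such an inference-only loop can be sketched with a deliberately simplified evolution strategy. This is not CMA-ES: it keeps a fixed isotropic step size and adapts neither the covariance nor the step size. The names `evolve` and `fitness_fn`, the population size, and the toy fitness function are all assumptions; in the real setting each fitness call would run the frozen LLM with the candidate NAMM parameters.

```python
import numpy as np

def evolve(fitness_fn, n_params, generations=50, population_size=16,
           sigma=0.1, seed=0):
    """Simplified Gaussian evolution strategy (not CMA-ES).

    fitness_fn maps a parameter vector to a scalar to maximize.
    Only forward evaluations are used; no gradients are computed.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    n_elite = max(1, population_size // 4)
    for _ in range(generations):
        noise = rng.normal(size=(population_size, n_params))
        candidates = mean + sigma * noise
        fitness = np.array([fitness_fn(c) for c in candidates])
        elite = candidates[np.argsort(fitness)[-n_elite:]]  # best candidates
        mean = elite.mean(axis=0)
    return mean

# Toy stand-in for "run the LLM and measure task performance":
# the optimum is the all-ones vector.
best = evolve(lambda p: -np.sum((p - 1.0) ** 2), n_params=3, generations=200)
```

The cost structure matches the note above: every generation needs one fitness evaluation per population member, which is why evaluation cost dominates training.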

Comparison to Prior Work

vs. H2O/L2: NAMM learns a non-linear policy via evolution rather than using fixed heuristics; NAMM inputs are frequency-domain features rather than raw accumulations
vs. Prompt Engineering: NAMM operates on latent memory directly, allowing different layers/heads to keep different contexts [not cited in paper as direct baseline, but discussed conceptually]
vs. Soft-Prompting/LoRA: NAMM does not modify base model weights or inputs, only memory retention [not cited in paper]

Limitations

Training requires expensive inference loops (running the full LLM for every evaluation in the population)
Evolutionary optimization scales poorly with parameter count (hence the very small NAMM network)
Performance gains vary by task; some tasks show smaller improvements than others

Reproducibility

Code: https://github.com/SakanaAI/evo-memory

The paper details the specific tasks used for evolution (PassageRetrieval-en, DuReader, NarrativeQA) and the baselines. STFT hyperparameters and exact network dimensions are given in the appendix and the code.

📊 Experiments & Results

Evaluation Setup

Long-context language understanding, generation, and cross-modal transfer

Benchmarks:

LongBench (multi-task long context: QA, summarization, retrieval)
InfiniteBench (ultra-long context, avg ~200k tokens)
ChouBun (Long-context Japanese tasks) [New]

Metrics:

Normalized Performance (relative to full context)
Absolute Performance (Accuracy, F1, ROUGE)
Cache Size (Compression Rate)
Statistical methodology: Not explicitly reported in the paper

Key Results

LongBench results demonstrate NAMM's ability to improve over the full-context baseline while compressing memory, unlike heuristics, which degrade performance.
Benchmark	Metric	Baseline	This Paper	Δ
LongBench (All Tasks)	Normalized Performance	1.00	1.11	+0.11
LongBench (Test Tasks - Held Out)	Normalized Performance	1.00	1.07	+0.07
LongBench	Average Cache Size	1024	733	-291
Decision Transformer (Atari Breakout)	Normalized Score	1.0	1.9	+0.9
Stable Diffusion (MS-COCO)	FID (Frechet Inception Distance)	20.5	20.3	-0.2
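For readers who prefer relative figures, the cache-size row can be converted into a percentage reduction. The short calculation below uses only the numbers from the table; the variable names are illustrative.

```python
baseline_cache = 1024  # "Baseline" column, LongBench average cache size
namm_cache = 733       # "This Paper" column

# Relative reduction in average cache size versus the baseline column
reduction = (baseline_cache - namm_cache) / baseline_cache
print(f"Cache size reduction: {reduction:.1%}")
```

When reading the Δ column, note that lower is better for both cache size and FID, so the negative deltas in those rows are improvements.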

Experiment Figures

Pareto frontier comparison of NAMM vs. H2O and L2 on LongBench.

Main Takeaways

NAMMs successfully decouple memory management from the specific model weights, allowing zero-shot transfer across modalities (Language → Vision/RL).
Evolutionary training enables optimizing discrete, non-differentiable eviction decisions directly for downstream metrics.
The method acts as a 'denoising' filter for the KV cache, often improving performance over full-context models by removing distracting tokens.
BAM (Backward Attention Memory) architecture is crucial for detecting redundancy by allowing older tokens to attend to newer ones.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, KV Cache)
Short-Time Fourier Transform (STFT)
Evolutionary Strategies (CMA-ES)

Key Terms

KV cache: Key-Value cache—a memory mechanism in transformers that stores intermediate representations of past tokens to avoid recomputing them during generation

NAMM: Neural Attention Memory Model—the proposed auxiliary network that decides which tokens to keep in the KV cache

STFT: Short-Time Fourier Transform—a signal processing technique used here to create spectrograms of attention patterns over time

CMA-ES: Covariance Matrix Adaptation Evolution Strategy—a derivative-free optimization algorithm used to train the memory model

BAM: Backward Attention Memory—the specific architecture of the NAMM network, using counter-causal attention to compare tokens

spectrogram: A visual representation of the spectrum of frequencies of a signal as it varies with time; here represents how attention to a token fluctuates

zero-shot transfer: Applying a model trained on one task/domain to a different one without any further training

token eviction: The process of removing tokens from the KV cache to save memory