Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

📝 Paper Summary

Memory organization, Inference acceleration

DMC retrofits pre-trained LLMs to dynamically compress the Key-Value cache at inference time by learning to accumulate token representations via weighted averaging rather than appending every token.

Core Problem

The Key-Value (KV) cache in Transformers grows linearly with sequence length and batch size, making auto-regressive generation memory-bound and limiting throughput for long contexts.

Why it matters:

High Bandwidth Memory (HBM) bottlenecks dominate generation latency, as most time is spent moving weights and KV states rather than computing.
Existing solutions like eviction (H2O, TOVA) or token merging often degrade downstream performance significantly at high compression ratios.
Linear memory growth prevents large batch sizes and long contexts on fixed hardware budgets (e.g., H100 GPUs).

Concrete Example: In a standard Transformer generating a story, every single token (e.g., 'the', 'a') adds a new entry to the KV cache. Over 4000 tokens, this cache becomes massive: for each new prediction the GPU must read the keys and values of all 4000 cached tokens, in every layer and attention head. Generation slows to a crawl even if the GPU has compute power to spare.
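To put a number on the example above, the following is a minimal sketch of the cache-size arithmetic. The function name `kv_cache_bytes` is hypothetical; the configuration used (32 layers, 32 heads, head dimension 128) is that of Llama 2 7B, and 2 bytes per value assumes fp16 or bf16 storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Llama 2 7B: 32 layers, 32 attention heads, head dimension 128, 16-bit values.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(full / 2**30)      # 2.0 (GiB) for a single 4096-token sequence
print(full / 4 / 2**30)  # 0.5 (GiB) if the cache holds 4x fewer entries
```

The cache scales linearly with both `seq_len` and `batch_size`, which is why a fixed memory budget caps the batch size at long contexts.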

Key Novelty

Dynamic Memory Compression (DMC)

Replaces the standard 'always append' cache update with a learned decision: either append the new token or merge it into the previous cache slot via weighted averaging.
Uses 'retrofitting' (continued pre-training on ~2-8% of original data) to teach the model how to compress its own memory without adding new parameters.
Learns different compression rates for different attention heads and layers, adapting to the model's internal information flow.

Architecture

Conceptual operation of Dynamic Memory Compression at a single time step.

Evaluation Highlights

Increases inference throughput by 350% to 390% for Llama 2 7B and 13B on NVIDIA H100 GPUs using 4x compression.
Achieves up to 700% throughput gain with 8x compression (approx 5% MMLU drop) compared to uncompressed baselines.
Combines with Grouped Query Attention (GQA) for compounded gains: Llama 2 70B (GQA 8x) + DMC 2x yields 16x total compression.
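The compounding in the last highlight is a plain product of the two compression factors. A minimal sketch of that arithmetic follows; the helper name is hypothetical, and the head counts (64 query heads sharing 8 key/value heads) are those of Llama 2 70B.

```python
def gqa_compression(n_query_heads, n_kv_heads):
    """KV cache reduction from sharing each key/value head across query heads."""
    return n_query_heads // n_kv_heads


gqa = gqa_compression(n_query_heads=64, n_kv_heads=8)  # Llama 2 70B -> 8x
dmc = 2                                                # DMC target compression
total = gqa * dmc
print(total)  # 16
```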

Breakthrough Assessment

8/10

Offers a practical, hardware-aware solution to the KV cache bottleneck that outperforms eviction baselines and requires no architectural changes (no new parameters), making it highly deployable.

⚙️ Technical Details

Problem Definition

Setting: Auto-regressive sequence generation where past Key-Value states must be stored to avoid recomputation.

Inputs: Sequence of hidden states X = (x_1, ..., x_n)

Outputs: Updated KV cache containing a compressed sequence of keys and values.

Pipeline Flow

Layer Input
Decision Head (reuse existing neurons)
Conditional Update (Append or Accumulate)
Compressed Attention

System Modules

Decision Head

Predicts a binary decision alpha (segment boundary) and a scalar omega (importance weight) for the current token.

Model or implementation: Reused first neurons of Key and Query projections (no extra params)

Cache Updater

Updates the KV cache based on alpha. If alpha=1, append the new k/v as a new cache slot. If alpha=0, merge the new k/v into the last cache slot via a weighted average, with omega as the new token's weight.

Model or implementation: Algorithmic update rule (Algorithm 1)
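The two modules above can be sketched for a single attention head as follows. This is an illustrative NumPy sketch, not the paper's Algorithm 1: the class name `DMCHeadCache` is hypothetical, it follows this summary's convention (alpha=1 appends, alpha=0 accumulates), and it assumes that the reused first neurons are zeroed after the decision is read and that a running weight sum `z` is kept for the last slot.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class DMCHeadCache:
    """KV cache for one attention head with append-or-accumulate updates."""

    def __init__(self):
        self.keys = []    # one (possibly merged) key vector per cache slot
        self.values = []  # one (possibly merged) value vector per cache slot
        self.z = 0.0      # sum of importance weights merged into the last slot

    def update(self, k, v, q):
        k, v, q = (np.array(x, dtype=float) for x in (k, v, q))
        # Decision and importance are read from the first neurons of k and q.
        alpha = int(sigmoid(k[0]) > 0.5)  # 1 = append, 0 = accumulate
        omega = float(sigmoid(q[0]))
        # Assumption: the reused neurons are zeroed so they do not affect attention.
        k[0] = 0.0
        q[0] = 0.0
        if alpha == 1 or not self.keys:
            # Start a new slot; its weight sum restarts at omega.
            self.keys.append(k)
            self.values.append(v)
            self.z = omega
        else:
            # Weighted average of the last slot and the new token.
            z_new = self.z + omega
            self.keys[-1] = (self.keys[-1] * self.z + k * omega) / z_new
            self.values[-1] = (self.values[-1] * self.z + v * omega) / z_new
            self.z = z_new
        return alpha, omega


cache = DMCHeadCache()
cache.update(k=[5.0, 1.0], v=[1.0, 0.0], q=[0.0, 0.0])   # alpha=1: append
cache.update(k=[-5.0, 3.0], v=[0.0, 1.0], q=[0.0, 0.0])  # alpha=0: merge
print(len(cache.keys), cache.keys[0], cache.values[0])
```

Two tokens were processed but the cache holds one slot, whose key and value are the omega-weighted averages of both tokens.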

Masked Attention

Simulates inference-time compression during parallel training by masking out interactions with 'overwritten' intermediate states.

Model or implementation: Modified Attention Mask
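One way to picture the masking is with hard decisions, as in the sketch below. It assumes that position i in the unrolled training view stores the partially accumulated state of its segment up to token i, so that state is "overwritten" as soon as token i+1 merges into it. The function name `dmc_training_mask` is hypothetical, the convention alpha=1 (append) follows this summary, and the paper's actual mask may differ, for instance by operating on relaxed rather than hard decisions.

```python
import numpy as np


def dmc_training_mask(alpha):
    """Boolean mask M[t, i]: True where query t may attend to position i."""
    alpha = np.asarray(alpha)
    n = len(alpha)
    # Position i holds the final state of its segment if the next token
    # appends (alpha[i + 1] == 1) or if i is the last position.
    is_final = np.ones(n, dtype=bool)
    is_final[:-1] = alpha[1:] == 1
    causal = np.tril(np.ones((n, n), dtype=bool))
    # A query sees the final states of earlier segments plus its own
    # (possibly partial) state on the diagonal.
    return causal & (is_final[None, :] | np.eye(n, dtype=bool))


# Segments {0, 1, 2} and {3, 4}: positions 0, 1, and 3 get overwritten.
mask = dmc_training_mask([1, 0, 0, 1, 0])
print(mask.astype(int))
```

Each row has as many visible positions as the inference-time cache would have slots at that step, which is what lets parallel training match sequential inference.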

Novel Architectural Elements

In-place weighted accumulation of KV states governed by a learned decision gate (alpha)
Re-purposing specific existing neurons (first neuron of k/q) as decision heads to avoid adding parameters

Modeling

Base Model: Llama 2 (7B, 13B, and 70B)

Training Method: Continued pre-training (Retrofitting) with Gumbel-sigmoid relaxation

Objective Functions:

Purpose: Maintain original generation capability.

Formally: l_LM = - sum_t log p(x_t | x_<t)

Purpose: Enforce target compression ratio.

Formally: l_CR = ( sum(alpha) - target_length ) / normalization_factor
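The two terms can be sketched numerically as follows. This is an illustration of the formulas as written in this summary, not the paper's exact loss: the function names are hypothetical, the target length is assumed to be n / CR, the normalization factor is assumed to be the sequence length n, and alpha=1 is taken to mean "append" so that sum(alpha) equals the compressed cache length.

```python
import numpy as np


def lm_loss(token_log_probs):
    """l_LM: negative sum of log p(x_t | x_<t) over the sequence."""
    return -float(np.sum(token_log_probs))


def compression_loss(alpha, target_cr):
    """l_CR: (sum(alpha) - target_length) / normalization_factor."""
    alpha = np.asarray(alpha, dtype=float)
    n = alpha.size
    target_length = n / target_cr  # assumption: target cache length is n / CR
    return float((alpha.sum() - target_length) / n)  # assumption: normalize by n


# 8 tokens, 3 appended slots, target 4x compression (2 slots): 1 slot too many.
l_cr = compression_loss([1, 0, 0, 1, 0, 0, 1, 0], target_cr=4)
l_lm = lm_loss(np.log([0.5, 0.25]))
print(l_cr)  # 0.125
print(l_lm)  # log(8), about 2.079
```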

Adaptation: Full fine-tuning (continued pre-training)

Trainable Parameters: All existing parameters are updated during retrofitting; no new parameters are added.

Training Data:

Uses a small percentage of the original pre-training data (~2% for 2x compression, ~8% for 8x)

Key Hyperparameters:

temperature: Constant (tau) used in Gumbel-sigmoid to sharpen decisions
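The role of the temperature can be seen in a minimal Gumbel-sigmoid sketch. The function name and signature are hypothetical; the sketch uses the standard construction in which logistic noise (the difference of two Gumbel samples) is added to the logit before a temperature-scaled sigmoid.

```python
import numpy as np


def gumbel_sigmoid(logits, tau, rng):
    """Relaxed binary sample in (0, 1); lower tau pushes values toward 0 or 1."""
    logits = np.asarray(logits, dtype=float)
    u = np.clip(rng.uniform(size=logits.shape), 1e-9, 1 - 1e-9)
    noise = np.log(u) - np.log1p(-u)  # logistic noise
    return 1.0 / (1.0 + np.exp(-(logits + noise) / tau))


logits = np.array([-2.0, 0.0, 2.0])
# Reusing the seed gives both calls the same noise, isolating the effect of tau.
soft = gumbel_sigmoid(logits, tau=1.0, rng=np.random.default_rng(0))
sharp = gumbel_sigmoid(logits, tau=0.1, rng=np.random.default_rng(0))
print(soft)
print(sharp)
```

With the same noise, the `tau=0.1` outputs lie at least as close to 0 or 1 as the `tau=1.0` outputs, while remaining differentiable with respect to the logits.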

Compute: NVIDIA H100 or A100 GPUs used for inference measurements

Comparison to Prior Work

vs. GQA: DMC is dynamic/input-dependent per token, whereas GQA is a fixed architectural change. Can be combined.
vs. H2O/TOVA: DMC accumulates information via weighted averaging (preserving semantics) rather than hard eviction (deleting information).
vs. Token Merging (e.g., ToMe) [not cited in paper]: DMC operates online during auto-regressive generation, while ToMe typically merges tokens in non-causal encoder settings.

Limitations

Training requires a custom masking formulation to parallelize the sequential accumulation logic.
High compression ratios (8x) incur a noticeable performance drop (~5% on MMLU).
Requires re-training (retrofitting), unlike plug-and-play eviction policies like H2O.

Reproducibility

The paper text does not state whether code is available. The parallel training mask is described in the main text, with further implementation detail referenced in Appendix G.

📊 Experiments & Results

Evaluation Setup

Auto-regressive inference on NVIDIA GPUs measuring throughput and downstream task accuracy.

Benchmarks:

MMLU (Factuality and Knowledge)
HumanEval (Code Generation)
Common Sense QA (Reasoning)

Metrics:

Throughput (tokens/sec)
MMLU Accuracy
Compression Ratio (CR)
Statistical methodology: Not explicitly reported in the paper

Key Results

Throughput experiments demonstrate significant speedups on high-end hardware due to reduced memory bottlenecks.

Benchmark	Metric	Baseline	This Paper	Δ
Inference on H100 (4x compression)	Throughput Gain (%)	0	390	+390
Inference on H100 (8x compression)	Throughput Gain (%)	0	700	+700
MMLU (8x compression)	Relative Performance Drop (%)	0	-5	-5

Experiment Figures

Illustration of the KV cache content evolution during inference vs the unrolled view needed for training.

Main Takeaways

DMC achieves sub-linear memory growth, effectively sitting between Transformers (linear) and State Space Models (constant memory).
The method learns to compress differently across layers, revealing that the model prefers compressing heads in higher layers more aggressively.
DMC outperforms eviction baselines (H2O, TOVA) which suffer severe degradation at comparable compression ratios.
Gains are compounded when combining architectural compression (GQA) with learned dynamic compression (DMC).

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention)
Auto-regressive decoding
GPU memory hierarchy (HBM; memory-bound vs. compute-bound workloads)

Key Terms

KV Cache: A memory store holding Key and Value matrices for past tokens in a Transformer to speed up generation.

GQA: Grouped Query Attention—a method where multiple query heads share a single key/value head to reduce memory usage.

Retrofitting: Adapting a pre-trained model to a new capability (here, compression) via short continued training on a small subset of data.

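Gumbel-sigmoid: A continuous relaxation of sampling a discrete binary decision (like 'append' vs 'accumulate'). Noise is added to the decision logit and the result is passed through a temperature-scaled sigmoid, which allows gradient descent through the decision.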

HBM: High Bandwidth Memory, the main off-chip GPU memory where model weights and KV caches are stored; its bandwidth often limits LLM generation speed.

GEMM: General Matrix Multiply—the fundamental mathematical operation in neural networks.

Memory-bound: A computing scenario where execution speed is limited by how fast data can be moved from memory, not how fast the processor can calculate.

MHSA: Multi-Head Self-Attention—the core mechanism in Transformers allowing tokens to attend to other tokens.

H2O/TOVA: Baseline cache eviction policies that drop less important tokens from the KV cache based on attention scores.