Attention and Compression is all you need for Controllably Efficient Language Models

📝 Paper Summary

Memory recall · Efficient transformers

CAT is a transformer architecture that decodes each token by attending to compressed representations of past chunks, which are computed in parallel, so a single model can trade off quality against compute at test time.

Core Problem

Standard transformers have quadratic attention costs, while efficient alternatives (sparse/linear attention) often sacrifice in-context recall or require fixed compute budgets that cannot adapt to varying task requirements.

Why it matters:

Diverse downstream tasks have different resource needs; a single fixed-budget model is suboptimal for both low-latency email replies and high-recall code completion.
Existing efficient methods often use heuristic attention masks or complex recurrent states that struggle with long-context information retention.
Training multiple models for different efficiency trade-offs is prohibitively expensive.

Concrete Example: Code auto-completion needs high-recall access to function names defined far back in a repository, which favors dense attention, whereas short email replies need low latency, for which linear attention suffices. Current methods force that choice before training, so no single model can handle both well.

Key Novelty

Compress & Attend Transformer (CAT)

Splits sequences into chunks and compresses each chunk in parallel into a compact representation using a 'compressor' transformer.
Decodes new tokens by attending only to these compressed past representations and the current raw chunk, significantly reducing memory and compute.
Enables test-time adaptivity by training with variable chunk sizes; a single model can switch between high-efficiency (large chunks) and high-quality (small chunks) modes instantly.
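The compress-and-attend pattern described above can be sketched as a small indexing helper. This is an illustrative sketch, not the authors' implementation: it assumes each past chunk is summarized by a single compressed entry and that a token attends causally to raw tokens in its own chunk, including itself. The function name `decoder_context` is hypothetical.

```python
def decoder_context(position: int, chunk_size: int) -> tuple[list[int], list[int]]:
    """Return what a token at `position` (0-indexed) can attend to.

    Assumption: one compressed entry per completed past chunk, plus the raw
    tokens of the current chunk up to and including `position`.
    """
    if position < 0 or chunk_size <= 0:
        raise ValueError("position must be >= 0 and chunk_size must be > 0")
    current_chunk = position // chunk_size
    chunk_start = current_chunk * chunk_size
    compressed_chunks = list(range(current_chunk))         # ids of past chunks
    raw_positions = list(range(chunk_start, position + 1)) # current raw tokens
    return compressed_chunks, raw_positions


if __name__ == "__main__":
    chunks, raw = decoder_context(position=10, chunk_size=4)
    print("compressed chunks:", chunks)
    print("raw positions:", raw)
```

With a larger chunk size the list of compressed chunks shrinks while the raw window grows, which is the quality versus efficiency dial the summary describes.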

Architecture

The CAT architecture layout showing parallel compression and autoregressive decoding.

Evaluation Highlights

Matches dense transformer perplexity on FineWeb-Edu while being 1.4-3x faster and requiring 2-9x lower total memory.
Surpasses dense transformer on real-world in-context recall tasks even in its least efficient setting (cat-4), while being 1.5x faster and 2x more memory efficient.
Outperforms linear attention baselines (Mamba2, GatedDeltaNet) and hybrid architectures on long-context understanding benchmarks.

Breakthrough Assessment

9/10

Offers a rare combination of superior performance and efficiency compared to dense transformers, with the unique capability of test-time compute-quality interpolation in a single model.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive sequence modeling where the model predicts the next token given a context of previous tokens.

Inputs: Sequence of tokens split into chunks.

Outputs: Predictive distribution for tokens in the current chunk.

Pipeline Flow

Sequence Chunking
Parallel Compression
Autoregressive Decoding

System Modules

Compressor

Compresses a chunk of raw tokens into a compact latent representation.

Model or implementation: Bidirectional dense transformer (Hidden size D, projected to Dg)

Decoder

Generates the next tokens by attending to the compressed history and current local tokens.

Model or implementation: Causal dense transformer (Hidden size Dg = 2D)

Novel Architectural Elements

Decoupled Compressor and Decoder allowing parallel processing of past context.
Variable chunk-size training with indicator tokens to enable test-time compute control.
Memory still grows linearly with sequence length N, but at a rate reduced by the chunk size C (on the order of N/C) compared to standard transformers.
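To make the N/C memory growth concrete, the sketch below counts cache entries rather than bytes. It assumes one cached entry per completed chunk plus the raw tokens of the chunk still being decoded; it ignores the wider decoder hidden size and model weights, so it is not expected to reproduce the total memory figures reported in the evaluation. Both function names are hypothetical.

```python
def cat_cache_entries(n_tokens: int, chunk_size: int) -> int:
    """Cache entries held after n_tokens, under the one-entry-per-chunk assumption."""
    completed_chunks = n_tokens // chunk_size  # each kept as one compressed entry
    raw_in_progress = n_tokens % chunk_size    # raw tokens of the unfinished chunk
    return completed_chunks + raw_in_progress


def dense_cache_entries(n_tokens: int) -> int:
    """A standard transformer caches one entry per past token."""
    return n_tokens


if __name__ == "__main__":
    n = 4096  # assumes the "4K" context length means 4096 tokens
    for c in (4, 8, 16, 32):
        print(c, cat_cache_entries(n, c), dense_cache_entries(n))
```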

Modeling

Base Model: CAT (Custom architecture)

Trainable Parameters: ~1B (comparable to baselines)

Training Data:

15B tokens of FineWeb-Edu

Key Hyperparameters:

learning_rate: 8e-4 (peak)
weight_decay: 0.1
batch_size: 0.5M tokens
context_length: 4K
chunk_sizes: {4, 8, 16, 32}
decoder_hidden_size: 2048 (2*D)
compressor_hidden_size: 1024 (D)
layers: 12 (Decoder), 3 (Compressor)

Compute: Scalable training implementation (O(N^2/C) complexity)
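One way to see where an O(N^2/C) cost could come from is to count query-key pairs directly. The sketch below is a back-of-the-envelope count under two assumptions that the summary does not confirm: each past chunk contributes a single compressed entry, and the compressor attends bidirectionally within each full chunk. When C divides N, the decoder count has the closed form (N/2)(N/C + C), which is dominated by N^2/(2C) when C is much smaller than N. All function names are hypothetical.

```python
def cat_decoder_pairs(n_tokens: int, chunk_size: int) -> int:
    """Sum, over all positions, of compressed past chunks plus visible raw tokens."""
    total = 0
    for position in range(n_tokens):
        past_chunks = position // chunk_size
        raw_tokens = position % chunk_size + 1  # current chunk, causal, incl. self
        total += past_chunks + raw_tokens
    return total


def compressor_pairs(n_tokens: int, chunk_size: int) -> int:
    """Bidirectional attention inside each full chunk: C * C pairs per chunk."""
    return (n_tokens // chunk_size) * chunk_size * chunk_size


def dense_causal_pairs(n_tokens: int) -> int:
    """Standard causal attention: position t attends to t + 1 tokens."""
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    n, c = 4096, 16
    print("CAT decoder pairs:", cat_decoder_pairs(n, c))
    print("compressor pairs:", compressor_pairs(n, c))
    print("dense causal pairs:", dense_causal_pairs(n))
```

These are operation counts only; measured speedups also depend on hidden sizes, kernels, and hardware.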

Comparison to Prior Work

vs. Sparse/Linear: CAT allows adaptive quality-compute trade-off at test time via chunk size.
vs. Recurrent compression (Rae et al.): CAT allows parallel training and avoids the issues of backpropagation through time (BPTT).
vs. NSA (Native Sparse Attention): CAT reduces both compute and memory (KV cache) by discarding raw past tokens, whereas NSA only reduces compute.

Limitations

Naive decoder implementation can be slow; requires specific custom masking for efficiency.
Requires 2x hidden size in decoder to match dense transformer perplexity, increasing parameter count slightly.
Compression is lossy; extremely large chunk sizes may eventually degrade fine-grained recall (though less than linear baselines).

Reproducibility

Code: https://github.com/rajesh-lab/cat-transformer

📊 Experiments & Results

Evaluation Setup

Language modeling, common-sense reasoning, and long-context recall tasks.

Benchmarks:

FineWeb-Edu (FW): language modeling (perplexity)
LAMBADA (LMB): language modeling
LongBench: long-context understanding
Real-world In-Context Recall: information retrieval from context

Metrics:

Perplexity (Zero-shot)
Accuracy
Recall Performance
Throughput (tokens/sec)
Memory Usage (MB)
Statistical methodology: Not explicitly reported in the paper

Key Results

Language modeling results showing CAT matches or beats baselines on perplexity.

Benchmark	Metric	Baseline	This Paper	Δ
FineWeb-Edu (FW)	Perplexity	13.62	13.20	-0.42
FineWeb-Edu (FW)	Perplexity	14.28	13.20	-1.08

Long-context understanding capabilities compared to efficient architectures.

Benchmark	Metric	Baseline	This Paper	Δ
LongBench	Average Score	29.7	31.0	+1.3
LongBench	Average Score	27.8	31.0	+3.2

Efficiency metrics (speed and memory) demonstrating significant savings.

Benchmark	Metric	Baseline	This Paper	Δ
Inference Efficiency	Throughput (tokens/s)	1455.5	4693.3	+3237.8
Inference Efficiency	Memory Usage (MB)	22960	2496	-20464
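The efficiency rows are easier to compare with the headline ranges (1.4-3x faster, 2-9x lower memory) when expressed as ratios. The short calculation below uses only the four numbers from the efficiency rows of the table; the variable names are illustrative, and the table does not name which baseline or which CAT setting these rows correspond to.

```python
# Values copied from the efficiency rows of the results table.
baseline_throughput = 1455.5  # tokens/s
cat_throughput = 4693.3       # tokens/s
baseline_memory_mb = 22960
cat_memory_mb = 2496

speedup = cat_throughput / baseline_throughput
memory_reduction = baseline_memory_mb / cat_memory_mb

print(f"throughput ratio: {speedup:.2f}x")
print(f"memory ratio: {memory_reduction:.2f}x")
```

Both ratios land near the upper end of the reported ranges, which suggests these rows reflect one of the more aggressive compression settings, though the summary does not say so explicitly.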

Experiment Figures

Trade-off curve between In-Context Recall Accuracy and Throughput/Memory.

KV Cache Memory usage and Throughput scaling with sequence length.

Main Takeaways

CAT slows the growth of memory consumption with sequence length by a factor of the chunk size (O(N/C) rather than O(N)), allowing much longer contexts than dense transformers within the same budget.
The adaptive training strategy works: a single model can effectively switch between chunk sizes (4, 8, 16, 32) at inference time to modulate performance vs. speed.
Unlike linear attention models, which struggle with in-context recall, CAT maintains high recall accuracy even with compression, likely because it retains a compressed 'snapshot' of each chunk rather than a single rolling state.
Increasing decoder width (2x) is crucial for CAT to match dense transformer perplexity, suggesting compressed decoding requires more expressive capacity.
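The adaptive training takeaway can be illustrated with a minimal data-preparation sketch. It assumes one chunk size is drawn uniformly per training sequence and that the indicator is a special token attached to the input; the summary does not state the sampling scheme, and the `<chunk=...>` token strings and all function names here are hypothetical.

```python
import random

CHUNK_SIZES = (4, 8, 16, 32)  # chunk sizes listed in the hyperparameters


def indicator_for(chunk_size: int) -> str:
    """Hypothetical indicator token that tells the model the chunk size."""
    return f"<chunk={chunk_size}>"


def split_into_chunks(tokens: list[int], chunk_size: int) -> list[list[int]]:
    """Split a token sequence into consecutive chunks of chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def make_training_example(tokens: list[int], rng: random.Random) -> dict:
    """Sample a chunk size (assumed uniform) and chunk the sequence with it."""
    chunk_size = rng.choice(CHUNK_SIZES)
    return {
        "indicator": indicator_for(chunk_size),
        "chunk_size": chunk_size,
        "chunks": split_into_chunks(tokens, chunk_size),
    }


if __name__ == "__main__":
    example = make_training_example(list(range(64)), random.Random(0))
    print(example["indicator"], len(example["chunks"]), "chunks")
```

At inference time the same idea would apply in reverse: the caller picks the chunk size, supplies the matching indicator, and gets the corresponding point on the quality versus speed curve.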

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, KV Cache)
Autoregressive decoding
Linear vs. Quadratic complexity

Key Terms

CAT: Compress & Attend Transformer, the proposed architecture that compresses past context chunks into dense representations.

KV Cache: Key-Value Cache, the storage of pre-computed attention keys and values used to speed up generation, which CAT reduces significantly.

Chunk: A fixed-size segment of consecutive tokens from the input sequence.

Compressor: A bidirectional transformer component in CAT that encodes a raw token chunk into a smaller set of compressed vector representations.

Decoder: A causal transformer component in CAT that generates new tokens by attending to current raw tokens and past compressed representations.

Indicator Token: A learnable token passed to the model during training and inference to signal the current chunk size (compression rate).

In-context recall: The ability of a model to retrieve and utilize specific information found earlier in its input context.

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance.
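As a worked illustration of the definition, perplexity is commonly computed as the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities are already available; the function name is illustrative and not taken from the paper's code.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the average negative log-probability assigned to the true tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every true token probability 0.5.
    print(perplexity([math.log(0.5)] * 4))
```

Intuitively, a perplexity of k means the model is about as uncertain as if it were choosing uniformly among k tokens at each step.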