Evaluation Setup
The evaluation covers language modeling, common-sense reasoning, and long-context recall tasks.
Benchmarks:
- FineWeb-Edu (FW): language modeling (perplexity)
- LAMBADA (LMB): language modeling
- LongBench: long-context understanding
- Real-world In-Context Recall: information retrieval from context
Metrics:
- Perplexity (Zero-shot)
- Accuracy
- Recall Performance
- Throughput (tokens/sec)
- Memory Usage (MB)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Language modeling results showing CAT matches or beats baselines on perplexity. | | | | |
| FineWeb-Edu (FW) | Perplexity | 13.62 | 13.20 | -0.42 |
| FineWeb-Edu (FW) | Perplexity | 14.28 | 13.20 | -1.08 |
| Long-context understanding capabilities compared to efficient architectures. | | | | |
| LongBench | Average Score | 29.7 | 31.0 | +1.3 |
| LongBench | Average Score | 27.8 | 31.0 | +3.2 |
| Efficiency metrics (Speed and Memory) demonstrating significant savings. | | | | |
| Inference Efficiency | Throughput (tokens/s) | 1455.5 | 4693.3 | +3237.8 |
| Inference Efficiency | Memory Usage (MB) | 22960 | 2496 | -20464 |
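The efficiency rows report absolute differences, which are easier to interpret as ratios. The short Python sketch below derives the throughput speedup and memory reduction factor from the four values in the table; the variable names are illustrative, and nothing here goes beyond arithmetic on the reported figures.

```python
# Values copied from the "Inference Efficiency" rows of the results table.
baseline_throughput = 1455.5  # tokens/s
cat_throughput = 4693.3       # tokens/s
baseline_memory_mb = 22960    # MB
cat_memory_mb = 2496          # MB

# Ratios: how many times faster, and how many times less memory.
speedup = cat_throughput / baseline_throughput
memory_reduction = baseline_memory_mb / cat_memory_mb

print(f"throughput speedup: {speedup:.2f}x")
print(f"memory reduction:   {memory_reduction:.2f}x")
```

Under these figures the throughput gain is roughly 3.2x and the memory footprint shrinks by roughly 9.2x.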
Main Takeaways
- CAT reduces memory consumption to O(N/C), where N is the sequence length and C the chunk size, allowing much longer contexts than dense transformers within the same budget.
- The adaptive training strategy works: a single model can effectively switch between chunk sizes (4, 8, 16, 32) at inference time to modulate performance vs. speed.
- Unlike linear attention models, which struggle with in-context recall, CAT maintains high recall accuracy even with compression, likely because it retains compressed 'snapshots' of the context rather than a single rolling state.
- Increasing decoder width (2x) is crucial for CAT to match dense transformer perplexity, suggesting compressed decoding requires more expressive capacity.
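To make the O(N/C) scaling concrete, here is a minimal Python sketch that counts how many cached representations would be kept for each chunk size mentioned above. It assumes exactly one compressed representation per chunk and ignores any per-token state kept while decoding inside a chunk. The function name `cached_entries` and the context length of 32,768 tokens are hypothetical, chosen only for illustration, and this is not the paper's implementation.

```python
import math


def cached_entries(seq_len: int, chunk_size: int) -> int:
    """Count cached representations, assuming one per chunk (hypothetical model)."""
    return math.ceil(seq_len / chunk_size)


seq_len = 32_768  # hypothetical context length, not a figure from the paper

for chunk_size in (4, 8, 16, 32):
    entries = cached_entries(seq_len, chunk_size)
    print(f"C={chunk_size:>2}: {entries:>5} cached entries "
          f"vs {seq_len} for one entry per token")
```

Doubling the chunk size halves the number of cached entries, which is the performance versus speed trade-off the adaptive chunk sizes expose at inference time.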