Evaluation Setup
The method is evaluated on long-context understanding and generation tasks under varying KV cache budgets.
Benchmarks:
- LongBench: multi-task long-context understanding (16 English tasks)
- RULER: Needle-in-a-Haystack style synthetic tasks
- LongProc (HTML to TSV): long-form output generation
- MT-Bench: multi-turn conversation
Metrics:
- Average Score (LongBench, RULER)
- Eviction Latency / Overhead
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Eviction Cost | Eviction Overhead (relative to inference) | 31.32 | 2.16 | -29.16 |
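The Δ value and the ratio between the two overheads can be checked directly from the reported numbers. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the unit of the overhead values is left as reported, since it is not stated here. The resulting ratio appears consistent with the 14.5x eviction speedup claimed for LookaheadKV.

```python
# Values copied from the Key Results table (unit as reported, not stated here).
baseline_overhead = 31.32   # baseline eviction overhead relative to inference
lookahead_overhead = 2.16   # overhead reported for this paper's method

# Delta column: this paper minus baseline.
delta = lookahead_overhead - baseline_overhead

# Ratio of the two overheads, i.e. how many times smaller the new overhead is.
ratio = baseline_overhead / lookahead_overhead

print(f"delta = {delta:.2f}")
print(f"ratio = {ratio:.1f}x")
```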
Main Takeaways
- LookaheadKV solves the latency bottleneck of draft-based eviction methods, achieving up to 14.5x faster eviction while matching or exceeding their accuracy.
- The method is robust across varying cache budgets (from 64 to 2048 tokens), often outperforming baselines significantly in low-budget settings.
- Although trained on 16K-token contexts, the method generalizes effectively to 32K-token contexts (demonstrated on RULER).
- Lookahead LoRA is efficient, adding less than 0.5% additional parameters, and its selective activation preserves the original model's behavior for normal tokens.
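To give a feel for why a LoRA adapter adds so few parameters, the sketch below counts adapter parameters using the standard LoRA factorization, in which a rank-r update to a d_out x d_in weight adds r * (d_in + d_out) parameters. The function names, layer shapes, rank, and base-model size are all hypothetical and chosen only for easy arithmetic; they are not the configuration used in the paper.

```python
from typing import Sequence, Tuple


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one rank-`rank` LoRA adapter on a d_out x d_in weight."""
    # LoRA writes the update as B @ A with A: rank x d_in and B: d_out x rank.
    return rank * (d_in + d_out)


def lora_overhead(base_params: int,
                  adapted_shapes: Sequence[Tuple[int, int]],
                  rank: int) -> float:
    """Fraction of extra parameters relative to the base model."""
    added = sum(lora_param_count(d_in, d_out, rank)
                for d_in, d_out in adapted_shapes)
    return added / base_params


# Hypothetical toy model: 1,000,000 base parameters, four adapted
# 100 x 100 projection matrices, rank 2 (assumed values, not from the paper).
toy_shapes = [(100, 100)] * 4
overhead = lora_overhead(base_params=1_000_000, adapted_shapes=toy_shapes, rank=2)
print(f"added fraction: {overhead:.4%}")
```

Because the added count grows with r * (d_in + d_out) rather than d_in * d_out, the overhead stays small when the rank is much smaller than the layer width.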