Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

📝 Paper Summary

LLM Inference Efficiency, Memory Management

DapQ compresses the Key-Value cache by constructing pseudo-queries with future positional encodings, leveraging the insight that query position determines attention patterns more than semantic content.

Core Problem

Existing KV cache compression methods rely on input-side observation windows (e.g., the last few prompt tokens) to estimate token importance, but these windows fail to reflect the actual queries that will occur during the decoding phase.

Why it matters:

Misaligned observation windows cause models to discard critical information (like specific 'needles' in long contexts) needed for future generation, leading to hallucination or forgetting
Ground-truth decoding queries are unavailable during the prefill stage, making it difficult to know *a priori* which cached tokens will be attended to
Long-context inference suffers from massive memory footprints; accurate compression is essential to deploy LLMs on constrained hardware
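To give a rough sense of the memory footprint at stake, the sketch below estimates KV cache size per sequence. The default values (32 layers, 8 KV heads, head dimension 128, 2 bytes per value for fp16) are assumptions taken from the publicly released LLaMA-3-8B configuration, not from the paper, and the 8192-token context is likewise an illustrative assumption. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence.

    Defaults are assumed from the public LLaMA-3-8B config with fp16
    storage; adjust them for other models or precisions.
    """
    # Factor of 2 accounts for storing both keys and values.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


MIB = 2 ** 20
full_context = 8192   # assumed context length, for illustration only
budget = 256          # token budget quoted in the evaluation highlights

print(f"per token : {kv_cache_bytes(1) / 1024:.0f} KiB")
print(f"full cache: {kv_cache_bytes(full_context) / MIB:.0f} MiB")
print(f"compressed: {kv_cache_bytes(budget) / MIB:.0f} MiB")
print(f"kept ratio: {budget / full_context:.3%}")
```

Under these assumptions a 256-token budget keeps roughly 3% of an 8192-token cache, which is in line with the budget quoted for the NIAH experiment.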

Concrete Example: In a 'Needle-in-a-Haystack' task where the answer relies on a sentence buried in the middle of a long document, standard methods like SnapKV calculate importance based only on the end of the prompt. Since the end of the prompt may not semantically relate to the buried needle, SnapKV evicts the needle. DapQ simulates the *future* position of the answer generation, correctly identifying and retaining the needle.

Key Novelty

Decoding-aligned KV cache compression via position-aware pseudo queries (DapQ)

Discovers that for attention scoring, the positional encoding of a query vector is far more influential than its semantic content (Where > What)
Constructs 'Pseudo Queries' by appending dummy tokens (copies of prefix/suffix) to the input and assigning them *future* positional IDs corresponding to the decoding steps
Uses these position-aware pseudo queries to probe the Key-Value cache during prefill, retaining only tokens that have high attention scores with the simulated future positions
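The construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_pseudo_queries`, the even split between prefix and suffix copies, and the choice to start the future positions immediately after the prompt are all assumptions made for the example.

```python
def build_pseudo_queries(prompt_ids, window):
    """Append dummy tokens and give them future position IDs.

    Assumptions (not confirmed by the source): the dummy tokens are half
    prefix copies and half suffix copies, and their positions start at
    len(prompt_ids), i.e. where the first decoded token would sit.
    """
    if window > len(prompt_ids):
        raise ValueError("window must not exceed the prompt length")

    half = window // 2
    prefix = prompt_ids[:half]
    suffix = prompt_ids[len(prompt_ids) - (window - half):]
    pseudo_ids = prefix + suffix

    prompt_len = len(prompt_ids)
    prompt_positions = list(range(prompt_len))
    # Future positions: the slots that decoding steps would occupy.
    pseudo_positions = [prompt_len + i for i in range(len(pseudo_ids))]

    extended_ids = prompt_ids + pseudo_ids
    position_ids = prompt_positions + pseudo_positions
    return extended_ids, position_ids, pseudo_ids


prompt = list(range(100, 110))          # ten toy token IDs
ids, pos, pseudo = build_pseudo_queries(prompt, window=4)
print(pseudo)      # dummy token IDs copied from the prompt
print(pos[-4:])    # positions assigned to the dummy tokens
```

Only the positions of the appended tokens matter for probing; their content is a stand-in, which is the point of the "Where > What" observation.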

Architecture

Overview of the DapQ framework compared to standard methods. It illustrates the prefill stage where pseudo queries are appended.

Evaluation Highlights

Achieves 99.5% accuracy on Needle-in-a-Haystack (NIAH) with LLaMA3-8B under a strict 3% KV cache budget (256 tokens), which is near-lossless
+6.75% accuracy improvement over SnapKV on the 'Hard' category of LongBenchV2 using LLaMA3-8B with a budget of 64 tokens
Outperforms SnapKV by 58.2 points (59.6% vs 1.4%) on the Ruler benchmark's S-NIAH-3 task with a budget of 512 tokens

Breakthrough Assessment

8/10

Offers a highly effective solution to the alignment problem in KV cache compression, grounded in the position-dominance finding. Gains over baselines on difficult retrieval tasks are large (e.g., +58.2 points over SnapKV on Ruler S-NIAH-3).

⚙️ Technical Details

Problem Definition

Setting: Autoregressive LLM inference with limited memory for Key-Value (KV) cache

Inputs: Long-context prompt sequence of length L_p

Outputs: Compressed KV cache containing the Top-K most important token states for subsequent decoding

Pipeline Flow

Pseudo-Query Construction
Extended Prefill
Importance Assessment
Cache Compression
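The last two stages of the flow above (importance assessment and cache compression) can be sketched in plain Python. This is a simplified, single-head illustration under stated assumptions: the pseudo-query and key vectors are taken as already position-encoded, scores are aggregated by summing softmax attention over all pseudo queries, and no pooling or per-head budgeting is applied. The source does not specify these details, and the names `softmax` and `select_top_k` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(pseudo_queries, prompt_keys, budget):
    """Return sorted indices of the prompt tokens to keep.

    Assumption: importance = attention mass summed over pseudo queries.
    """
    dim = len(prompt_keys[0])
    scale = 1.0 / math.sqrt(dim)
    importance = [0.0] * len(prompt_keys)

    for q in pseudo_queries:
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in prompt_keys]
        for j, weight in enumerate(softmax(logits)):
            importance[j] += weight

    ranked = sorted(range(len(prompt_keys)),
                    key=lambda j: importance[j], reverse=True)
    # Keep original token order so the compressed cache stays sequential.
    return sorted(ranked[:budget])


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
queries = [[2.0, 0.0], [0.0, 2.0]]
kept = select_top_k(queries, keys, budget=2)
print(kept)  # indices of the keys most attended by the pseudo queries
```

In a real model the same selection would be applied to the key and value tensors of each layer and head, and the pseudo-token entries themselves would be discarded before decoding begins.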

System Modules

Pseudo-Query Constructor

Create synthetic tokens to simulate future queries

Model or implementation: Deterministic Operation

Extended Prefill

Compute attention scores for the extended sequence

Model or implementation: LLM (e.g., LLaMA-3, Qwen2.5)

Eviction Selector

Select critical KV pairs based on aggregated attention scores

Model or implementation: Ranking Algorithm

Novel Architectural Elements

Injection of 'dummy' tokens with *future* positional IDs during the prefill phase, solely to probe the attention distribution
Use of prefix+suffix concatenation as semantic content for pseudo-queries, while relying on RoPE to provide the primary signal
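To illustrate why the positional ID carries so much of the signal under RoPE, the sketch below applies a rotary embedding to toy query and key vectors. It uses the adjacent-pair rotation convention and a base of 10000, both assumptions for illustration (some implementations pair dimensions differently, which does not change the property shown). The helper names `rope` and `dot` are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate adjacent pairs of an even-length vector by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8]   # toy query content
k = [0.2, -0.7, 0.9, 0.4]   # toy key content

# Same content, same relative offset (7): the score is unchanged.
near = dot(rope(q, 10), rope(k, 3))
shifted = dot(rope(q, 107), rope(k, 100))

# Same content, different query position: the score changes.
moved = dot(rope(q, 50), rope(k, 3))

print(near, shifted, moved)
```

Because the score depends on the query's position relative to each key, a query observed at the end of the prompt and a query issued at a later decoding step can rank the same keys differently even when their content is identical.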

Modeling

Base Model: LLaMA-3-8B-Instruct, LLaMA-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B

Comparison to Prior Work

vs. SnapKV: SnapKV uses the *past* (prompt end) to judge importance; DapQ uses simulated *future* positions
vs. H2O: H2O relies on historical accumulation; DapQ anticipates future needs via positional probing
vs. Quest [not cited in paper]: Quest estimates token importance by projecting query vectors; DapQ directly constructs pseudo-queries with correct RoPE to measure attention

Limitations

Relies on the assumption that semantic content is secondary to position, which may not hold for all task types (though empirical results are strong)
Introduces slight computational overhead during prefill due to processing N extra pseudo tokens, where N is the pseudo-query window size
Effectiveness depends on the chosen window size and construction heuristic (prefix+suffix) for pseudo queries

Reproducibility

Code: https://github.com/tianzhenxu/DapQ

📊 Experiments & Results

Evaluation Setup

Long-context inference tasks across various domains (QA, Summarization, Retrieval)

Benchmarks:

LongBench: multi-task suite (Single-doc QA, Multi-doc QA, Summarization)
LongBenchV2: long-context understanding with hardness categories
Ruler: synthetic long-context tasks (e.g., NIAH)
Needle-in-a-Haystack (NIAH): retrieval
HELMET: long-context benchmark

Metrics:

Accuracy
Recall (of eviction strategy)
Time-to-First-Token (TTFT)
Throughput
Statistical methodology: Not explicitly reported in the paper
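
The recall metric for an eviction strategy is not defined in this summary. One plausible reading, assumed here, is the fraction of tokens that an oracle would keep (for example, the Top-K under the true decoding queries) that the strategy also keeps. A minimal sketch under that assumption, with the hypothetical name `eviction_recall`:

```python
def eviction_recall(kept_indices, oracle_indices):
    """Fraction of oracle-selected tokens that the strategy retained.

    Assumed definition; the paper's exact formulation may differ.
    """
    oracle = set(oracle_indices)
    if not oracle:
        raise ValueError("oracle_indices must not be empty")
    return len(oracle & set(kept_indices)) / len(oracle)


# Toy example: the strategy keeps tokens 0-3, the oracle wants 2-5.
print(eviction_recall([0, 1, 2, 3], [2, 3, 4, 5]))
```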

Key Results

Performance on challenging retrieval and reasoning benchmarks demonstrates DapQ's ability to retain critical information under strict budgets.

Benchmark	Metric	Baseline	This Paper	Δ
Ruler (S-NIAH-3)	Accuracy	1.4	59.6	+58.2
LongBenchV2 (Hard)	Accuracy	22.51	29.26	+6.75
HELMET	Average Score	43.74	48.10	+4.36
Needle-in-a-Haystack	Accuracy	100.0	99.5	-0.5

Ablation study on query construction methods validates that position matters more than content.

Benchmark	Metric	Baseline	This Paper	Δ
Decoding Query Similarity	Cosine Similarity	0.3522	0.7238	+0.3716

Experiment Figures

Quantitative analysis of the impact of position vs. content on query representation

Recall rates of different eviction strategies on GovReport dataset

Main Takeaways

Positional information dominates semantic content in determining query representations and attention patterns in RoPE-based models
Simulating future positions using dummy tokens allows for highly accurate identification of 'needle' tokens that standard prompt-based observation windows miss
DapQ enables aggressive compression (e.g., down to 64 tokens) while maintaining significantly higher performance than state-of-the-art baselines like SnapKV and PyramidKV
The method adds negligible time-to-first-token (TTFT) latency and maintains throughput comparable to other compression techniques

📚 Prerequisite Knowledge

Prerequisites

Transformer attention mechanism (Query, Key, Value)
Rotary Positional Embeddings (RoPE)
KV Cache (Key-Value Cache) for autoregressive decoding

Key Terms

KV Cache: A mechanism to store Key and Value states of previous tokens during LLM inference to avoid re-computing them at every step

Prefill Phase: The initial phase of LLM inference where the entire input prompt is processed in parallel to generate the initial KV cache

Decoding Phase: The sequential phase of LLM inference where tokens are generated one by one, attending to the cached keys and values

RoPE: Rotary Positional Embeddings—a method to encode token positions by rotating their vector representations, heavily used in modern LLMs like LLaMA

Pseudo Queries: Artificially constructed query vectors used to probe the importance of cached tokens; in DapQ, they are defined by future positional IDs rather than semantic content

Eviction: The process of removing less important Key-Value pairs from the cache to save memory

NIAH: Needle-In-A-Haystack—a benchmark testing an LLM's ability to retrieve a specific piece of information buried in a very long context

TTFT: Time-To-First-Token—the latency required to process the prompt and generate the first output token