IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

📝 Paper Summary

Efficient Inference, Sparse Attention

IndexCache accelerates sparse attention by running the expensive token selection mechanism on only a few anchor layers and reusing those indices for subsequent layers.

Core Problem

DeepSeek Sparse Attention (DSA) reduces core attention cost, but the 'lightning indexer' (token selector) still runs at quadratic complexity on every layer, dominating latency in long-context scenarios.

Why it matters:

Long-context workflows (agents, RAG) are bottlenecked by attention costs, specifically prefill latency, which grows quadratically with context length
In sparse attention models, the selection mechanism (indexer) remains a non-negligible fraction of total cost, O(NL^2) for N layers and sequence length L, even when core attention is efficient
Existing methods optimize core attention but overlook the redundancy in the token selection step itself

Concrete Example: In a 30B DSA model processing long context, the indexer must score all previous tokens at every single layer (1 to N). However, Layer 5 and Layer 6 often select 70-100% of the same tokens, meaning Layer 6's expensive indexer computation is largely redundant.
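The redundancy in this example can be quantified as the overlap between the top-k index sets of two layers. The following is a minimal sketch, assuming each layer exposes a 1-D array of indexer scores for one query position; the function names and the synthetic scores are illustrative and do not come from the paper.

```python
import numpy as np

def topk_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k largest scores (in no particular order)."""
    k = min(k, scores.shape[-1])
    return np.argpartition(scores, -k)[-k:]

def topk_overlap(scores_a: np.ndarray, scores_b: np.ndarray, k: int) -> float:
    """Fraction of layer A's top-k tokens that are also in layer B's top-k."""
    a = set(topk_indices(scores_a, k).tolist())
    b = set(topk_indices(scores_b, k).tolist())
    return len(a & b) / len(a)

# Synthetic stand-ins for two adjacent layers: layer 6 is modelled as a
# slightly perturbed copy of layer 5 (an assumption made for illustration).
rng = np.random.default_rng(0)
layer5_scores = rng.normal(size=4096)
layer6_scores = layer5_scores + 0.1 * rng.normal(size=4096)
overlap = topk_overlap(layer5_scores, layer6_scores, k=512)
print(f"top-k overlap between the two layers: {overlap:.2%}")
```

A high overlap means the second layer's indexer would mostly rediscover tokens the first layer already selected, which is the case where reuse is cheap.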

Key Novelty

Cross-Layer Index Reuse for Sparse Attention

Partitions layers into 'Full' (retain indexer) and 'Shared' (reuse indices from previous Full layer), exploiting the observation that important tokens remain stable across adjacent layers
Introduces a training-free greedy layer selection algorithm that uses calibration loss to identify which layers must keep their indexers
Proposes a training-aware multi-layer distillation scheme that trains each retained indexer to select tokens that are optimal for the cluster of subsequent layers it serves, not just for its own layer
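The greedy selection idea can be sketched as follows. This is an illustrative reading of the description above, not the paper's algorithm: it starts with every layer Full and repeatedly converts the layer whose conversion yields the lowest calibration loss. The `calib_loss` callable, the `SENSITIVITY` values, and the toy loss are all hypothetical stand-ins for evaluating the real model on calibration data.

```python
from typing import Callable

def greedy_select(num_layers: int, num_full: int,
                  calib_loss: Callable[[str], float]) -> str:
    """Greedily convert Full layers to Shared until num_full indexers remain."""
    if not 1 <= num_full <= num_layers:
        raise ValueError("num_full must be between 1 and num_layers")
    pattern = ["F"] * num_layers
    while pattern.count("F") > num_full:
        best_loss, best_layer = None, None
        # Layer 0 has no preceding Full layer, so it always keeps its indexer.
        for i in range(1, num_layers):
            if pattern[i] != "F":
                continue
            trial = pattern.copy()
            trial[i] = "S"
            loss = calib_loss("".join(trial))  # one calibration evaluation
            if best_loss is None or loss < best_loss:
                best_loss, best_layer = loss, i
        pattern[best_layer] = "S"
    return "".join(pattern)

# Toy loss (assumption): each Shared layer costs its sensitivity times the
# distance to the Full layer it borrows indices from.
SENSITIVITY = [0.0, 0.1, 5.0, 0.1, 0.2, 4.0, 0.1, 0.1]

def toy_loss(pattern: str) -> float:
    loss, src = 0.0, 0
    for i, role in enumerate(pattern):
        if role == "F":
            src = i
        else:
            loss += SENSITIVITY[i] * (i - src)
    return loss

selected = greedy_select(num_layers=8, num_full=3, calib_loss=toy_loss)
print(selected)  # the two high-sensitivity layers (2 and 5) stay Full
```

Each round evaluates one trial pattern per remaining Full layer, so the number of calibration evaluations grows quadratically with the number of layers in this sketch.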

Architecture

The inference workflow of IndexCache compared to standard DSA.

Evaluation Highlights

Achieves up to 1.82x prefill speedup on a 30B DSA model at 200K context length compared to standard DSA
Eliminates 75% of indexer computations (retaining only 1/4 of indexers) with negligible quality degradation across benchmarks
Achieves 1.48x decoding speedup by skipping indexer computations in Shared layers
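As a rough sanity check on these figures, an Amdahl's-law style calculation shows what share of baseline latency the indexer would need to occupy for the reported speedups. This is a back-of-the-envelope sketch that assumes the entire speedup comes from removing 75% of indexer work with no added overhead; the resulting fractions are implied estimates, not numbers reported in the paper.

```python
def implied_indexer_fraction(speedup: float, removed: float = 0.75) -> float:
    """Solve speedup = 1 / (1 - removed * f) for the indexer time fraction f."""
    return (1.0 - 1.0 / speedup) / removed

prefill_fraction = implied_indexer_fraction(1.82)  # roughly 0.60
decode_fraction = implied_indexer_fraction(1.48)   # roughly 0.43
print(f"implied indexer share of prefill: {prefill_fraction:.2f}")
print(f"implied indexer share of decode:  {decode_fraction:.2f}")
```

Under these assumptions the indexer would account for about 60% of prefill time at 200K context, which is consistent with the claim that it dominates long-context latency.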

Breakthrough Assessment

7/10

Provides a practical, significant speedup for production-grade sparse attention (DSA) by addressing a specific quadratic bottleneck (the indexer). The combination of training-free and training-aware methods makes it versatile.

⚙️ Technical Details

Problem Definition

Setting: Optimizing inference latency for Large Language Models using Sparse Attention mechanisms

Inputs: Input token sequence of length L

Outputs: Next token prediction

Pipeline Flow

Layer Role Assignment (Static config: Full or Shared)
Inference Loop (Iterate layers 1 to N)
Conditional Indexing (If Full: Compute; If Shared: Reuse)
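The three steps above can be sketched as a single loop. This is a minimal illustration, assuming the pattern is a string of 'F' and 'S' characters and that the indexer and attention are passed in as callables; `run_layers`, `toy_indexer`, and `toy_attend` are hypothetical names, not part of any DSA implementation.

```python
from typing import Callable, List, Sequence, Tuple

def run_layers(pattern: str,
               hidden: List[float],
               indexers: Sequence[Callable],
               attend: Callable) -> Tuple[List[float], int]:
    """Run all layers, computing indices only at Full ('F') layers."""
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be Full to populate the cache")
    cached_indices = None
    indexer_calls = 0
    for layer, role in enumerate(pattern):
        if role == "F":
            # Full layer: run the indexer and refresh the cache.
            cached_indices = indexers[layer](hidden)
            indexer_calls += 1
        # Shared layers fall through and reuse cached_indices unchanged.
        hidden = attend(layer, hidden, cached_indices)
    return hidden, indexer_calls

# Toy stand-ins (assumptions): the "indexer" picks the two largest entries,
# and "attention" mixes their mean back into every position.
def toy_indexer(hidden):
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    return order[:2]

def toy_attend(layer, hidden, indices):
    mean = sum(hidden[i] for i in indices) / len(indices)
    return [h + 0.1 * mean for h in hidden]

out, calls = run_layers("FSSSFSSS", [1.0, 3.0, 2.0, 0.5],
                        [toy_indexer] * 8, toy_attend)
print(calls)  # indexer runs once per 'F' in the pattern
```

The cache here is a single variable because each Shared layer only ever needs the indices from the nearest preceding Full layer.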

System Modules

Layer Role Configuration

Determines whether layer l is 'Full' (F) or 'Shared' (S) based on a pre-computed pattern string c

Model or implementation: Binary Pattern String

Lightning Indexer

Scores preceding tokens and selects Top-k indices. Only runs if Role is 'Full'.

Model or implementation: Low-rank projection + Multi-head ReLU-gated dot product
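A multi-head ReLU-gated dot-product score of this kind can be sketched in NumPy. The shapes, the per-head weights `w`, and the single shared key per token are assumptions made for illustration; the low-rank projections that would produce `q` and `k` are omitted, and the real DSA indexer may differ in its details.

```python
import numpy as np

def indexer_topk(q: np.ndarray, k: np.ndarray, w: np.ndarray, top_k: int):
    """
    q: (T, H, d) indexer queries, k: (T, d) indexer keys, w: (T, H) head weights.
    Returns (indices, valid): indices is (T, min(top_k, T)); valid marks the
    entries that point at a real (non-future) token.
    """
    T = q.shape[0]
    dots = np.einsum("thd,sd->ths", q, k)            # per-head q.k, (T, H, T)
    scores = np.einsum("th,ths->ts", w, np.maximum(dots, 0.0))  # ReLU, then weighted sum over heads
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # causal mask
    kk = min(top_k, T)
    indices = np.argsort(-scores, axis=-1)[:, :kk]   # highest scores first
    valid = np.isfinite(np.take_along_axis(scores, indices, axis=-1))
    return indices, valid

rng = np.random.default_rng(0)
T, H, d = 6, 2, 4
indices, valid = indexer_topk(rng.normal(size=(T, H, d)),
                              rng.normal(size=(T, d)),
                              rng.normal(size=(T, H)), top_k=3)
print(indices.shape)
```

The intermediate `scores` matrix has one entry per query-key pair, which is where the quadratic cost in sequence length comes from even though each entry is cheap to compute.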

Index Cache

Stores indices from the nearest preceding 'Full' layer to be used by subsequent 'Shared' layers.

Model or implementation: Memory Buffer

Sparse Core Attention

Performs attention only over the selected Top-k tokens.

Model or implementation: Multi-head Latent Attention (MLA)

Novel Architectural Elements

Hybrid layer architecture where 'Shared' layers physically bypass the indexer module and load indices from a cache populated by 'Full' layers
Conditional inference branch introduced into the standard DSA block

Modeling

Base Model: 30B DSA model (DeepSeek Sparse Attention) and GLM-5 744B (preliminary)

Training Method: Multi-layer Distillation (Training-aware IndexCache)

Objective Functions:

Purpose: Train the indexer at a Full layer to predict tokens relevant for itself and all subsequent Shared layers it serves.

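Formally: $\mathcal{L}_{\text{multi}} = \sum_{j=0}^{m} D_{\mathrm{KL}}\left(p^{(l+j)} \,\|\, q^{(l)}\right)$, where $p^{(l+j)}$ is the target attention distribution of layer $l+j$, $q^{(l)}$ is the indexer output distribution at Full layer $l$, and $m$ is the number of Shared layers that layer $l$ serves.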
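The objective above can be written out numerically as a sum of KL divergences against a single indexer distribution. This is a minimal sketch, assuming each distribution is a 1-D probability vector over tokens for one query position; the function names and example values are illustrative only.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions over the same tokens."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def multi_layer_loss(targets, q: np.ndarray) -> float:
    """Sum of D_KL(p^(l+j) || q^(l)) over j = 0..m.

    targets[0] is the Full layer's own target attention; targets[1:] are the
    targets of the Shared layers that reuse its indices.
    """
    return sum(kl_divergence(p, q) for p in targets)

# Example values (assumptions): one Full layer serving two Shared layers.
q_l = np.array([0.5, 0.3, 0.2])
targets = [np.array([0.5, 0.3, 0.2]),   # j = 0, matches q exactly
           np.array([0.4, 0.4, 0.2]),   # j = 1
           np.array([0.6, 0.2, 0.2])]   # j = 2
loss = multi_layer_loss(targets, q_l)
print(f"multi-layer distillation loss: {loss:.4f}")
```

Because one `q` is shared across all terms, minimising the sum pulls the indexer toward a compromise distribution that covers the tokens every served layer needs.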

Adaptation: Fine-tuning of indexer parameters

Trainable Parameters: Lightning Indexer weights

Training Data:

Uses standard continued pre-training data

Key Hyperparameters:

k: 2048 (number of selected tokens)

Compute: Not reported in the paper

Comparison to Prior Work

vs. DSA: IndexCache skips the indexer in ~75% of layers by reusing cached indices
vs. Anchor-based Full Attention: IndexCache does not use Full Attention at all; it reuses indices from the lightweight sparse indexer itself
vs. Uniform Interleaving: IndexCache uses greedy search to preserve critical indexers rather than skipping periodically
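For reference, the uniform interleaving baseline in the last comparison is simple to express as a pattern string. This sketch assumes the 'F'/'S' string encoding and a fixed period; the helper names are hypothetical.

```python
def uniform_pattern(num_layers: int, period: int = 4) -> str:
    """Keep the indexer on every `period`-th layer, starting with layer 0."""
    return "".join("F" if i % period == 0 else "S" for i in range(num_layers))

def retained_fraction(pattern: str) -> float:
    """Share of layers that still run their own indexer."""
    return pattern.count("F") / len(pattern)

pattern = uniform_pattern(8)
print(pattern, retained_fraction(pattern))
```

A searched pattern keeps the same fraction of indexers but is free to place the 'F' positions unevenly, on whichever layers the calibration loss marks as critical.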

Limitations

Greedy search for layer patterns requires O(N^2) forward passes (mitigated by block-wise search)
Cross-layer reuse assumes stability; if a layer requires radically different tokens than its predecessor, performance may drop
Training-free method relies on calibration data being representative of downstream tasks

Reproducibility

Methodology is described in detail (algorithms and loss functions). Specific code URL and model weights are not provided in the text. Calibration data specifics are generic (cached mini-batches).

📊 Experiments & Results

Evaluation Setup

Long-context inference and reasoning tasks

Benchmarks:

Various long-context and reasoning benchmarks (Long-context QA / Reasoning)

Metrics:

Prefill Latency / Speedup
Decode Latency / Speedup
Reduction in Indexer Computation (FLOPs)
Language Modeling Loss / Downstream Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance improvements on a 30B DSA model using IndexCache compared to standard DSA.
Benchmark	Metric	Baseline	This Paper	Δ
30B DSA Model (200K Context)	Prefill Speedup (x)	1.00	1.82	+0.82
30B DSA Model (200K Context)	Decode Speedup (x)	1.00	1.48	+0.48
30B DSA Model	Indexer Computation Retained (%)	100	25	-75
GLM-5 744B	Speedup (x)	1.0	1.3	+0.3

Experiment Figures

Latency breakdown of DSA inference as context length increases, and the speedup from IndexCache.

Main Takeaways

Cross-layer index redundancy in DSA is high (70-100% overlap between adjacent layers), enabling aggressive sharing.
Uniform interleaving (e.g., keeping every 4th indexer) degrades quality because indexer importance is non-uniform; some layers are 'critical'.
Greedy layer selection identifies these critical layers using calibration loss, allowing 75% removal with negligible degradation.
Multi-layer distillation allows even uniform patterns to perform well by explicitly training the indexer to serve multiple layers.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Self-Attention quadratic complexity
Familiarity with Sparse Attention mechanisms (specifically DeepSeek Sparse Attention)
Background in knowledge distillation (KL divergence)

Key Terms

DeepSeek Sparse Attention (DSA): A sparse attention mechanism that uses a lightweight 'lightning indexer' to select top-k relevant tokens for core attention

Lightning Indexer: A module in DSA that scores all preceding tokens to determine which ones should be attended to (computationally cheaper than full attention but still quadratic)

Top-k: A selection strategy that keeps only the k elements with the highest scores

Cross-layer stability: The empirical observation that consecutive transformer layers often attend to the same or highly similar sets of tokens

Calibration set: A small set of data used to evaluate model sensitivity to changes (like removing indexers) without full retraining

Distillation: Training a model (student) to match the output distribution of another model or objective (teacher)

MLA: Multi-head Latent Attention, the core attention mechanism used within the DSA framework

Prefill: The initial phase of LLM inference where the prompt is processed to generate the first token (often compute-bound due to context length)

Greedy search: An algorithmic approach that makes the locally optimal choice at each step (here, deciding which layer to convert to 'Shared' to minimize loss)