QKV cache: Stored Query, Key, and Value tensors from the attention mechanism, allowing the model to skip recalculating these matrices for previously seen text segments
Prefilling: The initial phase of LLM inference where the model processes the input prompt (query + retrieved documents) to generate the first token
Decoding: The sequential phase of LLM inference where the model generates the response token-by-token
Semantic cache: A storage system that saves query-answer pairs and, for a new query, returns a stored answer when the new query's embedding is close enough to that of a previously seen query
QA bank: The layer in PerCache that stores historical query-answer pairs for direct retrieval
Knowledge bank: The layer in PerCache that stores raw text chunks and their corresponding pre-computed QKV tensors
RAGCache: A baseline method that organizes KV caches of retrieved documents in a tree structure to maximize prefix sharing
Sparsity: In this context, the low frequency and high variance of a single user's queries, which make it hard to build a useful cache history
Reactive population: Updating the cache only after a user makes a query and a cache miss occurs (standard approach)
Predictive population: Proactively generating potential queries and their tensors and caching them during device idle time
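
To make the semantic cache and reactive population entries concrete, here is a minimal sketch in Python. It is not PerCache's implementation: `SemanticCache`, `answer_with_cache`, and `toy_embed` are hypothetical names, the letter-frequency embedding is only a stand-in for a real sentence-embedding model, and the 0.9 similarity threshold is an arbitrary illustrative value.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (assumption): counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (query embedding, answer) pairs and matches by similarity."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # illustrative value, not from the source
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the answer of the most similar stored query, or None on a miss."""
        q_vec = self.embed(query)
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:
            score = cosine_similarity(q_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def insert(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))


def answer_with_cache(cache: SemanticCache, query: str,
                      generate: Callable[[str], str]) -> str:
    """Reactive population: the cache is filled only after a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached          # hit: skip inference entirely
    answer = generate(query)   # miss: run the (expensive) model
    cache.insert(query, answer)
    return answer
```

Predictive population would call `cache.insert` ahead of time, during idle periods, for queries the system expects the user to ask, rather than waiting for a miss inside `answer_with_cache`.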