PerCache: Predictive Hierarchical Cache for RAG Applications on Mobile Devices

📝 Paper Summary

Memory recall, Modularized RAG pipeline

PerCache reduces mobile RAG latency by predictively caching both query-answer pairs and intermediate QKV tensors, leveraging the observation that users frequently repeat queries and retrieve overlapping knowledge chunks.

Core Problem

Mobile RAG applications suffer from high latency because on-device resources are limited and prompts are long. Existing caching solutions (KV cache or semantic cache) fall short because each targets a single inference stage and populates its cache reactively, which leads to low hit rates under sparse single-user query patterns.

Why it matters:

Mobile devices have limited parallel computing capabilities, making both prefilling and decoding stages significant latency bottlenecks that single-stage caches cannot fully address.
Single-user mobile queries are sparse and semantically varied compared to cloud settings, causing reactive caches to remain cold and ineffective.
Privacy-sensitive applications like meeting assistants require low latency to be usable, but current approaches incur delays up to 10x longer than standard queries due to retrieval overhead.

Concrete Example: A user asks 'When is the rehearsal?' and later 'Is time of rehearsal given?'. A standard KV cache misses the second query because the prompts differ slightly, while a semantic cache misses if the similarity is below a high threshold. Meanwhile, both queries retrieve the same document chunks, but standard systems re-compute the attention tensors for these chunks from scratch every time.

Key Novelty

Predictive Hierarchical Caching (PerCache)

Hierarchical structure: Caches results at two levels—semantic QA pairs (to skip inference entirely) and QKV tensors of retrieved chunks (to skip prefilling computation)—maximizing reuse across different stages.
Predictive population: Instead of waiting for users to ask questions, the system proactively generates potential future queries based on knowledge content and history during idle time to populate the cache.
Resource-aware scheduling: Dynamically manages cache size and converts between QA and QKV storage based on real-time device memory and computation constraints.

Architecture

Overview of PerCache system architecture, detailing the hierarchical cache (QA Bank + Knowledge Bank) and the predictive population mechanism.

Evaluation Highlights

Reduces end-to-end latency by up to 34.4% compared to the best-performing baseline (RAGCache) across various applications.
Improves the QKV cache hit rate by up to 37.56% and the QA bank hit rate by up to 13.8% using the predictive mechanism.
Maintains optimal latency under dynamic resource changes by elastically bypassing cache population (reducing overhead by 14.12%).

Breakthrough Assessment

7/10

Significant practical contribution for on-device AI. It addresses the specific 'sparsity' problem of single-user caches with a novel predictive mechanism, though the core concept combines existing caching strategies (semantic + KV).

⚙️ Technical Details

Problem Definition

Setting: On-device Retrieval-Augmented Generation (RAG) for single-user personalized applications

Inputs: User query q, Personal Knowledge Base (chunks)

Outputs: Generated response (either retrieved from QA cache or generated via LLM with accelerated prefilling)

Pipeline Flow

Prediction: Query Prediction Module → Cache Population (Idle Time)
Inference: Input Query → QA Bank Check → Knowledge Bank Retrieval → LLM Inference (with QKV reuse) → Output
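
The inference path above can be read as a two-level lookup. The following is a minimal sketch of that control flow, not the paper's implementation: `answer_query`, its callable parameters, and the path labels are all hypothetical names chosen for illustration, and the QKV cache is modeled as a plain dictionary keyed by chunk text.

```python
from typing import Callable, Optional


def answer_query(
    query: str,
    qa_lookup: Callable[[str], Optional[str]],
    retrieve_chunks: Callable[[str], list[str]],
    qkv_cache: dict[str, object],
    generate: Callable[[str, list[str], dict[str, object]], str],
) -> tuple[str, str]:
    """Return (answer, path); path records which cache level served the query.

    All parameters are illustrative stand-ins (assumptions), not PerCache APIs.
    """
    # Level 1: a QA bank hit skips LLM inference entirely.
    cached = qa_lookup(query)
    if cached is not None:
        return cached, "qa_hit"

    # Level 2: retrieve chunks and collect any precomputed tensors for them.
    chunks = retrieve_chunks(query)
    reusable = {c: qkv_cache[c] for c in chunks if c in qkv_cache}

    # The engine is expected to skip prefill work for chunks found in `reusable`.
    answer = generate(query, chunks, reusable)
    return answer, "qkv_reuse" if reusable else "full_inference"
```

Passing the lookup, retriever, and generator in as callables keeps the sketch independent of any particular embedding model or inference engine.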

System Modules

Query Prediction Module

Predicts likely future queries based on historical queries and knowledge content to pre-populate the cache

Model or implementation: LLM (Same as inference model, e.g., Llama-3.2-3B)
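
As a rough illustration of predictive population, the sketch below fills both cache levels from a list of predicted queries while the device stays idle. It assumes the predicted queries have already been produced (in the paper, by the LLM); `populate_cache`, `is_idle`, and the dictionary-based caches are hypothetical simplifications, and the QA bank is keyed by exact query text for brevity.

```python
from typing import Callable, Iterable


def populate_cache(
    predicted_queries: Iterable[str],
    is_idle: Callable[[], bool],
    retrieve_chunks: Callable[[str], list[str]],
    compute_qkv: Callable[[str], object],
    generate: Callable[[str, list[str]], str],
    qa_bank: dict[str, str],
    qkv_cache: dict[str, object],
) -> int:
    """Pre-populate both cache levels; return how many QA pairs were added.

    Every parameter is an illustrative stand-in (assumption).
    """
    populated = 0
    for query in predicted_queries:
        # Stop as soon as the device is no longer idle.
        if not is_idle():
            break
        if query in qa_bank:
            continue
        chunks = retrieve_chunks(query)
        # Precompute tensors for chunks that have not been cached yet.
        for chunk in chunks:
            if chunk not in qkv_cache:
                qkv_cache[chunk] = compute_qkv(chunk)
        # Store the answer so a later matching query can skip inference.
        qa_bank[query] = generate(query, chunks)
        populated += 1
    return populated
```

Checking `is_idle` once per predicted query keeps the loop interruptible at a coarse granularity; a real system would likely need finer-grained preemption.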

QA Bank (Inference / Retrieval)

Stores query-answer pairs and embeddings to skip inference for semantically similar queries

Model or implementation: Embedding Model
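
A semantic lookup of this kind is commonly implemented as a nearest-neighbor search over query embeddings with a similarity threshold. The sketch below assumes cosine similarity and a threshold of 0.85; both are illustrative choices, not values taken from the paper, and `QABank` and `embed` are hypothetical names.

```python
import math
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class QABank:
    """Toy semantic cache: linear scan over stored query embeddings."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.embed = embed          # stand-in for the embedding model (assumption)
        self.threshold = threshold  # assumed value, not from the paper
        self.entries: list[tuple[Vector, str]] = []

    def add(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        """Return the best-matching cached answer, or None below the threshold."""
        if not self.entries:
            return None
        q = self.embed(query)
        score, answer = max(
            ((cosine(q, vec), ans) for vec, ans in self.entries),
            key=lambda pair: pair[0],
        )
        return answer if score >= self.threshold else None
```

A higher threshold reduces wrong answers served from the cache at the cost of more misses, which is the trade-off behind the semantic cache miss in the rehearsal example.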

Knowledge Bank (with Cache Slicer) (Inference / Retrieval)

Retrieves relevant text chunks and checks for pre-computed QKV tensors

Model or implementation: Dense Retriever
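
Chunk-level tensor reuse can be pictured as memoization keyed by chunk content. The sketch below is a simplification under stated assumptions: `KnowledgeBank.get_qkv` and `compute_qkv` are hypothetical names, the tensors are opaque objects, and the sketch does not model the position dependence of attention states that a real system has to handle.

```python
import hashlib
from typing import Any, Callable


class KnowledgeBank:
    """Toy chunk-level cache: one tensor bundle per distinct chunk text."""

    def __init__(self, compute_qkv: Callable[[str], Any]):
        self.compute_qkv = compute_qkv  # stand-in for the prefill computation
        self.tensors: dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def chunk_key(chunk: str) -> str:
        # Hash the content so identical chunks share one cache entry.
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def get_qkv(self, chunk: str) -> Any:
        """Return cached tensors for the chunk, computing them on a miss."""
        key = self.chunk_key(chunk)
        if key in self.tensors:
            self.hits += 1
        else:
            self.misses += 1
            self.tensors[key] = self.compute_qkv(chunk)
        return self.tensors[key]
```

With this shape, two differently worded queries that retrieve the same chunk trigger the expensive computation only once.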

LLM Inference Engine

Generates the final response using retrieved context and reusing QKV tensors where available

Model or implementation: Llama-3.2-3B (or similar mobile LLM)

Cache Scheduler

Dynamically manages cache storage and computation, deciding when to evict or convert QA pairs to QKV tensors

Model or implementation: Heuristic / Optimization logic
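
The summary does not specify the scheduler's policy, so the sketch below shows only one plausible ingredient: keeping cached entries within a byte budget that can shrink at runtime, using least-recently-used eviction. LRU, `BudgetedCache`, and the explicit `size_bytes` argument are assumptions made for illustration, and the QA-to-QKV conversion step is not modeled.

```python
from collections import OrderedDict
from typing import Any, Optional


class BudgetedCache:
    """Toy memory-bounded cache with LRU eviction (assumed policy)."""

    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        # Maps key -> (value, size in bytes); oldest entries come first.
        self.items: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key][0]

    def put(self, key: str, value: Any, size_bytes: int) -> list:
        if key in self.items:
            self.used_bytes -= self.items.pop(key)[1]
        self.items[key] = (value, size_bytes)
        self.used_bytes += size_bytes
        return self._evict()

    def resize(self, budget_bytes: int) -> list:
        """Apply a new budget, e.g. after available memory drops."""
        self.budget_bytes = budget_bytes
        return self._evict()

    def _evict(self) -> list:
        # Drop least recently used entries until the budget is respected.
        evicted = []
        while self.used_bytes > self.budget_bytes and self.items:
            key, (_, size) = self.items.popitem(last=False)
            self.used_bytes -= size
            evicted.append(key)
        return evicted
```

Calling `resize` with a smaller budget models the scheduler reacting to memory pressure; the returned keys identify which entries were dropped.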

Novel Architectural Elements

Hierarchical caching combining QA semantic matching and QKV tensor reuse in a single pipeline
Predictive cache population loop running alongside the inference pipeline
Cross-layer cache conversion mechanism managed by a resource-aware scheduler

Modeling

Base Model: Llama-3.2-3B, Qwen2.5-3B, or Gemma-2-2B (depending on experiment)

Comparison to Prior Work

vs. RAGCache: PerCache adds a semantic QA layer and predictive population, addressing decoding latency and query sparsity, whereas RAGCache only optimizes prefilling via KV reuse.
vs. GPTCache: PerCache includes a fallback to QKV reuse if semantic match fails, ensuring acceleration even for partial hits, whereas GPTCache reverts to full inference on miss.
vs. CacheBlend [not cited in paper]: CacheBlend focuses on blending KV caches for large batch processing; PerCache focuses on single-user mobile latency and predictive population.

Limitations

Predictive caching consumes battery and compute during idle time, which may impact mobile battery life (though the paper claims the system adapts to resource constraints)
Effectiveness relies heavily on the quality of the query prediction model; poor predictions waste storage
Requires local storage for both QA pairs and potentially large QKV tensors, which is constrained on mobile devices
Evaluation is limited to specific QA datasets; generalization to complex reasoning or creative tasks is untested

📊 Experiments & Results

Evaluation Setup

RAG QA on mobile devices (Google Pixel 7 / Xiaomi 14) using personal data datasets

Benchmarks:

Email Dataset (Question Answering over personal emails) [New]
Dialog Dataset (Question Answering over daily dialogue transcripts) [New]
Public Datasets (details not fully specified in snippet) (RAG QA)

Metrics:

End-to-end Latency (ms)
Cache Hit Rate (QA Bank and QKV Cache)
Token Acceptance Rate
Memory Usage
Statistical methodology: Not explicitly reported in the paper

Key Results

PerCache demonstrates significant latency reduction compared to baselines across different applications.
Benchmark	Metric	Baseline	This Paper	Δ
Email / Dialog Datasets	Latency Reduction (vs best baseline)	Not reported in the paper	Not reported in the paper	34.4%
Email / Dialog Datasets	QKV Cache Hit Rate Improvement	Not reported in the paper	Not reported in the paper	+37.56%
Email / Dialog Datasets	QA Bank Hit Rate Improvement	Not reported in the paper	Not reported in the paper	+13.8%
Resource Constraint Simulation	Computational Overhead Reduction	Not reported in the paper	Not reported in the paper	14.12%

Experiment Figures

Latency breakdown of three queries on Mobile vs. Server, illustrating why single-stage caching fails on mobile.

Sparsity analysis of single-user queries showing low overlap ratio (Fig 5) and low semantic similarity (Fig 6) when using reactive caching.

Main Takeaways

Hierarchical caching is essential for mobile RAG because mobile inference latency is distributed across both prefilling and decoding; single-stage caches (KV-only or Semantic-only) miss significant optimization opportunities.
Predictive cache population significantly boosts hit rates for single-user scenarios where query history is too sparse to warm up a reactive cache effectively.
The system effectively trades storage for compute, using idle time to pre-compute tensors, which is viable on modern mobile devices that have ample memory but thermally constrained compute.
Repeated retrieval of specific knowledge chunks is common in personal RAG (e.g., email/schedule), making chunk-level QKV caching highly effective.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) workflow
Transformer architecture (specifically Attention mechanism and KV caching)
Mobile hardware constraints (latency, memory)

Key Terms

QKV cache: Stored Query, Key, and Value tensors from the attention mechanism, allowing the model to skip recalculating these matrices for previously seen text segments

Prefilling: The initial phase of LLM inference where the model processes the input prompt (query + retrieved documents) to generate the first token

Decoding: The sequential phase of LLM inference where the model generates the response token-by-token

Semantic cache: A storage system that saves query-answer pairs and retrieves answers based on the semantic similarity (embedding distance) of new queries

QA bank: The layer in PerCache that stores historical query-answer pairs for direct retrieval

Knowledge bank: The layer in PerCache that stores raw text chunks and their corresponding pre-computed QKV tensors

RAGCache: A baseline method that organizes KV caches of retrieved documents in a tree structure to maximize prefix sharing

Sparsity: In this context, the low frequency and high variance of queries from a single user, making it hard to build a useful cache history

Reactive population: Updating the cache only after a user makes a query and a cache miss occurs (standard approach)

Predictive population: Proactively generating and caching potential queries/tensors during device idle time