MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

📝 Paper Summary

LLM Serving Systems · Memory Management · Distributed Inference

MemServe introduces a unified memory pool (MemPool) that enables LLM serving systems to simultaneously support both context caching (inter-request optimization) and disaggregated inference (intra-request optimization) for the first time.

Core Problem

Existing LLM serving systems cannot simultaneously apply inter-request optimizations (context caching) and intra-request optimizations (disaggregated inference) because they lack mechanisms to manage and transfer KV cache data across distributed instances flexibly.

Why it matters:

Current systems treat KV cache as intermediate data scoped to a single request/instance, preventing reuse in distributed settings like disaggregated inference.
Without such mechanisms, high-efficiency techniques such as prefill/decode disaggregation cannot benefit from cached shared prompts, which wastes compute and increases latency.
Existing schedulers (load-based or session-based) fail to maximize KV cache reuse across loosely coupled sessions in distributed environments.

Concrete Example: In disaggregated inference, a request is split into prefill and decode phases on different GPUs. Current context caching methods cannot reuse the KV cache generated during the decode phase back at the prefill instance for future requests, nor efficiently transfer historical cache between them, preventing the combined benefits of both techniques.

Key Novelty

MemPool: An Elastic Memory Pool for Distributed KV Cache

Decouples memory management from the inference engine by introducing a unified substrate (MemPool) that manages all cluster memory (GPU HBM and CPU DRAM).
Provides a unified API for identifying, indexing, and transferring KV cache data across different physical instances, enabling data flow between prefill and decode nodes.
Uses a global prompt tree scheduler to route requests to instances holding relevant cached data, maximizing cache hits even in distributed settings.

Architecture

The overall architecture of MemServe, showing the interaction between the Global Scheduler, Inference Instances (Prefill/Decode), and the unified MemPool.

Evaluation Highlights

MemPool-based disaggregated inference improves Job Completion Time (JCT) by up to 42% compared to standard colocated serving (PD-colocated) on ShareGPT workloads.
Enhancing disaggregated inference with context caching further improves JCT by 29% over the non-cached disaggregated baseline.
On the LooGLE dataset (long prompts), combining disaggregated inference with context caching improves JCT by 26.9% compared to disaggregated inference alone.

Breakthrough Assessment

8/10

Strong systems contribution. Effectively solves the 'either-or' problem between two critical LLM optimizations (caching vs. disaggregation) via a novel memory abstraction, with significant performance gains.

⚙️ Technical Details

Problem Definition

Setting: Distributed LLM inference serving supporting stateful optimizations

Inputs: Stream of inference requests with varying prompt lengths and shared prefixes

Outputs: Generated tokens with minimized latency (TTFT) and job completion time (JCT)

Pipeline Flow

Global Scheduler: receives requests → routes to instances using Global Prompt Tree
MemPool (on Instance): manages local HBM/DRAM → indexes KV cache via Radix Tree
Inference Engine (vLLM): computes attention → requests/transmits KV cache via MemPool APIs

System Modules

Global Scheduler

Routes requests to the instance with the highest potential for KV cache reuse

Model or implementation: N/A (Logic component)
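
The routing idea can be illustrated with a minimal sketch. It assumes a token-level tree whose nodes record which instances hold a cached prefix, with ties broken by load. The names `GlobalPromptTree`, `record`, `match_lengths`, and `route` are hypothetical and are not taken from the paper, which may use a different policy.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTreeNode:
    """One node per token; records which instances cached the prefix ending here."""
    children: dict = field(default_factory=dict)
    instances: set = field(default_factory=set)


class GlobalPromptTree:
    """Hypothetical global index of cached prompt prefixes (illustration only)."""

    def __init__(self):
        self.root = PromptTreeNode()

    def record(self, tokens, instance):
        """Note that `instance` now holds KV cache for every prefix of `tokens`."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PromptTreeNode())
            node.instances.add(instance)

    def match_lengths(self, tokens):
        """Return {instance: longest cached prefix length} for this prompt."""
        best = {}
        node = self.root
        for depth, tok in enumerate(tokens, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            for inst in node.instances:
                best[inst] = depth
        return best


def route(tree, tokens, load):
    """Pick the instance with the longest cached prefix; break ties by lower load."""
    matched = tree.match_lengths(tokens)
    return max(load, key=lambda inst: (matched.get(inst, 0), -load[inst]))


tree = GlobalPromptTree()
tree.record([1, 2, 3, 4], "p0")
tree.record([1, 2, 9], "p1")
load = {"p0": 5, "p1": 1}  # assumed per-instance queue depths
print(route(tree, [1, 2, 3, 7], load))  # p0 holds the longer matching prefix
```

With no cached prefix at all, this sketch degrades to plain least-loaded routing, which is one plausible fallback rather than a documented behavior.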

MemPool

Manages physical memory (HBM/DRAM), indexes cached data, and handles data transfer between instances

Model or implementation: N/A (System component)
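
As a rough illustration of managing HBM and DRAM as one pool, the sketch below keeps a small "HBM" tier backed by a larger "DRAM" tier and swaps blocks between them. The LRU policy and the `TwoTierPool` class are assumptions made for this example; the summary does not state which eviction policy MemPool uses.

```python
from collections import OrderedDict


class TwoTierPool:
    """Toy pool: a small 'HBM' tier backed by a larger 'DRAM' tier, both LRU."""

    def __init__(self, hbm_blocks, dram_blocks):
        self.capacity = {"hbm": hbm_blocks, "dram": dram_blocks}
        self.tiers = {"hbm": OrderedDict(), "dram": OrderedDict()}

    def put(self, block_id, data):
        """Store a block in HBM, demoting the least recently used block if full."""
        self.tiers["dram"].pop(block_id, None)
        self._insert("hbm", block_id, data)

    def get(self, block_id):
        """Return a block, promoting it to HBM if it was swapped out to DRAM."""
        if block_id in self.tiers["hbm"]:
            self.tiers["hbm"].move_to_end(block_id)
            return self.tiers["hbm"][block_id]
        if block_id in self.tiers["dram"]:
            data = self.tiers["dram"].pop(block_id)
            self._insert("hbm", block_id, data)
            return data
        return None  # cache miss: the caller must recompute the KV data

    def where(self, block_id):
        """Report which tier currently holds the block, or None."""
        for name, tier in self.tiers.items():
            if block_id in tier:
                return name
        return None

    def _insert(self, tier, block_id, data):
        blocks = self.tiers[tier]
        blocks[block_id] = data
        blocks.move_to_end(block_id)
        if len(blocks) > self.capacity[tier]:
            victim, victim_data = blocks.popitem(last=False)
            if tier == "hbm":
                self._insert("dram", victim, victim_data)  # swap out to DRAM
            # a block evicted from DRAM is simply dropped


pool = TwoTierPool(hbm_blocks=2, dram_blocks=2)
for name in ("a", "b", "c"):
    pool.put(name, f"kv-{name}")
print(pool.where("a"), pool.where("c"))  # "a" was swapped out, "c" is resident
```

Keeping demoted blocks in DRAM rather than dropping them is what lets a later request with the same prefix avoid recomputation, at the cost of a swap-in.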

Inference Engine

Executes the LLM model layers (prefill or decode)

Model or implementation: Modified vLLM

Novel Architectural Elements

Unified MemPool layer decoupling KV cache lifecycle from individual requests/instances
Global Prompt Tree scheduler for locality-aware routing in disaggregated settings
transfer_with_insert API enabling atomic transfer and indexing of KV cache between prefill and decode nodes
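
To make the transfer-then-index idea concrete, here is a minimal in-process sketch. Only the name `transfer_with_insert` comes from the text above. The `ToyMemPool` class, the `(src, dst, src_addrs, prefixes)` signature, and the use of a lock to keep the copy and the index update together are assumptions for illustration, not the MemServe API.

```python
import threading


class ToyMemPool:
    """In-process stand-in for one instance's pool (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block address -> KV payload
        self.index = {}       # prompt-prefix tuple -> block address
        self._next_addr = 0
        self._lock = threading.Lock()

    def alloc_and_store(self, payload):
        """Allocate a fresh address and store the payload there."""
        with self._lock:
            addr = self._next_addr
            self._next_addr += 1
            self.blocks[addr] = payload
            return addr

    def match(self, tokens):
        """Return the address indexed under the longest matching prefix, or None."""
        with self._lock:
            for end in range(len(tokens), 0, -1):
                addr = self.index.get(tuple(tokens[:end]))
                if addr is not None:
                    return addr
            return None


def transfer_with_insert(src, dst, src_addrs, prefixes):
    """Copy blocks from src to dst and index them at dst in a single step.

    The signature is assumed. All source blocks are read first, so a bad
    address raises KeyError before anything is written to the destination.
    """
    staged = [(tuple(p), src.blocks[a]) for p, a in zip(prefixes, src_addrs)]
    dst_addrs = []
    with dst._lock:  # readers never see a copied but unindexed block
        for prefix, payload in staged:
            addr = dst._next_addr
            dst._next_addr += 1
            dst.blocks[addr] = payload
            dst.index[prefix] = addr
            dst_addrs.append(addr)
    return dst_addrs


prefill, decode = ToyMemPool("prefill"), ToyMemPool("decode")
a = prefill.alloc_and_store("kv[0:4]")
b = prefill.alloc_and_store("kv[4:8]")
addrs = transfer_with_insert(
    prefill, decode, [a, b], [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7, 8]]
)
print(decode.match([1, 2, 3, 4, 5, 6, 7, 8, 9]) == addrs[1])
```

A real transfer would move tensors over NCCL or sockets between processes; the sketch only shows why pairing the copy with the index update avoids a window in which the destination holds data it cannot find.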

Modeling

Base Model: Evaluated with Llama-2-7B, Llama-2-13B, and Llama-3-8B (implied by workload descriptions, though specific model architecture details focus on serving infrastructure)

Compute: Single server with eight H800-80G GPUs used for all tests

Comparison to Prior Work

vs. SGLang: SGLang only supports PD-colocated caching; MemServe enables caching in disaggregated setups.
vs. DistServe: DistServe optimizes disaggregation but lacks context caching; MemServe adds caching to it.
vs. Splitwise: MemServe adds caching and a unified memory pool, whereas Splitwise focuses purely on the scheduling/separation aspect.
vs. InfiniteLLM: MemServe generalizes the data transfer mechanism (MemPool) rather than specific sequence parallelism logic [not cited in paper as direct baseline, but mentioned as related work].

Limitations

Current implementation relies on naive NCCL/socket primitives which may be suboptimal for high-frequency small transfers (addressed via huge pages co-optimization but still a bottleneck).
Global scheduler adds a centralized coordination point which could become a bottleneck at extremely large scales.
Evaluation is limited to a single server with 8 GPUs; multi-node scaling behavior is not explicitly tested.

Reproducibility

MemPool and the scheduler are implemented in 5.6K lines of Python and 1.6K lines of C++. vLLM (v0.2.6 implied by context) is modified with ~600 lines of code. Code availability is marked as 'not yet released' in the paper text, though the size of the vLLM modification is specified.

📊 Experiments & Results

Evaluation Setup

LLM Serving on 8x H800-80G GPUs

Benchmarks:

ShareGPT: real-world chat dataset (mix of short and long prompts)
LooGLE: long-context QA (very long prompts, short generation)

Metrics:

Job Completion Time (JCT)
Time-To-First-Token (TTFT)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline (normalized)	This Paper (normalized)	Δ
Experiments on ShareGPT workload showing the benefits of combining disaggregation with caching.
ShareGPT	JCT improvement	1.0	1.42	+42% improvement (speedup)
ShareGPT	JCT improvement	1.0	1.29	+29% improvement (speedup)
Experiments on LooGLE workload (long context) demonstrate impact on prefill-heavy tasks.
LooGLE	JCT improvement	1.0	1.108	+10.8% improvement (speedup)
LooGLE	JCT improvement	1.0	1.269	+26.9% improvement (speedup)

Experiment Figures

The four design milestones (steps) to achieving full caching with disaggregated inference: (a) PD-Basic, (b) PD-Caching-1, (c) PD-Caching-2, (d) PD-Caching-3.

Main Takeaways

Unified architecture works: MemServe successfully bridges the gap between context caching and disaggregated inference.
Significant gains in JCT: Both techniques individually improve performance, but their combination yields the largest JCT improvement.
Effective for diverse workloads: Improvements are seen in both general chat (ShareGPT) and long-context (LooGLE) scenarios.
Overhead mitigation: Co-optimizing memory layout with huge pages is crucial to reduce the overhead of frequent data transfers in disaggregated setups.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM inference phases: prefill (compute-bound) vs. decode (memory-bound)
Familiarity with KV cache (Key-Value cache) in Transformer models
Knowledge of distributed systems concepts (heterogeneous memory, network primitives)

Key Terms

KV cache: Key-Value cache—intermediate data tensors generated during the attention mechanism of Transformers, stored to avoid recomputation in subsequent steps.
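
A short calculation shows why KV cache placement and transfer matter. The sketch below uses the standard per-token size formula (one key and one value vector per layer) with the published Llama-2-7B shape; the helper name `kv_bytes_per_token` and the 4096-token context are chosen for this example.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Llama-2-7B: 32 layers, 32 KV heads, head dimension 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(32, 32, 128)
print(per_token)                 # 524288 bytes, i.e. 0.5 MiB per token
print(per_token * 4096 / 2**30)  # 2.0 GiB for a 4096-token context
```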

Disaggregated Inference: Splitting the LLM inference process into separate instances for the prefill phase and the decode phase to optimize for their distinct hardware requirements.

Context Caching: Storing and reusing the KV cache for requests that share the same prompt prefix (e.g., system prompts or documents) to speed up the prefill phase.

PD-colocated: Prefill-Decode colocated—standard inference where both phases happen on the same GPU instance.

PD-disaggregated: Prefill-Decode disaggregated—inference where prefill and decode phases occur on separate, specialized GPU instances.

MemPool: The core component of MemServe; a distributed memory management layer handling allocation, indexing, and transfer of KV cache across instances.

JCT: Job Completion Time—the total time taken to finish processing a batch or stream of requests.

TTFT: Time-To-First-Token—the latency from request arrival to the generation of the first output token.

Radix Tree: A data structure used to index prompt tokens to cached KV blocks, allowing efficient prefix matching for cache reuse.
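
Prefix matching over cached blocks can be sketched as follows. For brevity this uses a plain trie keyed by fixed-size token blocks rather than a compressed radix tree; the `BlockTrie` class and the block size of 4 are assumptions for illustration only.

```python
BLOCK = 4  # tokens per KV block; an arbitrary value for this sketch


class BlockTrie:
    """Maps block-aligned prompt prefixes to KV block ids (hypothetical index)."""

    def __init__(self):
        self.root = {}  # tuple of BLOCK tokens -> (kv_block_id, child dict)

    def insert(self, tokens, block_ids):
        """Index each full block of `tokens` under its KV block id."""
        node = self.root
        full = len(tokens) // BLOCK
        for i, block_id in zip(range(full), block_ids):
            key = tuple(tokens[i * BLOCK:(i + 1) * BLOCK])
            entry = node.setdefault(key, (block_id, {}))
            node = entry[1]

    def match_prefix(self, tokens):
        """Return KV block ids covering the longest cached prefix of `tokens`."""
        node, hits = self.root, []
        for i in range(len(tokens) // BLOCK):
            entry = node.get(tuple(tokens[i * BLOCK:(i + 1) * BLOCK]))
            if entry is None:
                break
            hits.append(entry[0])
            node = entry[1]
        return hits


index = BlockTrie()
index.insert(list(range(8)), [10, 11])          # two cached blocks
hits = index.match_prefix(list(range(8)) + [99])
print(hits, len(hits) * BLOCK)                  # matched block ids, reusable tokens
```

Only full blocks are indexed here, so a prompt shorter than one block never hits the cache; the prefill phase then needs to compute only the tokens past the matched prefix.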