KV cache: Key-Value cache. The key and value tensors that a Transformer's attention layers compute for each token, stored so that later decoding steps can reuse them instead of recomputing them.
Disaggregated Inference: Splitting LLM inference across separate instances for the prefill phase and the decode phase, so that each can be optimized for its distinct hardware requirements.
Context Caching: Storing and reusing the KV cache for requests that share the same prompt prefix (e.g., system prompts or documents) to speed up the prefill phase.
PD-colocated: Prefill-Decode colocated—standard inference where both phases happen on the same GPU instance.
PD-disaggregated: Prefill-Decode disaggregated—inference where prefill and decode phases occur on separate, specialized GPU instances.
MemPool: The core component of MemServe; a distributed memory management layer handling allocation, indexing, and transfer of KV cache across instances.
JCT: Job Completion Time—the total time taken to finish processing a batch or stream of requests.
TTFT: Time-To-First-Token—the latency from request arrival to the generation of the first output token.
Radix Tree: A compressed prefix tree (trie) that maps prompt token sequences to cached KV blocks, so the longest cached prefix of a new prompt can be found efficiently and reused.
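
The prefix matching described under Context Caching and Radix Tree can be illustrated with a small sketch. This is not MemPool's implementation: it uses a plain, uncompressed trie keyed by fixed-size token blocks rather than a true radix tree, and the names `PrefixIndex`, `Node`, and `BLOCK_SIZE`, as well as the block size itself, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens covered by one KV block


@dataclass
class Node:
    """One trie node; it stands for one cached KV block."""
    block_id: Optional[int] = None
    children: Dict[Tuple[int, ...], "Node"] = field(default_factory=dict)


class PrefixIndex:
    """Maps token prefixes to KV block ids at block granularity."""

    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.root = Node()

    def _chunks(self, tokens: Sequence[int]) -> List[Tuple[int, ...]]:
        # Only full blocks are indexed; a trailing partial block is ignored.
        bs = self.block_size
        n_full = len(tokens) // bs
        return [tuple(tokens[i * bs:(i + 1) * bs]) for i in range(n_full)]

    def insert(self, tokens: Sequence[int], block_ids: Sequence[int]) -> None:
        """Record that block_ids[i] holds the KV cache of the i-th full block."""
        chunks = self._chunks(tokens)
        if len(block_ids) < len(chunks):
            raise ValueError("need one block id per full token block")
        node = self.root
        for chunk, block_id in zip(chunks, block_ids):
            child = node.children.get(chunk)
            if child is None:
                # First writer wins: an already indexed block is left untouched.
                child = Node(block_id=block_id)
                node.children[chunk] = child
            node = child

    def match(self, tokens: Sequence[int]) -> List[int]:
        """Return the block ids covering the longest cached prefix of tokens."""
        node = self.root
        matched: List[int] = []
        for chunk in self._chunks(tokens):
            child = node.children.get(chunk)
            if child is None:
                break
            matched.append(child.block_id)
            node = child
        return matched


index = PrefixIndex(block_size=2)
index.insert([1, 2, 3, 4, 5], block_ids=[10, 11])  # token 5 is a partial block
shared = index.match([1, 2, 3, 4, 9, 9])           # shares two full blocks
partial = index.match([1, 2, 7, 8])                # shares only the first block
```

A request whose prompt matches cached blocks only needs prefill for the tokens after the matched prefix, which is how context caching shortens the prefill phase. A production radix tree would additionally merge single-child chains into one edge and handle eviction, both of which are omitted here.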