Efficient Memory Management for Large Language Model Serving with PagedAttention

📝 Paper Summary

LLM Serving Systems, KV Cache Optimization, Memory Management

vLLM introduces PagedAttention to manage KV cache using virtual memory concepts, enabling non-contiguous storage and flexible memory sharing to significantly increase serving throughput.

Core Problem

Existing LLM serving systems require contiguous memory allocation for the Key-Value (KV) cache, leading to severe fragmentation and waste because request lengths are unknown and dynamic.

Why it matters:

KV cache is the dominant memory consumer in LLM serving; inefficient management limits batch size and throughput
Pre-allocating memory for the maximum sequence length wastes 60-80% of KV cache memory through reservation and internal and external fragmentation
Contiguous requirements prevent memory sharing between parallel samples or beam search candidates

Concrete Example: A request to the 13B OPT model might reserve 1.6 GB for a potential 2048 tokens. If it only generates 50 tokens, the vast majority of that reserved contiguous block is wasted and cannot be used by other requests.
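
The 1.6 GB figure can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes the OPT-13B shape (hidden size 5120, 40 layers) and FP16 storage, none of which is stated in this summary; the function name is illustrative.

```python
def kv_bytes_per_token(hidden_size: int, num_layers: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one token: one key and one value vector per layer."""
    return 2 * hidden_size * num_layers * bytes_per_value


# Assumed OPT-13B shape: hidden size 5120, 40 layers, FP16 (2 bytes per value).
per_token = kv_bytes_per_token(hidden_size=5120, num_layers=40)
reserved = per_token * 2048  # slot reserved for the maximum sequence length
used = per_token * 50        # hypothetical request that ends after 50 tokens

print(per_token // 1024, "KiB per token")
print(reserved // 2**20, "MiB reserved")
print(f"{1 - used / reserved:.1%} of the reservation unused")
```

Under these assumptions each token costs 800 KiB, the full reservation is 1600 MiB (the "1.6 GB" above), and a 50-token request leaves about 97.6% of it unused.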

Key Novelty

PagedAttention & vLLM

Treats KV cache like operating system virtual memory: divides cache into fixed-size blocks that can be stored in non-contiguous physical memory
Uses a block table to map logical token sequences to physical GPU memory blocks, allowing dynamic allocation on demand
Enables zero-copy memory sharing for advanced decoding (beam search, parallel sampling) via per-block reference counting, with a block copied only when a sequence writes to a shared block (copy-on-write)
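
The three ideas above (fixed-size blocks, a per-sequence block table, and reference-counted sharing) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not vLLM's implementation: the names `BlockManager`, `Sequence`, `fork`, and `append_token` are hypothetical, and the block size of 4 is chosen only to keep the example small.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block; small illustrative value (assumption)


@dataclass
class Sequence:
    block_table: list = field(default_factory=list)  # logical block index -> physical block id
    num_tokens: int = 0


class BlockManager:
    """Hypothetical allocator: hands out physical blocks and tracks reference counts."""

    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))
        self.ref_count = [0] * num_physical_blocks
        self.copies = []  # (src, dst) block copies a GPU worker would have to perform

    def _allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def append_token(self, seq: Sequence) -> None:
        if seq.num_tokens % BLOCK_SIZE == 0:
            # Last block is full (or none exists yet): allocate a new one on demand.
            seq.block_table.append(self._allocate())
        else:
            last = seq.block_table[-1]
            if self.ref_count[last] > 1:
                # The block is shared with another sequence: copy before writing.
                new = self._allocate()
                self.ref_count[last] -= 1
                self.copies.append((last, new))
                seq.block_table[-1] = new
        seq.num_tokens += 1

    def fork(self, parent: Sequence) -> Sequence:
        # Share every block of the parent; only reference counts change.
        for block in parent.block_table:
            self.ref_count[block] += 1
        return Sequence(list(parent.block_table), parent.num_tokens)


mgr = BlockManager(num_physical_blocks=8)
prompt = Sequence()
for _ in range(6):             # a 6-token prompt occupies two blocks
    mgr.append_token(prompt)
sample = mgr.fork(prompt)      # e.g. a second parallel sample of the same prompt
mgr.append_token(sample)       # writing into the shared last block triggers a copy
print(prompt.block_table, sample.block_table, mgr.copies)
```

After the fork, both sequences still point at the same first physical block; only the partially filled last block is duplicated, and only once one of the sequences writes to it.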

Architecture

System architecture of vLLM showing the interaction between the centralized Scheduler, the KV Cache Manager, and distributed GPU Workers.

Evaluation Highlights

Improves serving throughput by 2-4x compared to Orca and FasterTransformer at the same latency level
Reduces KV cache memory waste to near-zero (under 4%, mostly internal fragmentation in each sequence's last block) vs 60-80% in existing systems
Enables significantly larger batch sizes: ~132 requests/batch for vLLM vs ~73 for Orca (Oracle) on the Alpaca dataset with OPT-13B

Breakthrough Assessment

9/10

Fundamental architectural shift in how LLM memory is managed. PagedAttention has become the industry standard for high-performance inference (adopted by TGI, TensorRT-LLM, etc.).

⚙️ Technical Details

Problem Definition

Setting: High-throughput autoregressive generation in large language model (LLM) serving

Inputs: Batched user prompts (sequences of tokens)

Outputs: Generated token sequences

Pipeline Flow

Centralized Scheduler (Group: Control)
KV Cache Manager (Group: Memory Management)
Block Allocator (Group: Memory Management)
GPU Workers / Model Executors (Group: Execution)

System Modules

Centralized Scheduler

Orchestrates request execution, prioritizes batches, and broadcasts instructions to workers

Model or implementation: N/A (Control Logic)

KV Cache Manager (Memory Management)

Manages PagedAttention block tables and tracks reference counts for physical blocks

Model or implementation: N/A (Logic)

Block Allocator (Memory Management)

Allocates physical memory chunks on GPU (and CPU for swapping)

Model or implementation: N/A

Model Executor

Executes the LLM inference using custom PagedAttention kernels

Model or implementation: Transformer (OPT, LLaMA, GPT)

Novel Architectural Elements

Separation of logical KV blocks (contiguous in request view) and physical KV blocks (non-contiguous in memory)
Block Table mechanism injected into the Attention layer execution
Preemptive scheduling that evicts KV blocks to CPU memory (swapping) at block granularity when GPU memory runs out
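
To make the logical/physical separation concrete, the sketch below computes single-query attention by reading keys and values through a block table and checks that the result matches attention over a contiguous cache. It is a pure-Python illustration with hypothetical names and toy data; it gathers the blocks into lists for clarity, whereas a fused GPU kernel would read each block in place.

```python
import math

BLOCK_SIZE = 2  # tokens per block; illustrative value (assumption)


def attention(query, keys, values):
    """Single-query softmax attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def paged_attention(query, key_pool, value_pool, block_table, num_tokens):
    """Read K/V one block at a time via the block table, then attend."""
    keys, values = [], []
    for logical, physical in enumerate(block_table):
        # The last block may be only partially filled.
        valid = min(BLOCK_SIZE, num_tokens - logical * BLOCK_SIZE)
        keys.extend(key_pool[physical][:valid])
        values.extend(value_pool[physical][:valid])
    return attention(query, keys, values)


# Toy data: three cached tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
empty = [[0.0, 0.0], [0.0, 0.0]]

# Physical pool of four blocks; the sequence lives in blocks 3 and 1, in that order.
key_pool = [empty, [keys[2], [0.0, 0.0]], empty, [keys[0], keys[1]]]
value_pool = [empty, [values[2], [0.0, 0.0]], empty, [values[0], values[1]]]
block_table = [3, 1]

query = [0.5, -0.25]
paged = paged_attention(query, key_pool, value_pool, block_table, num_tokens=3)
contiguous = attention(query, keys, values)
print(paged, contiguous)
```

The two results agree because the block table only changes where the data is stored, not the order in which tokens are attended to.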

Modeling

Base Model: OPT (13B, 66B, 175B), LLaMA (13B)

Compute: Experiments run on NVIDIA A100 GPUs (40GB and 80GB variants). No training is performed (inference-only system).

Comparison to Prior Work

vs. FasterTransformer: vLLM uses non-contiguous PagedAttention vs. FT's contiguous buffer requirement
vs. Orca: vLLM allocates memory on-demand per block vs. Orca's reservation of max_length chunks
vs. FlexGen [not cited in paper]: FlexGen focuses on offloading for high-throughput batch processing on limited hardware, whereas vLLM focuses on low-latency serving with paged memory

Limitations

PagedAttention kernel introduces slight overhead compared to highly optimized contiguous kernels (though outweighed by batching gains)
Requires custom CUDA kernels, making it harder to port to new hardware architectures quickly
Benefit is less pronounced for workloads with known, fixed, short sequence lengths where fragmentation is less of an issue

Reproducibility

Code: https://github.com/vllm-project/vllm

The code is publicly available at the repository above. Workloads are synthesized from the ShareGPT and Alpaca datasets, and block sizes and memory configurations are reported in detail.

📊 Experiments & Results

Evaluation Setup

Serving OPT and LLaMA models on NVIDIA A100 GPUs using synthetic workloads derived from real datasets

Benchmarks:

ShareGPT (Variable length conversation traces)
Alpaca (Instruction following traces)

Metrics:

Throughput (requests per second)
Normalized Latency (seconds per token)
Memory Waste (%)
Statistical methodology: Not explicitly reported in the paper
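
As a sketch of how the normalized latency metric above can be computed, the snippet below averages each request's end-to-end latency divided by its output length. That per-request definition is an assumption based on the "seconds per token" unit; the function name and the sample measurements are hypothetical.

```python
def normalized_latency(requests):
    """Mean over requests of end-to-end latency divided by output length (s/token)."""
    per_request = [latency_s / output_tokens for latency_s, output_tokens in requests]
    return sum(per_request) / len(per_request)


# Hypothetical measurements: (end-to-end latency in seconds, generated tokens).
measurements = [(4.0, 100), (1.5, 30), (9.0, 300)]
result = normalized_latency(measurements)
print(f"{result:.3f} s/token")
```

Dividing per request before averaging keeps long generations from dominating the metric, which matters when output lengths vary as widely as they do in conversation traces.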

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Memory efficiency analysis shows vLLM drastically reduces wasted space compared to baselines.
Profiling on A100	KV Cache Memory Waste (%)	79.6	3.8	-75.8
Batch size capabilities demonstrate how memory efficiency translates to higher concurrency.
ShareGPT (OPT-13B)	Average Batched Requests	13.62	30.42	+16.80
Alpaca (OPT-13B)	Average Batched Requests	72.75	132.44	+59.69
Throughput experiments showing performance gains under latency constraints.
Alpaca (OPT-175B, 8 GPUs)	Request Rate (req/s) before latency spike	10	18	+8

Experiment Figures

Normalized latency vs. Request rate curves for OPT models (13B, 66B, 175B) on ShareGPT and Alpaca datasets.

Bar chart comparing average number of batched requests for Orca variants vs vLLM.

Main Takeaways

vLLM achieves near-zero memory waste (mostly small internal fragmentation within the last block of each sequence), whereas existing systems waste >60% due to reservation and fragmentation.
The memory efficiency directly translates to 2-4x higher throughput by allowing significantly larger batch sizes on the same hardware.
Benefits are most pronounced for larger models (OPT-175B) and complex decoding scenarios (Beam Search) where memory pressure is highest.
Parallel sampling and beam search gain additional efficiency through vLLM's block-level memory sharing (Copy-on-Write).
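
The first takeaway follows from simple arithmetic: with block-level allocation a sequence can waste at most one block minus one token, while a max-length reservation wastes everything the request does not use. The sketch below compares the two for a few hypothetical sequence lengths; the block size of 16 and the 2048-token limit are assumptions made for illustration.

```python
def wasted_slots_paged(num_tokens: int, block_size: int) -> int:
    """Unused token slots when memory is allocated one block at a time."""
    num_blocks = (num_tokens + block_size - 1) // block_size  # ceiling division
    return num_blocks * block_size - num_tokens


def wasted_slots_reserved(num_tokens: int, max_len: int) -> int:
    """Unused token slots when max_len slots are reserved up front."""
    return max_len - num_tokens


lengths = [50, 300, 1000]  # hypothetical final sequence lengths
block_size, max_len = 16, 2048

paged_waste = sum(wasted_slots_paged(n, block_size) for n in lengths)
reserved_waste = sum(wasted_slots_reserved(n, max_len) for n in lengths)
paged_total = paged_waste + sum(lengths)
reserved_total = max_len * len(lengths)

print(f"paged:    {paged_waste / paged_total:.1%} of allocated slots unused")
print(f"reserved: {reserved_waste / reserved_total:.1%} of allocated slots unused")
```

For these lengths the paged scheme leaves under 2% of its allocated slots unused, against roughly 78% for the reservation scheme, which is in line with the orders of magnitude reported above.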

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention mechanism)
Operating System Virtual Memory (Paging, Virtual-to-Physical mapping)
GPU Architecture (Memory hierarchy, Coalesced access)

Key Terms

KV cache: Cached Key and Value tensors in the Transformer attention mechanism, stored to avoid recomputing previous tokens' states during autoregressive generation

PagedAttention: An attention algorithm that computes attention over KV cache stored in non-contiguous memory blocks, akin to paging in an operating system

Internal fragmentation: Memory allocated to a process (request) but not used because the allocation unit (chunk) is larger than required

External fragmentation: Free memory exists in small non-contiguous chunks but cannot be used because the allocator requires a large contiguous block

Block Table: A data structure mapping logical blocks (consecutive tokens) to physical blocks (non-contiguous GPU memory addresses)

Copy-on-write: Optimization strategy where multiple consumers share the same data until one modifies it, at which point a separate copy is created

Beam search: A decoding algorithm that explores multiple likely output sequences (beams) simultaneously