vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

📝 Paper Summary

LLM Inference Optimization · GPU Memory Management

vAttention uses CUDA Virtual Memory Management APIs to retain contiguous virtual memory for LLM KV caches while allocating physical memory on demand, enabling the use of unmodified high-performance attention kernels.

Core Problem

PagedAttention, the standard for dynamic KV cache management, fragments virtual memory, forcing complex kernel rewrites and incurring runtime overheads for address translation and metadata management.

Why it matters:

Rewriting kernels for PagedAttention (PA) is difficult, causing production systems to lag behind state-of-the-art research (e.g., FlashAttention-3 did not initially support PA).
PA introduces overhead in the critical path: vLLM's paged kernels are up to 2.8x slower than standard FlashAttention-2 kernels.
Managing block tables adds CPU overhead, contributing up to 30% of decode-iteration latency in some configurations.

Concrete Example: When a request grows dynamically, PagedAttention allocates non-contiguous memory blocks. To compute attention, the kernel must manually traverse a software-managed 'Block Table' to find data, unlike standard kernels that simply access a contiguous array. This software translation slows down execution and complicates kernel code.
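The difference in addressing can be illustrated with a toy Python sketch. It is not vLLM's kernel code: `BLOCK_SIZE`, `block_table`, and both lookup helpers are hypothetical names chosen for illustration, and KV entries are represented by strings.

```python
# Toy model of the two addressing schemes; all names here are illustrative.
BLOCK_SIZE = 4  # tokens per KV block (assumed value for the example)

def paged_lookup(physical_blocks, block_table, token_idx):
    """PagedAttention-style access: translate the token index in software."""
    logical_block = token_idx // BLOCK_SIZE      # which logical block holds the token
    offset = token_idx % BLOCK_SIZE              # position inside that block
    physical_block = block_table[logical_block]  # extra indirection via the block table
    return physical_blocks[physical_block][offset]

def contiguous_lookup(kv_buffer, token_idx):
    """Contiguous access: plain indexing, as a standard kernel expects."""
    return kv_buffer[token_idx]

# Ten tokens' worth of KV entries, represented by strings.
tokens = [f"kv{i}" for i in range(10)]

# Paged layout: the request's three blocks live in scattered physical slots.
block_table = [5, 0, 3]
physical_blocks = [[None] * BLOCK_SIZE for _ in range(6)]
for i, entry in enumerate(tokens):
    physical_blocks[block_table[i // BLOCK_SIZE]][i % BLOCK_SIZE] = entry

# Both schemes return the same data; only the paged one needs the table.
paged = [paged_lookup(physical_blocks, block_table, i) for i in range(10)]
contiguous = [contiguous_lookup(tokens, i) for i in range(10)]
```

Both lists hold the same entries. The paged path pays for an extra division, modulo, and table read on every access, and the kernel has to be written to know about the table.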

Key Novelty

Decoupled Virtual/Physical Allocation (vAttention)

Reserves a large contiguous range of virtual memory for the KV cache upfront but delays physical memory allocation until needed, mimicking OS-level demand paging on the GPU.
Leverages low-level CUDA VMM (Virtual Memory Management) APIs to map physical pages to the pre-reserved virtual addresses on the fly, keeping the buffer contiguous to the application.
Hides the high latency of driver-level memory allocation (CUDA VMM calls) by overlapping allocation with computation and opportunistically pre-allocating pages for future tokens.
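The reserve-then-map idea can be sketched as a pure Python simulation. No GPU calls are made, and `VirtualKVBuffer` and its methods are hypothetical names, not the paper's implementation. The comments note which CUDA driver calls each step would roughly correspond to.

```python
# Simulation only: no GPU calls are made. Class and method names are hypothetical.
PAGE_SIZE = 64 * 1024  # bytes per physical page (64KB, one of the sizes discussed)

class VirtualKVBuffer:
    def __init__(self, max_bytes):
        # Reserve the whole virtual range up front (analogous to cuMemAddressReserve).
        self.num_virtual_pages = -(-max_bytes // PAGE_SIZE)  # ceiling division
        self.mapped_pages = 0  # physical pages currently backing the range

    def ensure_backed(self, used_bytes):
        """Map physical pages until the first `used_bytes` bytes are backed.

        Returns the number of pages newly mapped (each mapping is analogous
        to cuMemCreate + cuMemMap + cuMemSetAccess).
        """
        needed = -(-used_bytes // PAGE_SIZE)
        if needed > self.num_virtual_pages:
            raise MemoryError("request exceeds the reserved virtual range")
        newly_mapped = max(0, needed - self.mapped_pages)
        self.mapped_pages += newly_mapped
        return newly_mapped

    def physical_bytes(self):
        return self.mapped_pages * PAGE_SIZE

buf = VirtualKVBuffer(max_bytes=64 * 1024 * 1024)  # 64MB of virtual space reserved
first = buf.ensure_backed(100 * 1024)              # 100KB in use: 2 pages mapped
second = buf.ensure_backed(120 * 1024)             # still within 2 pages: 0 new pages
third = buf.ensure_backed(130 * 1024)              # crosses into a third page: 1 new page
```

The application always sees one contiguous range of 1024 virtual pages, while only three pages of physical memory are committed at the end of the example.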

Architecture

Conceptual comparison of memory layouts between PagedAttention and vAttention

Evaluation Highlights

Improves end-to-end serving throughput by up to 1.23x compared to PagedAttention-based FlashInfer on Llama-3-8B.
Outperforms vLLM's decode throughput by up to 1.99x when using vAttention with the standard FlashAttention-2 kernel.
Enables immediate support for FlashAttention-3 (FA3) without code changes, yielding 1.26x-1.5x higher throughput over PagedAttention-based FlashAttention-2 (since FA3 lacks native PA support).

Breakthrough Assessment

8/10

Significant systems contribution that solves a major fragmentation pain point without the software complexity of PagedAttention. Restores compatibility with standard kernels, likely simplifying future LLM serving stacks.

⚙️ Technical Details

Problem Definition

Setting: High-throughput Large Language Model (LLM) serving on GPUs with limited memory

Inputs: Batched inference requests with unknown output lengths

Outputs: Generated token sequences

Pipeline Flow

Request Scheduler (Batches requests)
vAttention Memory Manager (Manages Virtual/Physical mappings)
Model Executor (Runs unmodified attention kernels)

System Modules

vAttention Memory Manager

Intercepts memory needs and maps physical pages to contiguous virtual addresses using CUDA VMM

Model Executor

Executes the LLM layers and attention mechanism

Model or implementation: Supports Llama-3-8B, Yi-6B, Yi-34B

Novel Architectural Elements

Integration of CUDA VMM APIs directly into the serving loop to decouple virtual/physical allocation
Custom GPU driver modification to enable 64KB page granularity on NVIDIA GPUs (standard support is 2MB huge pages)
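A rough calculation suggests why page granularity matters. The sketch below assumes one virtual buffer per layer for keys and another for values, fp16 storage, no tensor parallelism, and Llama-3-8B's published shape of 8 KV heads with head dimension 128. These assumptions are for illustration and are not taken from this summary.

```python
# Back-of-the-envelope arithmetic; the model shape and buffer layout are assumptions.
NUM_KV_HEADS = 8      # Llama-3-8B uses grouped-query attention with 8 KV heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # fp16

def tokens_per_page(page_bytes):
    """Tokens whose keys (or values) for a single layer fit in one page."""
    bytes_per_token = NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return page_bytes // bytes_per_token

small = tokens_per_page(64 * 1024)        # 64KB page
huge = tokens_per_page(2 * 1024 * 1024)   # 2MB page
print(f"64KB page: {small} tokens, 2MB page: {huge} tokens")
```

Under these assumptions a 2MB page commits memory for 1024 tokens at a time in each buffer, whereas a 64KB page commits memory for 32, so the unused tail of the last page is much smaller with the finer granularity.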

Modeling

Base Model: Yi-6B, Llama-3-8B, Yi-34B

Comparison to Prior Work

vs. vLLM: vAttention maintains virtual contiguity, removing the need for Block Tables and custom kernels.
vs. FasterTransformer: vAttention allocates physical memory dynamically, avoiding the severe internal fragmentation caused by static pre-allocation.
vs. FlashAttention-2 (Native): vAttention enables FA2 to run on dynamically growing caches without pre-allocation (the paper presents FA2 as a beneficiary of vAttention, not as a competing approach).

Limitations

Requires OS/Driver level support (custom driver needed for 64KB pages on current NVIDIA stack)
High latency of CUDA VMM calls necessitates additional latency-hiding techniques (lookahead allocation, threading)
Evaluation limited to NVIDIA A100 GPUs; other architectures not tested

Reproducibility

Code availability is not explicitly provided in the text. The paper mentions modifying the open-source CUDA unified virtual memory driver but does not provide a repository URL.

📊 Experiments & Results

Evaluation Setup

LLM serving throughput measurement on real hardware

Benchmarks:

Yi-6B (Text Generation)
Llama-3-8B (Text Generation)
Yi-34B (Text Generation)

Metrics:

End-to-end serving throughput (requests/sec or tokens/sec)
Decode throughput
Kernel execution latency

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison	Benchmark	Metric	Baseline	This Paper	Δ
Decode throughput vs. vLLM's PagedAttention implementation	Yi-6B	Normalized Throughput	1.0	1.99	+0.99
End-to-end serving throughput vs. PagedAttention-based baselines	Llama-3-8B	Normalized Throughput	1.0	1.23	+0.23
End-to-end serving throughput vs. PagedAttention-based baselines	Llama-3-8B	Normalized Throughput	1.0	1.18	+0.18
Portability: enabling FlashAttention-3 (FA3), which lacks native PagedAttention support	Llama-3-8B	Normalized Throughput	1.0	1.5	+0.5

Experiment Figures

Latency overhead of PagedAttention kernels compared to non-paged kernels in FlashAttention-2 and FlashInfer libraries

Impact of block size on vLLM's paged decode kernel performance

Main Takeaways

vAttention eliminates the performance penalty of PagedAttention, which is caused by non-contiguous memory access and metadata management overhead.
The approach is portable: it instantly supports new kernels (like FlashAttention-3) that are released without PagedAttention support, preventing the 'lag' seen in production systems.
Using 64KB pages (via custom driver) effectively mitigates fragmentation without causing TLB thrashing or performance degradation compared to huge pages.
Optimizations like lookahead allocation are critical to mask the high latency of CUDA VMM APIs.
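The effect of lookahead allocation can be illustrated with a small deterministic simulation. It models only when pages get mapped, not real latency. `TOKENS_PER_PAGE`, `run_decode`, and the single-request setup are assumptions made for the example and are not the paper's implementation.

```python
# Deterministic toy simulation; it models *when* pages get mapped, not real latency.
TOKENS_PER_PAGE = 32  # assumed page capacity for one KV buffer

def pages_needed(num_tokens):
    return -(-num_tokens // TOKENS_PER_PAGE)  # ceiling division

def run_decode(prompt_len, steps, lookahead):
    """Return how many page mappings land on the critical path."""
    mapped = pages_needed(prompt_len)   # pages mapped during prefill
    context = prompt_len
    critical_path_maps = 0
    for _ in range(steps):
        context += 1                    # this iteration writes one new token
        need = pages_needed(context)
        if need > mapped:               # page missing: must map before the kernel runs
            critical_path_maps += need - mapped
            mapped = need
        if lookahead:
            # While the GPU computes this iteration, map whatever the *next*
            # iteration will need, so the mapping cost overlaps with compute.
            ahead = pages_needed(context + 1)
            mapped = max(mapped, ahead)
    return critical_path_maps

without = run_decode(prompt_len=30, steps=100, lookahead=False)
with_lookahead = run_decode(prompt_len=30, steps=100, lookahead=True)
```

In this toy setup the request crosses four page boundaries. Without lookahead each crossing stalls an iteration. With lookahead every page is already mapped by the time it is needed, so no mapping lands on the critical path.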

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM inference (prefill vs. decode phases)
Knowledge of GPU memory management (virtual vs. physical memory)
Familiarity with Attention mechanisms (Query, Key, Value matrices)

Key Terms

KV cache: Stored Key and Value vectors from previous tokens in an LLM generation sequence, reused to avoid recomputation

PagedAttention: A memory management technique that splits the KV cache into non-contiguous blocks to reduce fragmentation, requiring specialized attention kernels

CUDA VMM: CUDA Virtual Memory Management—Low-level APIs allowing explicit control over virtual address reservation and physical memory mapping on NVIDIA GPUs

TLB: Translation Lookaside Buffer. A hardware cache of recent virtual-to-physical address translations, which avoids a page-table walk on every memory access

FlashAttention: A highly optimized, IO-aware exact attention algorithm that typically expects contiguous memory inputs

FlashInfer: A kernel library for LLM serving offering high-performance attention implementations

Internal fragmentation: Wasted memory space within allocated blocks (e.g., reserving max context length when only a fraction is used)
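A short worked example makes the internal fragmentation definition concrete. It compares reserving the full context length up front with backing only the pages actually used. All sizes are assumed values chosen for illustration.

```python
# Illustrative arithmetic; all sizes below are assumed values for the example.
BYTES_PER_TOKEN = 2048       # KV bytes per token in one buffer (assumed)
MAX_CONTEXT = 8192           # tokens reserved per request under static allocation
PAGE_BYTES = 64 * 1024       # page size for on-demand allocation

def static_waste(actual_tokens):
    """Bytes reserved but unused when the full context is pre-allocated."""
    return (MAX_CONTEXT - actual_tokens) * BYTES_PER_TOKEN

def paged_waste(actual_tokens):
    """Bytes unused in the last, partially filled page under on-demand mapping."""
    used = actual_tokens * BYTES_PER_TOKEN
    pages = -(-used // PAGE_BYTES)  # ceiling division
    return pages * PAGE_BYTES - used

request_tokens = 1000
wasted_static = static_waste(request_tokens)  # most of the reservation sits idle
wasted_paged = paged_waste(request_tokens)    # at most one page minus one token
```

For a 1000-token request under these assumptions, static reservation leaves about 14MB unused in this buffer, while on-demand mapping leaves 48KB.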