Paged Attention and vLLM

This 2023 paper, Efficient Memory Management for Large Language Model Serving with PagedAttention, addresses the challenge of efficiently serving large language models (LLMs) in high-throughput scenarios.

LLMs have become increasingly popular for various applications, such as programming assistants and virtual assistants.

Serving these models is computationally expensive, requiring large numbers of hardware accelerators such as GPUs. To reduce the cost per request, the throughput of LLM serving systems must be increased.

How can this be achieved?

The core of LLMs is an autoregressive Transformer model that generates tokens sequentially based on the input prompt and the previously generated tokens.

This process is repeated for each request until a termination token is generated. To improve throughput, multiple requests can be batched together.

However, the memory space for each request must be efficiently managed to process a large number of requests in a batch.

This paper identifies that existing LLM serving systems struggle to manage the key-value cache (KV cache) memory efficiently.

The KV cache stores the dynamic states (attention keys and values) associated with each request and represents the context from earlier tokens used to generate new output tokens.

The KV cache memory grows and shrinks dynamically as the model generates new tokens, and its lifetime and length are not known in advance.

KV Cache Problem

In Transformer-based models, the attention mechanism uses a KV cache to store the keys and values associated with the tokens generated so far in a sequence.

This cache represents the context used to generate the next token. The KV cache grows dynamically as more tokens are generated, and its size is not known in advance.

Existing LLM serving systems allocate contiguous memory for the KV cache based on the maximum sequence length (e.g., 2048 tokens).

This approach leads to two main problems: inefficient memory use and poor batching behaviour.

Memory fragmentation

Internal fragmentation

When the actual sequence length is shorter than the maximum length, a significant portion of the allocated memory remains unused, resulting in wasted memory.

External fragmentation

When requests have different maximum sequence lengths, the pre-allocated memory chunks vary in size, leading to inefficient memory utilisation.

Lack of memory sharing

Since the KV cache for each sequence is stored in separate contiguous memory spaces, it is not possible to share memory across sequences, even when they belong to the same request or have overlapping context.

The authors' profiling results show that only 20-40% of the KV cache memory is effectively used to store the actual token states in existing systems, highlighting the inefficiencies in memory management.
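To make the scale of this waste concrete, here is a back-of-the-envelope calculation in Python. The model dimensions are illustrative, not taken from the paper, but the pattern is general: reserving memory for the maximum sequence length wastes whatever the request does not use.

```python
# Rough KV cache sizing for one request (illustrative model dimensions).
n_layers = 40          # transformer layers
n_heads = 40           # attention heads
head_dim = 128         # dimension per head
bytes_per_elem = 2     # FP16

# Keys + values for one token across all layers.
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

max_seq_len = 2048     # memory reserved up front for the request
actual_len = 500       # tokens actually generated for this request

reserved = max_seq_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"reserved: {reserved / 2**20:.0f} MiB, used: {used / 2**20:.0f} MiB, "
      f"wasted: {100 * (1 - used / reserved):.0f}%")
```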

Typical Batching Technique Issues

Queueing delays

Requests may arrive at different times, and naïve batching strategies can lead to significant queueing delays. If earlier requests wait for later ones to form a batch, or if incoming requests are delayed until earlier ones finish, it results in increased latency.

Inefficient padding

Requests may have different input and output lengths. Straightforward batching techniques pad the inputs and outputs to equalise their lengths, wasting GPU computation and memory.

To address these issues, fine-grained batching mechanisms like cellular batching and iteration-level scheduling have been proposed.

These techniques operate at the iteration level instead of the request level.

After each iteration, completed requests are removed from the batch, and new ones are added. This allows new requests to be processed after waiting for a single iteration rather than waiting for the entire batch to complete.
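The loop below sketches iteration-level scheduling in Python. The `engine` and request objects are hypothetical stand-ins, not vLLM's actual API; the point is simply that admission and retirement happen on every iteration rather than once per batch.

```python
from collections import deque

def serve(engine, request_queue: deque, max_batch_size: int = 32):
    """Iteration-level scheduling sketch: admit and retire requests on every
    decoding iteration instead of once per batch (engine/request API is hypothetical)."""
    running = []
    while running or request_queue:
        # Admit newly arrived requests at the start of each iteration.
        while request_queue and len(running) < max_batch_size:
            running.append(request_queue.popleft())

        # One decoding iteration: each running request generates one token.
        for req in running:
            req.append_token(engine.generate_next_token(req))

        # Retire finished requests immediately; their slots free up next iteration.
        running = [r for r in running if not r.is_finished()]
```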

Paged Attention

To address these limitations, the authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems.

PagedAttention divides the KV cache into fixed-size blocks (similar to memory pages), which can be allocated on-demand and need not be stored in contiguous memory.

This approach reduces internal fragmentation by using smaller blocks and eliminates external fragmentation since all blocks have the same size. Additionally, it enables memory sharing at the block level, both within and across requests.

Key concepts

KV cache

In Transformer-based large language models (LLMs), the attention mechanism uses a KV cache to store the keys and values associated with the tokens generated so far in a sequence. This cache represents the context used to generate the next token. The size of the KV cache grows dynamically as more tokens are generated.

KV blocks

PagedAttention partitions the KV cache of each sequence into fixed-size KV blocks.

Logical and physical KV blocks

Logical KV blocks represent a request's KV cache as a series of blocks, filled from left to right as new tokens and their KV cache are generated.

Physical KV blocks are the actual contiguous chunks of memory allocated on the GPU DRAM or CPU RAM. PagedAttention allows logical blocks to be mapped to non-contiguous physical blocks, providing flexibility in memory allocation.

Block tables

vLLM maintains block tables that map logical KV blocks to physical KV blocks for each request.

Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions. The block table enables efficient tracking and management of the KV cache memory.
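A block table can be sketched as a plain mapping from logical block index to a physical block id plus a fill count. This is an illustrative data structure, not vLLM's internal classes; the block size and the `allocator` object are assumptions.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative; vLLM uses a small fixed size)

@dataclass
class BlockTableEntry:
    physical_block: int   # index into the pool of physical KV blocks on the GPU
    num_filled: int = 0   # how many of the BLOCK_SIZE slots hold token KV vectors

@dataclass
class BlockTable:
    entries: list = field(default_factory=list)

    def slot_for_next_token(self, allocator):
        """Return (physical_block, offset) for the next token's KV vectors,
        allocating a fresh physical block when the last logical block is full."""
        if not self.entries or self.entries[-1].num_filled == BLOCK_SIZE:
            self.entries.append(BlockTableEntry(allocator.allocate()))
        entry = self.entries[-1]
        offset = entry.num_filled
        entry.num_filled += 1
        return entry.physical_block, offset
```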

PagedAttention algorithm

The PagedAttention algorithm computes the attention scores and outputs using a block-wise approach.

In the PagedAttention algorithm, the key and value vectors are divided into fixed-size blocks to enable efficient memory management and computation.

Example of PagedAttention

Consider an example where the key and value vectors are spread across three blocks that are not contiguous in physical memory, as shown in the paper's illustration of the PagedAttention algorithm.

In summary, the PagedAttention algorithm allows the KV blocks to be stored in non-contiguous physical memory.

This is achieved by dividing the key and value vectors into blocks and performing the attention computation block-wise.

Block table translation in vLLM

vLLM uses block tables to map logical KV blocks to physical KV blocks for each request.

The block table entries record the corresponding physical blocks and the number of filled positions for each logical block.

The figure below illustrates an example of block table translation in vLLM.

The logical KV blocks of a request are mapped to non-contiguous physical KV blocks in GPU memory.

As new tokens are generated, vLLM updates the block table and allocates new physical blocks as needed.

In summary, the PagedAttention algorithm, combined with the block table translation in vLLM, allows the KV cache to be efficiently managed and stored in non-contiguous physical memory.

This approach reduces memory fragmentation, enables dynamic growth of the KV cache, and improves memory utilisation in large language model serving systems.

Virtual Memory in Operating Systems and How It Relates to PagedAttention

Virtual memory is a memory management technique used by operating systems to provide a logical address space that is separate from the physical address space.

It allows programs to access memory as if it were a single, contiguous address space, even though the physical memory may be fragmented and spread across different locations.

In a virtual memory system, the operating system (OS) partitions the physical memory into fixed-size pages, typically 4KB in size.

Each program's memory is divided into logical pages of the same size. The OS maintains a page table for each process, which maps the logical pages to the physical pages in memory. When a program accesses a memory location, the OS translates the logical address to the corresponding physical address using the page table.

One of the main benefits of virtual memory is that it allows the OS to allocate physical memory to processes on-demand. When a process requests memory, the OS allocates logical pages to the process, but it doesn't necessarily allocate physical pages until the process actually accesses the memory. This is known as lazy allocation, and it allows the OS to efficiently manage memory by only allocating physical pages when they are needed.

Another benefit of virtual memory is that it allows the OS to swap pages between physical memory and secondary storage (e.g., disk) when there is not enough physical memory to hold all the pages.

When a process accesses a logical page that is not currently in physical memory, the OS can swap out a less frequently used page to disk and load the requested page into memory. This is known as paging, and it allows the OS to run programs that require more memory than is physically available.
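As a concrete (toy) illustration of the page-table idea, the snippet below translates a logical address into a physical one; the page-table contents are arbitrary.

```python
PAGE_SIZE = 4096  # bytes per page, a typical OS default

# Toy page table for one process: logical page number -> physical frame number.
page_table = {0: 7, 1: 3, 2: 12}

def translate(logical_addr: int) -> int:
    """Translate a logical address to a physical address via the page table."""
    page, offset = divmod(logical_addr, PAGE_SIZE)
    frame = page_table[page]            # a missing entry would be a page fault
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))  # logical page 1, offset 0xABC -> 0x3ABC (frame 3)
```

PagedAttention's block table plays exactly the role of this page table, with KV blocks in place of pages.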

How does vLLM relate to PagedAttention?

vLLM is an end-to-end LLM serving system that uses the PagedAttention algorithm at its core.

The system consists of a FastAPI frontend and a GPU-based inference engine.

The frontend extends the OpenAI API interface, allowing users to customise sampling parameters for each request.

What are FastAPI and the OpenAI API?

FastAPI is a modern, high-performance web framework for building APIs with Python 3.6+, based on standard Python type hints.

It is used to build the frontend of the vLLM serving system, which exposes an API for users to interact with the language model.

The OpenAI API is a set of APIs provided by OpenAI that allows developers to access and integrate the functionality of their language models into applications.

The vLLM frontend extends the OpenAI API interface, which means that it provides a compatible API that mimics the functionality and request/response format of the OpenAI API. This allows users who are familiar with the OpenAI API to easily integrate vLLM into their existing applications.

In the context of the vLLM serving system, the FastAPI frontend acts as the entry point for user requests. It receives requests containing prompts and generation parameters, validates them, and passes them to the vLLM engine for processing. Once the vLLM engine generates the output, the FastAPI frontend sends the response back to the user in the same format as the OpenAI API.

By using FastAPI and providing an OpenAI API-compatible interface, vLLM makes it simple for developers to leverage the power of the underlying language model in their applications, without having to learn a new API or make significant changes to their existing codebase.
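A minimal sketch of such a frontend is shown below. It mirrors the shape of an OpenAI-style completions endpoint, but the route, fields, and the `run_engine` helper are illustrative placeholders rather than vLLM's actual server code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    # A subset of OpenAI-style completion parameters.
    model: str
    prompt: str
    max_tokens: int = 16
    temperature: float = 1.0
    n: int = 1            # number of parallel samples

def run_engine(prompt: str, max_tokens: int, temperature: float, n: int):
    # Placeholder for handing the request to the inference engine/scheduler.
    return [f"{prompt} ..." for _ in range(n)]

@app.post("/v1/completions")
def create_completion(req: CompletionRequest):
    outputs = run_engine(req.prompt, req.max_tokens, req.temperature, req.n)
    return {
        "object": "text_completion",
        "model": req.model,
        "choices": [{"index": i, "text": text} for i, text in enumerate(outputs)],
    }
```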

The vLLM engine is implemented in Python and C++/CUDA, with control-related components like the scheduler and block manager developed in Python, while custom CUDA kernels are used for key operations such as PagedAttention.

PagedAttention introduces new memory access patterns that are not efficiently supported by existing systems.

To optimise these patterns, vLLM develops several custom GPU kernels:

Fused reshape and block write

In each Transformer layer, the new KV cache is split into blocks, reshaped to a memory layout optimised for block read, and saved at positions specified by the block table. These operations are fused into a single kernel to minimise kernel launch overheads.

Fusing block read and attention

The attention kernel in FasterTransformer is adapted to read KV cache according to the block table and perform attention operations on the fly. A GPU warp is assigned to read each block to ensure coalesced memory access. Support for variable sequence lengths within a request batch is also added.

Fused block copy

Block copy operations, issued by the copy-on-write mechanism, may operate on discontinuous blocks.

To mitigate the overhead of numerous small data movements using the cudaMemcpyAsync API, vLLM implements a kernel that batches the copy operations for different blocks into a single kernel launch.

vLLM implements various decoding algorithms using three key methods:

  1. fork: Creates a new sequence from an existing one.

  2. append: Appends a new token to the sequence.

  3. free: Deletes the sequence.

These methods are combined to support different decoding algorithms.

For example, in parallel sampling, vLLM creates multiple output sequences from a single input sequence using the fork method, adds new tokens to these sequences in every iteration with append, and deletes sequences that meet a stopping condition using free.

The same strategy is applied in beam search and prefix sharing.
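The composition of the three primitives for parallel sampling might look like the sketch below; `engine` stands in for vLLM's sequence and block management, and its methods are hypothetical.

```python
def parallel_sample(engine, prompt_seq, n: int, max_steps: int):
    """Compose fork/append/free for parallel sampling (illustrative sketch)."""
    # fork: n output sequences that initially share the prompt's KV blocks.
    seqs = [engine.fork(prompt_seq) for _ in range(n)]
    outputs = {s: [] for s in seqs}
    for _ in range(max_steps):
        for s in list(seqs):
            token = engine.next_token(s)
            engine.append(s, token)        # append: store this token's KV cache
            outputs[s].append(token)
            if engine.is_stop(token):
                engine.free(s)             # free: release blocks no one else shares
                seqs.remove(s)
        if not seqs:
            break
    return list(outputs.values())
```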

The integration of PagedAttention in vLLM enables efficient memory management and sharing during the LLM serving process.

By partitioning the KV cache into fixed-size blocks and mapping logical blocks to non-contiguous physical blocks, vLLM reduces memory fragmentation and allows for dynamic memory allocation as new tokens are generated.

The integration of PagedAttention in vLLM, along with custom GPU kernels and support for various decoding algorithms, enables vLLM to deliver state-of-the-art performance while reducing memory fragmentation and overhead, ultimately making LLM serving more efficient and cost-effective.

Other PagedAttention Methods

Parallel Sampling

In some applications, like program assistants, an LLM generates multiple sampled outputs for a single input prompt, allowing users to choose their preferred output.

PagedAttention and vLLM's memory management enable efficient memory sharing in this scenario.

When multiple outputs share the same input prompt, vLLM reserves space for only one copy of the prompt's KV cache. The logical blocks for the prompts of all sequences are mapped to the same physical blocks. This allows the KV cache of the shared prompt to be stored only once, saving memory.

For the generated outputs, vLLM uses a copy-on-write mechanism at the block level.

When a sequence needs to modify a shared block, vLLM creates a new physical block, copies the data from the original block, and updates the mapping for that sequence. This ensures that each sequence has its own copy of the modified block, while still sharing the unchanged blocks.
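A block-level copy-on-write check can be sketched as follows; `ref_counts` and the `allocator` methods are illustrative assumptions, not vLLM's internals.

```python
def write_token_kv(seq_blocks, ref_counts, allocator, kv_vectors):
    """Copy-on-write at block granularity (illustrative sketch).

    seq_blocks: this sequence's list of physical block ids.
    ref_counts: physical block id -> number of sequences currently using it."""
    last = seq_blocks[-1]
    if ref_counts[last] > 1:
        # The block is shared with other sequences: copy it before modifying.
        new_block = allocator.allocate()
        allocator.copy(src=last, dst=new_block)
        ref_counts[last] -= 1
        ref_counts[new_block] = 1
        seq_blocks[-1] = new_block
        last = new_block
    allocator.write(last, kv_vectors)      # safe: no other sequence sees this block
```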

Beam Search

Beam search is a decoding algorithm that maintains a set of top-k most probable partial sequences (candidates) at each step. It allows the LLM to explore multiple high-probability paths and find the most likely output sequence.

With PagedAttention, vLLM enables memory sharing not only for the initial prompt blocks but also for other blocks across different candidates. As the beam search progresses, the candidates share common blocks and diverge only when necessary.

vLLM uses a reference counting mechanism to keep track of the number of candidates sharing each physical block. When a candidate is discarded, the reference counts of its blocks are decremented. When a reference count reaches zero, the corresponding physical block is freed and can be reused for other candidates or sequences.

Reference Counting in Beam Search

In the beam search example, vLLM uses a reference counting mechanism to efficiently manage the sharing of physical blocks among different beam candidates.

Reference counting is a memory management technique used to track the number of references to a particular resource, in this case, a physical block. When a resource is no longer needed, it can be safely deallocated.

Here's how reference counting works in vLLM's beam search implementation:

  1. Each physical block has an associated reference count, which represents the number of logical blocks (i.e., beam candidates) that are currently referring to it.

  2. When a new beam candidate is created and shares a physical block with an existing candidate, the reference count of that physical block is incremented.

  3. As the beam search progresses and candidates are discarded (e.g., candidates 0 and 3 in the example), the reference counts of the physical blocks associated with those candidates are decremented.

  4. When a physical block's reference count reaches zero, it means that no beam candidates are currently using that block, and it can be safely deallocated (e.g., blocks 2, 4, 5, and 8 in the example).

  5. When a new candidate needs to modify a shared physical block (e.g., when generating new tokens), vLLM applies the copy-on-write mechanism. It creates a new physical block, copies the data from the original block, and updates the reference counts accordingly.

The reference counting mechanism allows vLLM to efficiently manage the memory used by the beam candidates, as it enables the system to:

  • Share physical blocks among candidates when possible, reducing memory usage.

  • Keep track of when physical blocks are no longer needed and can be deallocated, preventing memory leaks.

  • Implement the copy-on-write mechanism, which allows candidates to modify shared blocks without affecting other candidates, while minimising the amount of memory copying required.

By using reference counting, vLLM can efficiently manage the memory resources during the beam search process, enabling faster execution and reducing the overall memory footprint of the serving system.
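The bookkeeping when a candidate is discarded can be sketched as below (illustrative only; the block ids are arbitrary):

```python
def discard_candidate(candidate_blocks, ref_counts, free_list):
    """Decrement reference counts for a discarded beam candidate's blocks and
    return blocks that are no longer referenced to the free pool (sketch)."""
    for block in candidate_blocks:
        ref_counts[block] -= 1
        if ref_counts[block] == 0:
            del ref_counts[block]
            free_list.append(block)

# Example: block 0 holds the shared prompt (still used by 3 other candidates);
# blocks 2 and 5 are private to the discarded candidate.
ref_counts = {0: 4, 2: 1, 5: 1}
free_list = []
discard_candidate([0, 2, 5], ref_counts, free_list)
print(ref_counts, free_list)   # {0: 3} [2, 5]
```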

Shared Prefix

In some applications, like machine translation, multiple input prompts may share a common prefix, such as a task description or examples. vLLM allows the LLM service provider to store the KV cache of the shared prefix in advance, reducing redundant computation.

The service provider can reserve a set of physical blocks for predefined shared prefixes.

When a user input prompt contains a shared prefix, vLLM maps the logical blocks of the prompt to the pre-computed physical blocks. This eliminates the need to recompute the KV cache for the shared prefix, saving computation time.
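One way to picture this is a small prefix cache keyed by the shared prefix text; the cache layout and the `compute_kv_blocks` helper below are assumptions for illustration, not vLLM's implementation.

```python
prefix_cache = {}   # shared prefix text -> pre-computed physical KV blocks

def blocks_for_prompt(prompt: str, shared_prefix: str, compute_kv_blocks):
    """Reuse pre-computed KV blocks when the prompt starts with a known prefix."""
    if prompt.startswith(shared_prefix):
        if shared_prefix not in prefix_cache:
            prefix_cache[shared_prefix] = compute_kv_blocks(shared_prefix)
        prefix_blocks = prefix_cache[shared_prefix]            # shared, read-only
        suffix_blocks = compute_kv_blocks(prompt[len(shared_prefix):])
        return prefix_blocks + suffix_blocks
    return compute_kv_blocks(prompt)
```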

Mixed Decoding Methods

vLLM's PagedAttention allows for the simultaneous processing of requests with different decoding preferences, such as parallel sampling, beam search, and shared prefixes.

This is achieved through a common mapping layer that translates logical blocks to physical blocks.

The LLM and its execution kernel work with the physical block IDs provided by the scheduler, without needing to handle the complex memory sharing patterns between sequences. This abstraction enables vLLM to efficiently batch requests with different decoding requirements, improving overall system throughput.

In all these mechanisms, the core concept of PagedAttention remains the same: dividing the KV cache into fixed-size blocks and mapping logical blocks to physical blocks. This allows for flexible memory management, efficient memory sharing, and reduced memory footprint.

By leveraging the ideas of copy-on-write, reference counting, and pre-computation, vLLM extends the benefits of PagedAttention to various decoding scenarios, enabling efficient memory utilisation and improved performance in LLM serving systems.

Application Example

Let's walk through a technical example of how vLLM and PagedAttention handle an incoming request to a Transformer-based language model.

Example: Processing a Request with vLLM and PagedAttention

Suppose a user sends a request to a Transformer-based LLM, such as GPT-3, with the following prompt: "The quick brown fox jumps over the lazy"

The LLM is tasked with generating the next word in the sentence. Let's assume the LLM generates the word "dog".

Step 1: Request Handling

  • The vLLM frontend, built with FastAPI, receives the request containing the prompt and the desired output length (in this case, 1 token).

  • The frontend passes the request to the vLLM engine, which is responsible for processing the request and generating the output.

Step 2: Prompt Processing

  • The vLLM engine tokenizes the prompt into individual tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]

  • vLLM allocates logical KV blocks to store the key and value vectors for the prompt tokens. In this example, let's assume a block size of 4 tokens.

    • Logical Block 0: ["The", "quick", "brown", "fox"]

    • Logical Block 1: ["jumps", "over", "the", "lazy"]

  • The logical blocks are mapped to physical KV blocks in the GPU memory. The block table maintains the mapping:

    • Logical Block 0 -> Physical Block 3

    • Logical Block 1 -> Physical Block 7

Step 3: Attention Computation

  • The PagedAttention algorithm is used to compute the attention scores and values.

  • For each token in the prompt, the query vector is multiplied with the key vectors in the corresponding KV block to compute the attention scores.

  • The attention scores are then used to compute the weighted sum of the value vectors, generating the context vector for each token.

Step 4: Token Generation

  • The LLM's decoder takes the context vectors and generates the next token, "dog", based on the learned language model.

Step 5: KV Cache Update

  • The key and value vectors for the generated token "dog" need to be added to the KV cache.

  • vLLM checks the block table and sees that Logical Block 1 is already full (it holds 4 tokens, the full block size).

  • A new Logical Block 2 is allocated for "dog", and the block table maps it to a free physical block (for example, Physical Block 1).

  • The block table is updated to reflect that Logical Block 2 contains 1 token.

Step 6: Output Generation

  • The generated token "dog" is returned to the vLLM frontend, which sends the response back to the user.

In this example, vLLM efficiently managed the KV cache using PagedAttention.

The prompt tokens were stored in logical KV blocks, which were mapped to physical blocks in the GPU memory.

When generating the new token, vLLM allocated a single additional block on demand for its key and value vectors, rather than reserving memory for the maximum sequence length up front.

If the user were to send a follow-up request with the prompt "The quick brown fox jumps over the lazy dog", vLLM would use the existing KV blocks and allocate new ones as needed, efficiently reusing the memory and reducing fragmentation.

By leveraging PagedAttention and the block table, vLLM can dynamically manage the KV cache memory, allocate memory only when needed, and share memory across requests and sequences, ultimately leading to improved performance and reduced memory footprint in LLM serving.
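The walkthrough can be replayed with a few lines of Python. The physical block ids are arbitrary; the only thing the sketch demonstrates is on-demand, block-granular allocation.

```python
BLOCK = 4  # tokens per KV block, as in the example above
prompt = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]

free_physical = [3, 7, 1, 5]   # pretend free list of physical blocks on the GPU
block_table = []               # [physical_block, num_filled] per logical block

def append_token(token):
    if not block_table or block_table[-1][1] == BLOCK:
        block_table.append([free_physical.pop(0), 0])   # allocate on demand
    block_table[-1][1] += 1

for tok in prompt:
    append_token(tok)
print(block_table)   # [[3, 4], [7, 4]]: logical 0 -> physical 3, logical 1 -> physical 7

append_token("dog")  # both prompt blocks are full, so a third block is allocated
print(block_table)   # [[3, 4], [7, 4], [1, 1]]
```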

Each block contains the key and value vectors for a fixed number of tokens, denoted as the KV block size B. By using fixed-size blocks, PagedAttention reduces internal fragmentation and enables more efficient memory management.

The key block K_j and value block V_j represent the j-th block of keys and values, respectively.

The key block K_j contains the key vectors for a contiguous range of tokens, from token index B(j-1)+1 to token index Bj, where B is the block size. Similarly, the value block V_j contains the value vectors for the same range of tokens.

For example, if the block size B is 4, then:

  • K_1 would contain the key vectors for tokens 1 to 4

  • K_2 would contain the key vectors for tokens 5 to 8

  • V_1 would contain the value vectors for tokens 1 to 4

  • V_2 would contain the value vectors for tokens 5 to 8

The key and value blocks are used in the attention computation to calculate the attention scores and outputs efficiently by processing the tokens in chunks of size B.

For each token (e.g., "forth"), the kernel multiplies its query vector q_i with the key vectors K_j in a block to compute the attention scores A_ij.

Then, the kernel multiplies the attention scores A_ij with the value vectors V_j in a block to derive the final attention output o_i.
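As a sketch of the block-wise computation (NumPy, with tiny illustrative dimensions and no attention mask), accumulating the softmax numerator and denominator block by block gives the same output as attending over one contiguous KV cache:

```python
import numpy as np

B, d = 4, 8                      # block size and head dimension (illustrative)
rng = np.random.default_rng(0)

# Three KV blocks that need not be contiguous in memory; each holds B tokens.
key_blocks   = [rng.standard_normal((B, d)) for _ in range(3)]
value_blocks = [rng.standard_normal((B, d)) for _ in range(3)]
q = rng.standard_normal(d)       # query vector q_i for the current token

num = np.zeros(d)                # running softmax numerator
den = 0.0                        # running softmax denominator
for K_j, V_j in zip(key_blocks, value_blocks):
    scores = np.exp(K_j @ q / np.sqrt(d))   # unnormalised A_ij for block j
    num += scores @ V_j
    den += scores.sum()
o = num / den                    # attention output o_i

# Reference: attention over the concatenated (contiguous) keys and values.
K, V = np.concatenate(key_blocks), np.concatenate(value_blocks)
w = np.exp(K @ q / np.sqrt(d)); w /= w.sum()
assert np.allclose(o, w @ V)
```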

Popular LLMs like GPT, OPT, and LLaMA are implemented using PyTorch and Transformers, and NCCL is used for tensor communication across distributed GPU workers.
Reference: Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv.org)
Figures in the original article: block table translation in vLLM; the PagedAttention algorithm with attention key and value vectors stored as non-contiguous blocks in memory; the vLLM system overview; a parallel sampling example; a beam search example; and a shared prompt example for machine translation.