Paged Attention and vLLM
Copyright Continuum Labs - 2023
This 2023 paper, Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., SOSP 2023), addresses the challenge of efficiently serving large language models (LLMs) in high-throughput scenarios.
LLMs have become increasingly popular for various applications, such as programming assistants and virtual assistants.
Serving these models is computationally expensive, requiring large numbers of hardware accelerators such as GPUs. To reduce the cost per request, the throughput of LLM serving systems must be increased.
The core of LLMs is an autoregressive Transformer model that generates tokens sequentially based on the input prompt and the previously generated tokens.
This process is repeated for each request until a termination token is generated. To improve throughput, multiple requests can be batched together.
However, the memory space for each request must be efficiently managed to process a large number of requests in a batch.
This paper identifies that existing LLM serving systems struggle to manage the key-value cache (KV cache) memory efficiently.
The KV cache stores the dynamic states (attention keys and values) associated with each request and represents the context from earlier tokens that is used to generate new output tokens.
The KV cache grows and shrinks dynamically as the model generates new tokens, and its lifetime and final length are not known in advance.
Existing LLM serving systems allocate contiguous memory for the KV cache based on the maximum sequence length (e.g., 2048 tokens).
This approach leads to two broad classes of problems: inefficient memory use and inefficient batching.
Internal fragmentation
When the actual sequence length is shorter than the maximum length, a significant portion of the allocated memory remains unused, resulting in wasted memory.
External fragmentation
When requests have different maximum sequence lengths, the pre-allocated memory chunks vary in size, leading to inefficient memory utilisation.
Lack of memory sharing
Since the KV cache for each sequence is stored in separate contiguous memory spaces, it is not possible to share memory across sequences, even when they belong to the same request or have overlapping context.
The authors' profiling results show that only 20-40% of the KV cache memory is effectively used to store the actual token states in existing systems, highlighting the inefficiencies in memory management.
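To make the waste concrete, here is a rough back-of-the-envelope calculation in Python, assuming an OPT-13B-style configuration (40 layers, hidden size 5120, FP16 key/value vectors); the exact figures vary by model and are illustrative only.

```python
# Back-of-the-envelope KV cache sizing, assuming an OPT-13B-like model
# (40 layers, hidden size 5120, FP16). Figures are illustrative only.

num_layers = 40
hidden_size = 5120
bytes_per_elem = 2  # FP16
kv_per_token = 2 * num_layers * hidden_size * bytes_per_elem  # keys + values
print(f"KV cache per token: {kv_per_token / 1024:.0f} KB")     # ~800 KB

# Contiguous pre-allocation for the maximum length wastes most of that space
# when the actual sequence turns out to be short.
max_len, actual_len = 2048, 300
reserved = max_len * kv_per_token
used = actual_len * kv_per_token
print(f"Internal fragmentation: {100 * (1 - used / reserved):.0f}% of the reservation")
```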
Requests may arrive at different times, and naïve batching strategies can lead to significant queueing delays. If earlier requests wait for later ones to form a batch, or if incoming requests are delayed until earlier ones finish, it results in increased latency.
Requests may have different input and output lengths. Straightforward batching techniques pad the inputs and outputs to equalise their lengths, wasting GPU computation and memory.
To address these issues, fine-grained batching mechanisms like cellular batching and iteration-level scheduling have been proposed.
These techniques operate at the iteration level instead of the request level.
After each iteration, completed requests are removed from the batch, and new ones are added. This allows new requests to be processed after waiting for a single iteration rather than waiting for the entire batch to complete.
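A minimal sketch of the idea is shown below, assuming a simplified request object and a model_step function that runs one decoding iteration and reports which requests finished; this is not vLLM's actual scheduler, but it shows how requests enter and leave the batch at every iteration.

```python
# Minimal sketch of iteration-level scheduling (names are illustrative,
# not vLLM's actual scheduler API). Requests join and leave the running
# batch after every decoding iteration instead of per-batch.

from collections import deque

def serve(model_step, waiting: deque, max_batch_size: int):
    running = []
    while waiting or running:
        # Admit new requests as soon as there is room in the batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding iteration produces one token per running request
        # and returns the set of requests that just completed.
        finished = model_step(running)

        # Completed requests leave immediately; the rest stay for the next step.
        running = [r for r in running if r not in finished]
```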
To address these limitations, the authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems.
PagedAttention divides the KV cache into fixed-size blocks (similar to memory pages), which can be allocated on-demand and need not be stored in contiguous memory.
This approach reduces internal fragmentation by using smaller blocks and eliminates external fragmentation since all blocks have the same size. Additionally, it enables memory sharing at the block level, both within and across requests.
KV cache
In Transformer-based large language models (LLMs), the attention mechanism uses a KV cache to store the keys and values associated with the tokens generated so far in a sequence. This cache represents the context used to generate the next token. The size of the KV cache grows dynamically as more tokens are generated.
KV blocks
PagedAttention partitions the KV cache of each sequence into fixed-size KV blocks.
Logical and physical KV blocks
Logical KV blocks represent a request's KV cache as a series of blocks, filled from left to right as new tokens and their KV cache are generated.
Physical KV blocks are the actual contiguous chunks of memory allocated on the GPU DRAM or CPU RAM. PagedAttention allows logical blocks to be mapped to non-contiguous physical blocks, providing flexibility in memory allocation.
Block tables
vLLM maintains block tables that map logical KV blocks to physical KV blocks for each request.
Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions. The block table enables efficient tracking and management of the KV cache memory.
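A block table can be pictured as a small per-request data structure along these lines; the class and field names below are illustrative stand-ins rather than vLLM's real block manager.

```python
# Illustrative sketch of a per-request block table: one entry per logical
# block, recording the physical block it maps to and how many of its
# positions are filled. A simplification, not vLLM's actual implementation.

from dataclasses import dataclass

@dataclass
class BlockTableEntry:
    physical_block: int   # index into the GPU's pool of physical KV blocks
    num_filled: int       # how many token positions in the block are in use

class BlockTable:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.entries: list[BlockTableEntry] = []  # index = logical block number

    def append_token(self, alloc_physical_block):
        """Record one new token, allocating a physical block only when needed."""
        if not self.entries or self.entries[-1].num_filled == self.block_size:
            self.entries.append(BlockTableEntry(alloc_physical_block(), 0))
        self.entries[-1].num_filled += 1
```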
The PagedAttention algorithm computes the attention scores and outputs using a block-wise approach.
In the PagedAttention algorithm, the key and value vectors are divided into fixed-size blocks to enable efficient memory management and computation.
Consider an example where the key and value vectors are spread across three blocks, which are not contiguous in physical memory (as shown in the figure below).
In summary, the PagedAttention algorithm allows the KV blocks to be stored in non-contiguous physical memory.
This is achieved by dividing the key and value vectors into blocks and performing the attention computation block-wise.
vLLM uses block tables to map logical KV blocks to physical KV blocks for each request.
The block table entries record the corresponding physical blocks and the number of filled positions for each logical block.
The figure below illustrates an example of block table translation in vLLM.
The logical KV blocks of a request are mapped to non-contiguous physical KV blocks in GPU memory.
As new tokens are generated, vLLM updates the block table and allocates new physical blocks as needed.
In summary, the PagedAttention algorithm, combined with the block table translation in vLLM, allows the KV cache to be efficiently managed and stored in non-contiguous physical memory.
This approach reduces memory fragmentation, enables dynamic growth of the KV cache, and improves memory utilisation in large language model serving systems.
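The sketch below shows the gist in NumPy for a single query vector: the block table tells us which physical blocks to read, and attention is accumulated block by block. The function and argument names are assumptions for illustration; in vLLM the real computation happens inside a fused CUDA kernel.

```python
# Sketch of block-wise attention for a single query vector, reading K/V
# from non-contiguous physical blocks through a block table.

import numpy as np

def paged_attention(query, k_pool, v_pool, block_table, filled_in_last):
    """query: (d,); k_pool/v_pool: (num_phys_blocks, block_size, d);
    block_table: list of physical block indices (logical order);
    filled_in_last: number of valid tokens in the final block."""
    d = query.shape[0]
    scores, values = [], []
    for i, phys in enumerate(block_table):
        n = filled_in_last if i == len(block_table) - 1 else k_pool.shape[1]
        scores.append(k_pool[phys, :n] @ query / np.sqrt(d))  # (n,) scores for this block
        values.append(v_pool[phys, :n])                       # (n, d) values for this block
    scores = np.concatenate(scores)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over all tokens
    return weights @ np.vstack(values)                        # (d,) attention output
```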
vLLM is an end-to-end LLM serving system that uses the PagedAttention algorithm at its core.
The system consists of a FastAPI frontend and a GPU-based inference engine.
The frontend extends the OpenAI API interface, allowing users to customise sampling parameters for each request.
The vLLM engine is implemented in Python and C++/CUDA, with control-related components like the scheduler and block manager developed in Python, while custom CUDA kernels are used for key operations such as PagedAttention.
Popular LLMs like GPT, OPT, and LLaMA are implemented using PyTorch and Transformers, and NCCL is used for tensor communication across distributed GPU workers.
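A request to such a frontend might look like the following; the endpoint path, model name, and field names follow the OpenAI completions format and are assumptions about the deployment rather than a documented interface.

```python
# Example client call, assuming the frontend exposes an OpenAI-style
# completions endpoint (the exact URL and fields depend on the deployment).

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-13b",   # whichever model the server is serving
        "prompt": "The quick brown fox jumps over the lazy",
        "max_tokens": 1,
        "temperature": 0.8,            # per-request sampling parameters
        "n": 4,                        # parallel sampling: several outputs per prompt
    },
)
print(resp.json())
```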
PagedAttention introduces new memory access patterns that are not efficiently supported by existing systems.
To optimise these patterns, vLLM develops several custom GPU kernels:
Fused reshape and block write
In each Transformer layer, the new KV cache is split into blocks, reshaped to a memory layout optimised for block read, and saved at positions specified by the block table. These operations are fused into a single kernel to minimise kernel launch overheads.
Fusing block read and attention
The attention kernel in FasterTransformer is adapted to read KV cache according to the block table and perform attention operations on the fly. A GPU warp is assigned to read each block to ensure coalesced memory access. Support for variable sequence lengths within a request batch is also added.
Fused block copy
Block copy operations, issued by the copy-on-write mechanism, may operate on discontinuous blocks.
To mitigate the overhead of numerous small data movements using the cudaMemcpyAsync API, vLLM implements a kernel that batches the copy operations for different blocks into a single kernel launch.
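Conceptually, the batching looks like the following PyTorch sketch: collect the pending (source, destination) block pairs and issue one indexed copy instead of a loop of small copies. This illustrates the idea only; it is not the actual vLLM CUDA kernel.

```python
# Conceptual sketch of batching block copies: instead of one small copy per
# block (many kernel launches / cudaMemcpyAsync calls), gather all pending
# (src, dst) pairs and issue a single indexed copy.

import torch

def batched_block_copy(kv_pool: torch.Tensor, pairs: list[tuple[int, int]]):
    """kv_pool: (num_physical_blocks, block_size, head_dim); pairs: (src, dst) block indices."""
    src = torch.tensor([s for s, _ in pairs], device=kv_pool.device)
    dst = torch.tensor([d for _, d in pairs], device=kv_pool.device)
    kv_pool[dst] = kv_pool[src]  # one vectorised copy covering every pending block
```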
vLLM implements various decoding algorithms using three key methods:
fork: Creates a new sequence from an existing one.
append: Appends a new token to the sequence.
free: Deletes the sequence.
These methods are combined to support different decoding algorithms.
For example, in parallel sampling, vLLM creates multiple output sequences from a single input sequence using the fork method, adds new tokens to these sequences in every iteration with append, and deletes sequences that meet a stopping condition using free.
The same strategy is applied in beam search and prefix sharing.
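Sketched in Python, parallel sampling with these primitives might look as follows; fork, append, and free are the paper's method names, while the engine and sequence objects here are hypothetical stand-ins rather than vLLM's real classes.

```python
# How the three primitives might compose for parallel sampling.
# The engine/sequence objects are hypothetical stand-ins.

def parallel_sampling(engine, prompt_seq, n_samples, max_new_tokens):
    # One child per sample; children share the prompt's KV blocks until they diverge.
    children = [engine.fork(prompt_seq) for _ in range(n_samples)]

    for _ in range(max_new_tokens):
        for seq in list(children):
            token = engine.sample_next_token(seq)
            engine.append(seq, token)     # may trigger copy-on-write on a shared block
            if token == engine.eos_token:
                engine.free(seq)          # release the sequence's KV blocks
                children.remove(seq)
        if not children:
            break
```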
The integration of PagedAttention in vLLM enables efficient memory management and sharing during the LLM serving process.
By partitioning the KV cache into fixed-size blocks and mapping logical blocks to non-contiguous physical blocks, vLLM reduces memory fragmentation and allows for dynamic memory allocation as new tokens are generated.
The integration of PagedAttention in vLLM, along with custom GPU kernels and support for various decoding algorithms, enables vLLM to deliver state-of-the-art performance while reducing memory fragmentation and overhead, ultimately making LLM serving more efficient and cost-effective.
In some applications, like program assistants, an LLM generates multiple sampled outputs for a single input prompt, allowing users to choose their preferred output.
PagedAttention and vLLM's memory management enable efficient memory sharing in this scenario.
When multiple outputs share the same input prompt, vLLM reserves space for only one copy of the prompt's KV cache. The logical blocks for the prompts of all sequences are mapped to the same physical blocks. This allows the KV cache of the shared prompt to be stored only once, saving memory.
For the generated outputs, vLLM uses a copy-on-write mechanism at the block level.
When a sequence needs to modify a shared block, vLLM creates a new physical block, copies the data from the original block, and updates the mapping for that sequence. This ensures that each sequence has its own copy of the modified block, while still sharing the unchanged blocks.
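A minimal sketch of block-level copy-on-write with reference counting, using illustrative names rather than vLLM's real block manager:

```python
# Minimal sketch of block-level copy-on-write with reference counts.

class BlockPool:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}

    def allocate(self):
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        # Another sequence maps the same physical block.
        self.ref_count[block] += 1

    def write(self, block, copy_fn):
        """Return a block that is safe for this sequence to modify."""
        if self.ref_count[block] == 1:
            return block                   # sole owner: modify in place
        self.ref_count[block] -= 1         # drop our reference to the shared block
        new_block = self.allocate()
        copy_fn(src=block, dst=new_block)  # copy-on-write
        return new_block
```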
Beam search is a decoding algorithm that maintains a set of top-k most probable partial sequences (candidates) at each step. It allows the LLM to explore multiple high-probability paths and find the most likely output sequence.
With PagedAttention, vLLM enables memory sharing not only for the initial prompt blocks but also for other blocks across different candidates. As the beam search progresses, the candidates share common blocks and diverge only when necessary.
vLLM uses a reference counting mechanism to keep track of the number of candidates sharing each physical block. When a candidate is discarded, the reference counts of its blocks are decremented. When a reference count reaches zero, the corresponding physical block is freed and can be reused for other candidates or sequences.
In some applications, like machine translation, multiple input prompts may share a common prefix, such as a task description or examples. vLLM allows the LLM service provider to store the KV cache of the shared prefix in advance, reducing redundant computation.
The service provider can reserve a set of physical blocks for predefined shared prefixes.
When a user input prompt contains a shared prefix, vLLM maps the logical blocks of the prompt to the pre-computed physical blocks. This eliminates the need to recompute the KV cache for the shared prefix, saving computation time.
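One way to picture the lookup, with hypothetical names: if the prompt starts with a known prefix whose KV blocks were computed ahead of time, the block table simply points at those cached physical blocks, and only the rest of the prompt receives fresh allocations.

```python
# Sketch of prefix sharing: logical blocks covering a predefined shared
# prefix point at pre-computed physical blocks; only the remainder of the
# prompt gets fresh allocations. Names are illustrative.

def build_block_table(prompt_tokens, shared_prefixes, block_size, allocate):
    """shared_prefixes: dict mapping a token tuple to its cached physical block IDs."""
    for prefix, cached_blocks in shared_prefixes.items():
        if tuple(prompt_tokens[:len(prefix)]) == prefix and len(prefix) % block_size == 0:
            table = list(cached_blocks)      # reuse, no recomputation
            remaining = len(prompt_tokens) - len(prefix)
            break
    else:
        table, remaining = [], len(prompt_tokens)

    for _ in range((remaining + block_size - 1) // block_size):
        table.append(allocate())             # fresh blocks for the new tokens
    return table
```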
vLLM's PagedAttention allows for the simultaneous processing of requests with different decoding preferences, such as parallel sampling, beam search, and shared prefixes.
This is achieved through a common mapping layer that translates logical blocks to physical blocks.
The LLM and its execution kernel work with the physical block IDs provided by the scheduler, without needing to handle the complex memory sharing patterns between sequences. This abstraction enables vLLM to efficiently batch requests with different decoding requirements, improving overall system throughput.
In all these mechanisms, the core concept of PagedAttention remains the same: dividing the KV cache into fixed-size blocks and mapping logical blocks to physical blocks. This allows for flexible memory management, efficient memory sharing, and reduced memory footprint.
By leveraging the ideas of copy-on-write, reference counting, and pre-computation, vLLM extends the benefits of PagedAttention to various decoding scenarios, enabling efficient memory utilisation and improved performance in LLM serving systems.
Let's walk through a technical example of how vLLM and PagedAttention handle an incoming request to a Transformer-based language model.
Suppose a user sends a request to a Transformer-based LLM, such as GPT-3, with the following prompt: "The quick brown fox jumps over the lazy"
The LLM is tasked with generating the next word in the sentence. Let's assume the LLM generates the word "dog".
The vLLM frontend, built with FastAPI, receives the request containing the prompt and the desired output length (in this case, 1 token).
The frontend passes the request to the vLLM engine, which is responsible for processing the request and generating the output.
The vLLM engine tokenizes the prompt into individual tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]
vLLM allocates logical KV blocks to store the key and value vectors for the prompt tokens. In this example, let's assume a block size of 4 tokens.
Logical Block 0: ["The", "quick", "brown", "fox"]
Logical Block 1: ["jumps", "over", "the", "lazy"]
The logical blocks are mapped to physical KV blocks in the GPU memory. The block table maintains the mapping:
Logical Block 0 -> Physical Block 3
Logical Block 1 -> Physical Block 7
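Written out as plain data (the physical block numbers are the ones assumed in this example), the table at this point looks like:

```python
# The walkthrough's block table: each logical block records its physical
# block and how many of its positions are filled.

block_table = {
    0: {"physical_block": 3, "num_filled": 4},   # ["The", "quick", "brown", "fox"]
    1: {"physical_block": 7, "num_filled": 4},   # ["jumps", "over", "the", "lazy"]
}
```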
The PagedAttention algorithm is used to compute the attention scores and values.
For each token in the prompt, the query vector is multiplied with the key vectors in the corresponding KV block to compute the attention scores.
The attention scores are then used to compute the weighted sum of the value vectors, generating the context vector for each token.
The LLM's decoder takes the context vectors and generates the next token, "dog", based on the learned language model.
The key and value vectors for the generated token "dog" need to be added to the KV cache.
vLLM checks the block table and sees that Logical Block 1 is already full: all 4 of its positions hold the prompt tokens ["jumps", "over", "the", "lazy"].
vLLM therefore allocates a new logical block (Logical Block 2), maps it to a free physical block (say, Physical Block 1), and writes the key and value vectors for "dog" into its first position.
The block table is updated with the new mapping, recording that Logical Block 2 contains 1 filled position.
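Continuing the same illustrative data structure from above, the append adds one entry:

```python
# After generating "dog": Logical Block 1 was already full, so a new logical
# block is allocated on demand and mapped to a free physical block
# (Physical Block 1 in this hypothetical numbering).

block_table[2] = {"physical_block": 1, "num_filled": 1}   # ["dog"]
```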
The generated token "dog" is returned to the vLLM frontend, which sends the response back to the user.
In this example, vLLM efficiently managed the KV cache using PagedAttention.
The prompt tokens were stored in logical KV blocks, which were mapped to physical blocks in the GPU memory.
When generating the new token, vLLM allocated just one additional block on demand, rather than reserving contiguous memory for the maximum possible sequence length, avoiding unnecessary memory allocation.
If the user were to send a follow-up request with the prompt "The quick brown fox jumps over the lazy dog", vLLM would use the existing KV blocks and allocate new ones as needed, efficiently reusing the memory and reducing fragmentation.
By leveraging PagedAttention and the block table, vLLM can dynamically manage the KV cache memory, allocate memory only when needed, and share memory across requests and sequences, ultimately leading to improved performance and reduced memory footprint in LLM serving.
Each block contains the key and value vectors for a fixed number of tokens, denoted as the KV block size B. By using fixed-size blocks, PagedAttention reduces internal fragmentation and enables more efficient memory management.
The key block K_j and the value block V_j represent the j-th block of keys and values, respectively.
The key block K_j contains the key vectors for a contiguous range of tokens, from token index (j − 1)B + 1 to token index jB, where B is the block size.
Similarly, the value block V_j contains the value vectors for the same range of tokens.
For example, if the block size B is 4, then:
K_1 would contain the key vectors for tokens 1 to 4
K_2 would contain the key vectors for tokens 5 to 8
V_1 would contain the value vectors for tokens 1 to 4
V_2 would contain the value vectors for tokens 5 to 8
The key and value blocks are used in the attention computation to calculate the attention scores and outputs efficiently by processing the tokens in chunks of size B.
For each token (e.g., "forth"), the kernel multiplies its query vector q_i with the key vectors in a block K_j to compute the attention scores A_ij.
Then, the kernel multiplies the attention scores with the value vectors in a block V_j to derive the final attention output o_i.
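Written out, the block-wise computation is simply scaled dot-product attention regrouped over blocks; the restatement below is consistent with the paper's block-wise formulation, with q_i the query vector of token i, d the head dimension, A_ij the row vector of token i's attention scores over the j-th block, and 1 an all-ones vector that sums the exponentiated scores within each block for the softmax normaliser.

```latex
% Block-wise attention over KV blocks K_j = (k_{(j-1)B+1}, ..., k_{jB}) and
% V_j = (v_{(j-1)B+1}, ..., v_{jB}); token i attends to blocks 1 through ceil(i/B).
A_{ij} = \frac{\exp\!\left(q_i^{\top} K_j / \sqrt{d}\right)}
              {\sum_{t=1}^{\lceil i/B \rceil} \exp\!\left(q_i^{\top} K_t / \sqrt{d}\right) \mathbf{1}},
\qquad
o_i = \sum_{j=1}^{\lceil i/B \rceil} V_j \, A_{ij}^{\top}
```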