Paged Attention and vLLM

This 2023 paper, Efficient Memory Management for Large Language Model Serving with PagedAttention, addresses the challenge of efficiently serving large language models (LLMs) in high-throughput scenarios.

LLMs have become increasingly popular for various applications, such as programming assistants and virtual assistants.

Serving these models is computationally expensive, requiring large numbers of hardware accelerators such as GPUs. To reduce the cost per request, the throughput of LLM serving systems must be increased.

How can this be achieved?

The core of LLMs is an autoregressive Transformer model that generates tokens sequentially based on the input prompt and the previously generated tokens.

This process is repeated for each request until a termination token is generated. To improve throughput, multiple requests can be batched together.

However, the memory space for each request must be efficiently managed to process a large number of requests in a batch.

This paper identifies that existing LLM serving systems struggle to manage the key-value cache (KV cache) memory efficiently.

The KV cache stores the dynamic states (attention keys and values) associated with each request and represents the context from earlier tokens used to generate new output tokens.

The KV cache memory grows and shrinks dynamically as the model generates new tokens, and its lifetime and length are not known in advance.

KV Cache Problem

In Transformer-based models, the attention mechanism uses a KV cache to store the keys and values associated with the tokens generated so far in a sequence.

This cache represents the context used to generate the next token. The KV cache grows dynamically as more tokens are generated, and its size is not known in advance.

Existing LLM serving systems allocate contiguous memory for the KV cache based on the maximum sequence length (e.g., 2048 tokens).

This approach leads to two main problems: inefficient memory use and poor batching behaviour.

Memory fragmentation

Internal fragmentation

When the actual sequence length is shorter than the maximum length, a significant portion of the allocated memory remains unused, resulting in wasted memory.

External fragmentation

When requests have different maximum sequence lengths, the pre-allocated memory chunks vary in size, leading to inefficient memory utilisation.

Lack of memory sharing

Since the KV cache for each sequence is stored in separate contiguous memory spaces, it is not possible to share memory across sequences, even when they belong to the same request or have overlapping context.

The authors' profiling results show that only 20-40% of the KV cache memory is effectively used to store the actual token states in existing systems, highlighting the inefficiencies in memory management.
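To make the scale of this waste concrete, here is a back-of-the-envelope calculation in Python. The model dimensions are illustrative, not taken from the paper, but the pattern is general: reserving memory for the maximum sequence length wastes whatever the request does not use.

```python
# Rough KV cache sizing for one request (illustrative model dimensions).
n_layers = 40          # transformer layers
n_heads = 40           # attention heads
head_dim = 128         # dimension per head
bytes_per_elem = 2     # FP16

# Keys + values for one token across all layers.
kv_bytes_per_token = 2 * n_layers * n_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")

max_seq_len = 2048     # memory reserved up front for the request
actual_len = 500       # tokens actually generated for this request

reserved = max_seq_len * kv_bytes_per_token
used = actual_len * kv_bytes_per_token
print(f"reserved: {reserved / 2**20:.0f} MiB, used: {used / 2**20:.0f} MiB, "
      f"wasted: {100 * (1 - used / reserved):.0f}%")
```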

Typical Batching Technique Issues

Queueing delays

Requests may arrive at different times, and naïve batching strategies can lead to significant queueing delays. If earlier requests wait for later ones to form a batch, or if incoming requests are delayed until earlier ones finish, it results in increased latency.

Inefficient padding

Requests may have different input and output lengths. Straightforward batching techniques pad the inputs and outputs to equalise their lengths, wasting GPU computation and memory.

To address these issues, fine-grained batching mechanisms like cellular batching and iteration-level scheduling have been proposed.

These techniques operate at the iteration level instead of the request level.

After each iteration, completed requests are removed from the batch, and new ones are added. This allows new requests to be processed after waiting for a single iteration rather than waiting for the entire batch to complete.
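The loop below sketches iteration-level scheduling in Python. The `engine` and request objects are hypothetical stand-ins, not vLLM's actual API; the point is simply that admission and retirement happen on every iteration rather than once per batch.

```python
from collections import deque

def serve(engine, request_queue: deque, max_batch_size: int = 32):
    """Iteration-level scheduling sketch: admit and retire requests on every
    decoding iteration instead of once per batch (engine/request API is hypothetical)."""
    running = []
    while running or request_queue:
        # Admit newly arrived requests at the start of each iteration.
        while request_queue and len(running) < max_batch_size:
            running.append(request_queue.popleft())

        # One decoding iteration: each running request generates one token.
        for req in running:
            req.append_token(engine.generate_next_token(req))

        # Retire finished requests immediately; their slots free up next iteration.
        running = [r for r in running if not r.is_finished()]
```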

Paged Attention

To address these limitations, the authors propose PagedAttention, an attention algorithm inspired by virtual memory and paging in operating systems.

PagedAttention divides the KV cache into fixed-size blocks (similar to memory pages), which can be allocated on-demand and need not be stored in contiguous memory.

This approach reduces internal fragmentation by using smaller blocks and eliminates external fragmentation since all blocks have the same size. Additionally, it enables memory sharing at the block level, both within and across requests.

Key concepts

KV cache

In Transformer-based large language models (LLMs), the attention mechanism uses a KV cache to store the keys and values associated with the tokens generated so far in a sequence. This cache represents the context used to generate the next token. The size of the KV cache grows dynamically as more tokens are generated.

KV blocks

PagedAttention partitions the KV cache of each sequence into fixed-size KV blocks.

Logical and physical KV blocks

Logical KV blocks represent a request's KV cache as a series of blocks, filled from left to right as new tokens and their KV cache are generated.

Physical KV blocks are the actual contiguous chunks of memory allocated on the GPU DRAM or CPU RAM. PagedAttention allows logical blocks to be mapped to non-contiguous physical blocks, providing flexibility in memory allocation.

Block tables

vLLM maintains block tables that map logical KV blocks to physical KV blocks for each request.

Each block table entry records the corresponding physical blocks of a logical block and the number of filled positions. The block table enables efficient tracking and management of the KV cache memory.
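A block table can be sketched as a plain mapping from logical block index to a physical block id plus a fill count. This is an illustrative data structure, not vLLM's internal classes; the block size and the `allocator` object are assumptions.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative; vLLM uses a small fixed size)

@dataclass
class BlockTableEntry:
    physical_block: int   # index into the pool of physical KV blocks on the GPU
    num_filled: int = 0   # how many of the BLOCK_SIZE slots hold token KV vectors

@dataclass
class BlockTable:
    entries: list = field(default_factory=list)

    def slot_for_next_token(self, allocator):
        """Return (physical_block, offset) for the next token's KV vectors,
        allocating a fresh physical block when the last logical block is full."""
        if not self.entries or self.entries[-1].num_filled == BLOCK_SIZE:
            self.entries.append(BlockTableEntry(allocator.allocate()))
        entry = self.entries[-1]
        offset = entry.num_filled
        entry.num_filled += 1
        return entry.physical_block, offset
```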

PagedAttention algorithm

The PagedAttention algorithm computes the attention scores and outputs using a block-wise approach.

In the PagedAttention algorithm, the key and value vectors are divided into fixed-size blocks to enable efficient memory management and computation.

Example of PagedAttention

Consider an example where the key and value vectors are spread across three blocks that are not contiguous in physical memory, as shown in the paper's illustration of the PagedAttention algorithm.

In summary, the PagedAttention algorithm allows the KV blocks to be stored in non-contiguous physical memory.

This is achieved by dividing the key and value vectors into blocks and performing the attention computation block-wise.

Block table translation in vLLM

vLLM uses block tables to map logical KV blocks to physical KV blocks for each request.

The block table entries record the corresponding physical blocks and the number of filled positions for each logical block.

The figure below illustrates an example of block table translation in vLLM.

The logical KV blocks of a request are mapped to non-contiguous physical KV blocks in GPU memory.

As new tokens are generated, vLLM updates the block table and allocates new physical blocks as needed.

In summary, the PagedAttention algorithm, combined with the block table translation in vLLM, allows the KV cache to be efficiently managed and stored in non-contiguous physical memory.

This approach reduces memory fragmentation, enables dynamic growth of the KV cache, and improves memory utilisation in large language model serving systems.

Virtual Memory in Operating Systems and How It Relates to PagedAttention

Virtual memory is a memory management technique used by operating systems to provide a logical address space that is separate from the physical address space.

It allows programs to access memory as if it were a single, contiguous address space, even though the physical memory may be fragmented and spread across different locations.

In a virtual memory system, the operating system (OS) partitions the physical memory into fixed-size pages, typically 4KB in size.

Each program's memory is divided into logical pages of the same size. The OS maintains a page table for each process, which maps the logical pages to the physical pages in memory. When a program accesses a memory location, the OS translates the logical address to the corresponding physical address using the page table.

One of the main benefits of virtual memory is that it allows the OS to allocate physical memory to processes on-demand. When a process requests memory, the OS allocates logical pages to the process, but it doesn't necessarily allocate physical pages until the process actually accesses the memory. This is known as lazy allocation, and it allows the OS to efficiently manage memory by only allocating physical pages when they are needed.

Another benefit of virtual memory is that it allows the OS to swap pages between physical memory and secondary storage (e.g., disk) when there is not enough physical memory to hold all the pages.

When a process accesses a logical page that is not currently in physical memory, the OS can swap out a less frequently used page to disk and load the requested page into memory. This is known as paging, and it allows the OS to run programs that require more memory than is physically available.
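As a concrete (toy) illustration of the page-table idea, the snippet below translates a logical address into a physical one; the page-table contents are arbitrary.

```python
PAGE_SIZE = 4096  # bytes per page, a typical OS default

# Toy page table for one process: logical page number -> physical frame number.
page_table = {0: 7, 1: 3, 2: 12}

def translate(logical_addr: int) -> int:
    """Translate a logical address to a physical address via the page table."""
    page, offset = divmod(logical_addr, PAGE_SIZE)
    frame = page_table[page]            # a missing entry would be a page fault
    return frame * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))  # logical page 1, offset 0xABC -> 0x3ABC (frame 3)
```

PagedAttention's block table plays exactly the role of this page table, with KV blocks in place of pages.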

How does vLLM relate to PagedAttention?

vLLM is an end-to-end LLM serving system that uses the PagedAttention algorithm at its core.

The system consists of a FastAPI frontend and a GPU-based inference engine.

The frontend extends the OpenAI API interface, allowing users to customise sampling parameters for each request.

What are FastAPI and the OpenAI API?

FastAPI is a modern, high-performance web framework for building APIs with Python 3.6+, based on standard Python type hints.

It is used to build the frontend of the vLLM serving system, which exposes an API for users to interact with the language model.

The OpenAI API is a set of APIs provided by OpenAI that allows developers to access and integrate the functionality of their language models into applications.

The vLLM frontend extends the OpenAI API interface, which means that it provides a compatible API that mimics the functionality and request/response format of the OpenAI API. This allows users who are familiar with the OpenAI API to easily integrate vLLM into their existing applications.

In the context of the vLLM serving system, the FastAPI frontend acts as the entry point for user requests. It receives requests containing prompts and generation parameters, validates them, and passes them to the vLLM engine for processing. Once the vLLM engine generates the output, the FastAPI frontend sends the response back to the user in the same format as the OpenAI API.

By using FastAPI and providing an OpenAI API-compatible interface, vLLM makes it simple for developers to leverage the power of the underlying language model in their applications, without having to learn a new API or make significant changes to their existing codebase.
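A minimal sketch of such a frontend is shown below. It mirrors the shape of an OpenAI-style completions endpoint, but the route, fields, and the `run_engine` helper are illustrative placeholders rather than vLLM's actual server code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    # A subset of OpenAI-style completion parameters.
    model: str
    prompt: str
    max_tokens: int = 16
    temperature: float = 1.0
    n: int = 1            # number of parallel samples

def run_engine(prompt: str, max_tokens: int, temperature: float, n: int):
    # Placeholder for handing the request to the inference engine/scheduler.
    return [f"{prompt} ..." for _ in range(n)]

@app.post("/v1/completions")
def create_completion(req: CompletionRequest):
    outputs = run_engine(req.prompt, req.max_tokens, req.temperature, req.n)
    return {
        "object": "text_completion",
        "model": req.model,
        "choices": [{"index": i, "text": text} for i, text in enumerate(outputs)],
    }
```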

The vLLM engine is implemented in Python and C++/CUDA, with control-related components like the scheduler and block manager developed in Python, while custom CUDA kernels are used for key operations such as PagedAttention.

PagedAttention introduces new memory access patterns that are not efficiently supported by existing systems.

To optimise these patterns, vLLM develops several custom GPU kernels:

Fused reshape and block write

In each Transformer layer, the new KV cache is split into blocks, reshaped to a memory layout optimised for block read, and saved at positions specified by the block table. These operations are fused into a single kernel to minimise kernel launch overheads.

Fusing block read and attention

The attention kernel in FasterTransformer is adapted to read KV cache according to the block table and perform attention operations on the fly. A GPU warp is assigned to read each block to ensure coalesced memory access. Support for variable sequence lengths within a request batch is also added.

Fused block copy

Block copy operations, issued by the copy-on-write mechanism, may operate on discontinuous blocks.

To mitigate the overhead of numerous small data movements using the cudaMemcpyAsync API, vLLM implements a kernel that batches the copy operations for different blocks into a single kernel launch.

vLLM implements various decoding algorithms using three key methods:

  1. fork: Creates a new sequence from an existing one.

  2. append: Appends a new token to the sequence.

  3. free: Deletes the sequence.

These methods are combined to support different decoding algorithms.

For example, in parallel sampling, vLLM creates multiple output sequences from a single input sequence using the fork method, adds new tokens to these sequences in every iteration with append, and deletes sequences that meet a stopping condition using free.

The same strategy is applied in beam search and prefix sharing.
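The composition of the three primitives for parallel sampling might look like the sketch below; `engine` stands in for vLLM's sequence and block management, and its methods are hypothetical.

```python
def parallel_sample(engine, prompt_seq, n: int, max_steps: int):
    """Compose fork/append/free for parallel sampling (illustrative sketch)."""
    # fork: n output sequences that initially share the prompt's KV blocks.
    seqs = [engine.fork(prompt_seq) for _ in range(n)]
    outputs = {s: [] for s in seqs}
    for _ in range(max_steps):
        for s in list(seqs):
            token = engine.next_token(s)
            engine.append(s, token)        # append: store this token's KV cache
            outputs[s].append(token)
            if engine.is_stop(token):
                engine.free(s)             # free: release blocks no one else shares
                seqs.remove(s)
        if not seqs:
            break
    return list(outputs.values())
```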

The integration of PagedAttention in vLLM enables efficient memory management and sharing during the LLM serving process.

By partitioning the KV cache into fixed-size blocks and mapping logical blocks to non-contiguous physical blocks, vLLM reduces memory fragmentation and allows for dynamic memory allocation as new tokens are generated.

The integration of PagedAttention in vLLM, along with custom GPU kernels and support for various decoding algorithms, enables vLLM to deliver state-of-the-art performance while reducing memory fragmentation and overhead, ultimately making LLM serving more efficient and cost-effective.

Other PagedAttention Methods

Parallel Sampling

In some applications, like program assistants, an LLM generates multiple sampled outputs for a single input prompt, allowing users to choose their preferred output.

PagedAttention and vLLM's memory management enable efficient memory sharing in this scenario.

When multiple outputs share the same input prompt, vLLM reserves space for only one copy of the prompt's KV cache. The logical blocks for the prompts of all sequences are mapped to the same physical blocks. This allows the KV cache of the shared prompt to be stored only once, saving memory.

For the generated outputs, vLLM uses a copy-on-write mechanism at the block level.

When a sequence needs to modify a shared block, vLLM creates a new physical block, copies the data from the original block, and updates the mapping for that sequence. This ensures that each sequence has its own copy of the modified block, while still sharing the unchanged blocks.
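A block-level copy-on-write check can be sketched as follows; `ref_counts` and the `allocator` methods are illustrative assumptions, not vLLM's internals.

```python
def write_token_kv(seq_blocks, ref_counts, allocator, kv_vectors):
    """Copy-on-write at block granularity (illustrative sketch).

    seq_blocks: this sequence's list of physical block ids.
    ref_counts: physical block id -> number of sequences currently using it."""
    last = seq_blocks[-1]
    if ref_counts[last] > 1:
        # The block is shared with other sequences: copy it before modifying.
        new_block = allocator.allocate()
        allocator.copy(src=last, dst=new_block)
        ref_counts[last] -= 1
        ref_counts[new_block] = 1
        seq_blocks[-1] = new_block
        last = new_block
    allocator.write(last, kv_vectors)      # safe: no other sequence sees this block
```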

Beam Search

Beam search is a decoding algorithm that maintains a set of top-k most probable partial sequences (candidates) at each step. It allows the LLM to explore multiple high-probability paths and find the most likely output sequence.

With PagedAttention, vLLM enables memory sharing not only for the initial prompt blocks but also for other blocks across different candidates. As the beam search progresses, the candidates share common blocks and diverge only when necessary.

vLLM uses a reference counting mechanism to keep track of the number of candidates sharing each physical block. When a candidate is discarded, the reference counts of its blocks are decremented. When a reference count reaches zero, the corresponding physical block is freed and can be reused for other candidates or sequences.

Reference Counting in Beam Search

In the beam search example, vLLM uses a reference counting mechanism to efficiently manage the sharing of physical blocks among different beam candidates.

Reference counting is a memory management technique used to track the number of references to a particular resource, in this case, a physical block. When a resource is no longer needed, it can be safely deallocated.

Here's how reference counting works in vLLM's beam search implementation:

  1. Each physical block has an associated reference count, which represents the number of logical blocks (i.e., beam candidates) that are currently referring to it.

  2. When a new beam candidate is created and shares a physical block with an existing candidate, the reference count of that physical block is incremented.

  3. As the beam search progresses and candidates are discarded (e.g., candidates 0 and 3 in the example), the reference counts of the physical blocks associated with those candidates are decremented.

  4. When a physical block's reference count reaches zero, it means that no beam candidates are currently using that block, and it can be safely deallocated (e.g., blocks 2, 4, 5, and 8 in the example).

  5. When a new candidate needs to modify a shared physical block (e.g., when generating new tokens), vLLM applies the copy-on-write mechanism. It creates a new physical block, copies the data from the original block, and updates the reference counts accordingly.

The reference counting mechanism allows vLLM to efficiently manage the memory used by the beam candidates, as it enables the system to:

  • Share physical blocks among candidates when possible, reducing memory usage.

  • Keep track of when physical blocks are no longer needed and can be deallocated, preventing memory leaks.

  • Implement the copy-on-write mechanism, which allows candidates to modify shared blocks without affecting other candidates, while minimising the amount of memory copying required.

By using reference counting, vLLM can efficiently manage the memory resources during the beam search process, enabling faster execution and reducing the overall memory footprint of the serving system.
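The bookkeeping when a candidate is discarded can be sketched as below (illustrative only; the block ids are arbitrary):

```python
def discard_candidate(candidate_blocks, ref_counts, free_list):
    """Decrement reference counts for a discarded beam candidate's blocks and
    return blocks that are no longer referenced to the free pool (sketch)."""
    for block in candidate_blocks:
        ref_counts[block] -= 1
        if ref_counts[block] == 0:
            del ref_counts[block]
            free_list.append(block)

# Example: block 0 holds the shared prompt (still used by 3 other candidates);
# blocks 2 and 5 are private to the discarded candidate.
ref_counts = {0: 4, 2: 1, 5: 1}
free_list = []
discard_candidate([0, 2, 5], ref_counts, free_list)
print(ref_counts, free_list)   # {0: 3} [2, 5]
```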

Shared Prefix

In some applications, like machine translation, multiple input prompts may share a common prefix, such as a task description or examples. vLLM allows the LLM service provider to store the KV cache of the shared prefix in advance, reducing redundant computation.

The service provider can reserve a set of physical blocks for predefined shared prefixes.

When a user input prompt contains a shared prefix, vLLM maps the logical blocks of the prompt to the pre-computed physical blocks. This eliminates the need to recompute the KV cache for the shared prefix, saving computation time.
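One way to picture this is a small prefix cache keyed by the shared prefix text; the cache layout and the `compute_kv_blocks` helper below are assumptions for illustration, not vLLM's implementation.

```python
prefix_cache = {}   # shared prefix text -> pre-computed physical KV blocks

def blocks_for_prompt(prompt: str, shared_prefix: str, compute_kv_blocks):
    """Reuse pre-computed KV blocks when the prompt starts with a known prefix."""
    if prompt.startswith(shared_prefix):
        if shared_prefix not in prefix_cache:
            prefix_cache[shared_prefix] = compute_kv_blocks(shared_prefix)
        prefix_blocks = prefix_cache[shared_prefix]            # shared, read-only
        suffix_blocks = compute_kv_blocks(prompt[len(shared_prefix):])
        return prefix_blocks + suffix_blocks
    return compute_kv_blocks(prompt)
```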

Mixed Decoding Methods

vLLM's PagedAttention allows for the simultaneous processing of requests with different decoding preferences, such as parallel sampling, beam search, and shared prefixes.

This is achieved through a common mapping layer that translates logical blocks to physical blocks.

The LLM and its execution kernel work with the physical block IDs provided by the scheduler, without needing to handle the complex memory sharing patterns between sequences. This abstraction enables vLLM to efficiently batch requests with different decoding requirements, improving overall system throughput.

In all these mechanisms, the core concept of PagedAttention remains the same: dividing the KV cache into fixed-size blocks and mapping logical blocks to physical blocks. This allows for flexible memory management, efficient memory sharing, and reduced memory footprint.

By leveraging the ideas of copy-on-write, reference counting, and pre-computation, vLLM extends the benefits of PagedAttention to various decoding scenarios, enabling efficient memory utilisation and improved performance in LLM serving systems.

Application Example

Let's walk through a technical example of how vLLM and PagedAttention handle an incoming request to a Transformer-based language model.

Example: Processing a Request with vLLM and PagedAttention

Suppose a user sends a request to a Transformer-based LLM, such as GPT-3, with the following prompt: "The quick brown fox jumps over the lazy"

The LLM is tasked with generating the next word in the sentence. Let's assume the LLM generates the word "dog".

Step 1: Request Handling

  • The vLLM frontend, built with FastAPI, receives the request containing the prompt and the desired output length (in this case, 1 token).

  • The frontend passes the request to the vLLM engine, which is responsible for processing the request and generating the output.

Step 2: Prompt Processing

  • The vLLM engine tokenizes the prompt into individual tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]

  • vLLM allocates logical KV blocks to store the key and value vectors for the prompt tokens. In this example, let's assume a block size of 4 tokens.

    • Logical Block 0: ["The", "quick", "brown", "fox"]

    • Logical Block 1: ["jumps", "over", "the", "lazy"]

  • The logical blocks are mapped to physical KV blocks in the GPU memory. The block table maintains the mapping:

    • Logical Block 0 -> Physical Block 3

    • Logical Block 1 -> Physical Block 7

Step 3: Attention Computation

  • The PagedAttention algorithm is used to compute the attention scores and values.

  • For each token in the prompt, the query vector is multiplied with the key vectors in the corresponding KV block to compute the attention scores.

  • The attention scores are then used to compute the weighted sum of the value vectors, generating the context vector for each token.

Step 4: Token Generation

  • The LLM's decoder takes the context vectors and generates the next token, "dog", based on the learned language model.

Step 5: KV Cache Update

  • The key and value vectors for the generated token "dog" need to be added to the KV cache.

  • vLLM checks the block table and sees that Logical Block 1 is already full (it holds 4 tokens, the full block size).

  • A new Logical Block 2 is allocated for "dog", and the block table maps it to a free physical block (for example, Physical Block 1).

  • The block table is updated to reflect that Logical Block 2 contains 1 token.

Step 6: Output Generation

  • The generated token "dog" is returned to the vLLM frontend, which sends the response back to the user.

In this example, vLLM efficiently managed the KV cache using PagedAttention.

The prompt tokens were stored in logical KV blocks, which were mapped to physical blocks in the GPU memory.

When generating the new token, vLLM allocated a single additional block on demand for its key and value vectors, rather than reserving memory for the maximum sequence length up front.

If the user were to send a follow-up request with the prompt "The quick brown fox jumps over the lazy dog", vLLM would use the existing KV blocks and allocate new ones as needed, efficiently reusing the memory and reducing fragmentation.

By leveraging PagedAttention and the block table, vLLM can dynamically manage the KV cache memory, allocate memory only when needed, and share memory across requests and sequences, ultimately leading to improved performance and reduced memory footprint in LLM serving.
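The walkthrough can be replayed with a few lines of Python. The physical block ids are arbitrary; the only thing the sketch demonstrates is on-demand, block-granular allocation.

```python
BLOCK = 4  # tokens per KV block, as in the example above
prompt = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy"]

free_physical = [3, 7, 1, 5]   # pretend free list of physical blocks on the GPU
block_table = []               # [physical_block, num_filled] per logical block

def append_token(token):
    if not block_table or block_table[-1][1] == BLOCK:
        block_table.append([free_physical.pop(0), 0])   # allocate on demand
    block_table[-1][1] += 1

for tok in prompt:
    append_token(tok)
print(block_table)   # [[3, 4], [7, 4]]: logical 0 -> physical 3, logical 1 -> physical 7

append_token("dog")  # both prompt blocks are full, so a third block is allocated
print(block_table)   # [[3, 4], [7, 4], [1, 1]]
```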

Each block contains the key and value vectors for a fixed number of tokens, denoted as the KV block size B. By using fixed-size blocks, PagedAttention reduces internal fragmentation and enables more efficient memory management.

The key block K_j and value block V_j represent the j-th block of keys and values, respectively.

The key block K_j contains the key vectors for a contiguous range of tokens, from token index B(j-1)+1 to token index Bj, where B is the block size. Similarly, the value block V_j contains the value vectors for the same range of tokens.

For example, if the block size B is 4, then:

  • K_1 would contain the key vectors for tokens 1 to 4

  • K_2 would contain the key vectors for tokens 5 to 8

  • V_1 would contain the value vectors for tokens 1 to 4

  • V_2 would contain the value vectors for tokens 5 to 8

The key and value blocks are used in the attention computation to calculate the attention scores and outputs efficiently by processing the tokens in chunks of size B.

For each token (e.g., "forth"), the kernel multiplies its query vector q_i with the key vectors K_j in a block to compute the attention scores A_ij.

Then, the kernel multiplies the attention scores A_ij with the value vectors V_j in a block to derive the final attention output o_i.
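As a sketch of the block-wise computation (NumPy, with tiny illustrative dimensions and no attention mask), accumulating the softmax numerator and denominator block by block gives the same output as attending over one contiguous KV cache:

```python
import numpy as np

B, d = 4, 8                      # block size and head dimension (illustrative)
rng = np.random.default_rng(0)

# Three KV blocks that need not be contiguous in memory; each holds B tokens.
key_blocks   = [rng.standard_normal((B, d)) for _ in range(3)]
value_blocks = [rng.standard_normal((B, d)) for _ in range(3)]
q = rng.standard_normal(d)       # query vector q_i for the current token

num = np.zeros(d)                # running softmax numerator
den = 0.0                        # running softmax denominator
for K_j, V_j in zip(key_blocks, value_blocks):
    scores = np.exp(K_j @ q / np.sqrt(d))   # unnormalised A_ij for block j
    num += scores @ V_j
    den += scores.sum()
o = num / den                    # attention output o_i

# Reference: attention over the concatenated (contiguous) keys and values.
K, V = np.concatenate(key_blocks), np.concatenate(value_blocks)
w = np.exp(K @ q / np.sqrt(d)); w /= w.sum()
assert np.allclose(o, w @ V)
```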

Popular LLMs like GPT, OPT, and LLaMA are implemented using PyTorch and Transformers, and NCCL is used for tensor communication across distributed GPU workers.
Reference: Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv.org)
Figures in the original article: block table translation in vLLM; the PagedAttention algorithm with attention key and value vectors stored as non-contiguous blocks in memory; the vLLM system overview; a parallel sampling example; a beam search example; and a shared prompt example for machine translation.