
Key Value Cache

Managing model memory usage

The KV (Key-Value) Cache in large language models (LLMs) plays an important role in the model's operation and its memory usage.

Function of the KV Cache

In Transformer models, the attention mechanism is a core component.

For each token being processed, the model generates key (K) and value (V) vectors based on the token's embedding and the weights of the model.

These K and V vectors are stored in the KV Cache.

The KV Cache is used to store the context from earlier tokens, which is necessary for the model to generate new output tokens in a sequence. Essentially, it holds the information that the model uses to understand the sequence it has processed so far and to predict the next token.
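The sketch below is a minimal, framework-agnostic illustration of this loop for a single attention head (NumPy, with made-up dimensions and hypothetical weight matrices W_q, W_k, W_v): each decode step projects the new token, appends its key and value to the cache, and attends over everything cached so far instead of recomputing earlier projections.

```python
import numpy as np

def attend(query, k_cache, v_cache):
    # Scaled dot-product attention of one query over all cached keys/values.
    scores = k_cache @ query / np.sqrt(query.shape[-1])   # shape: (tokens_so_far,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache                               # shape: (head_dim,)

def decode_step(token_embedding, W_q, W_k, W_v, k_cache, v_cache):
    # Project the new token into query/key/value space.
    q = token_embedding @ W_q
    k = token_embedding @ W_k
    v = token_embedding @ W_v

    # Append the new key/value to the KV Cache so earlier projections
    # never need to be recomputed on later steps.
    k_cache = np.vstack([k_cache, k[None, :]])
    v_cache = np.vstack([v_cache, v[None, :]])

    context = attend(q, k_cache, v_cache)
    return context, k_cache, v_cache

# Tiny usage example with made-up dimensions.
d_model, d_head = 8, 4
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
k_cache = np.empty((0, d_head))
v_cache = np.empty((0, d_head))
for _ in range(3):                          # three decode steps
    token = rng.standard_normal(d_model)
    context, k_cache, v_cache = decode_step(token, W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape)                        # (3, 4): one cached key per generated token
```

In a real model this happens per layer and per attention head, which is exactly why the cache grows as quickly as described below.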

Size and Growth of the KV Cache

The size of the KV Cache is substantial and grows with the number of tokens being processed.

For instance, in the 13 billion parameter OPT model, the KV Cache for a single token requires around 800 KB of space.

This figure follows from multiplying two vectors per layer (one key and one value) by the hidden state size (5120), the number of layers (40), and the size in bytes of each element (2 bytes at FP16 precision): 2 × 5120 × 40 × 2 bytes ≈ 800 KB.

Since the model supports sequences of up to 2048 tokens, the KV Cache for a single request can require approximately 1.6 GB of memory.
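The arithmetic can be sketched in a few lines; the helper name below is only for illustration, while the dimensions are OPT-13B's published ones (40 layers, hidden size 5120, FP16).

```python
# Back-of-the-envelope sizing of the KV Cache for OPT-13B.
def kv_cache_bytes_per_token(num_layers=40, hidden_size=5120, bytes_per_elem=2):
    # Factor of 2: one key vector and one value vector are stored per layer.
    return 2 * num_layers * hidden_size * bytes_per_elem

per_token = kv_cache_bytes_per_token()               # 819,200 bytes ≈ 800 KB
per_request = per_token * 2048                       # full 2048-token sequence
print(f"{per_token / 1024:.0f} KB per token")        # -> 800 KB
print(f"{per_request / 2**30:.2f} GiB per request")  # -> ~1.56 GiB, i.e. roughly 1.6 GB
```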

Memory Capacity Constraints

Given that contemporary GPUs typically have memory capacities in the tens of GBs, the space available for KV Caches is limited. If all available memory were allocated to the KV Cache, only a few tens of requests could be processed concurrently.

Inefficient memory management can amplify this issue, leading to even smaller batch sizes. This is because the space needed for the KV Cache must be pre-allocated and remains reserved for the entirety of a request, even if the actual sequence length turns out to be shorter than the maximum allocated length.
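As a rough illustration (the memory split below is an assumption for a 13B-parameter model on an 80 GB GPU, not a measurement), dividing the memory left after the model weights by the pre-allocated per-request KV Cache yields only a few tens of concurrent requests:

```python
# Illustrative capacity estimate; the figures below are assumptions.
gpu_memory_gib = 80            # e.g. an 80 GB A100/H100
weights_gib = 26               # ~13B parameters stored at FP16
kv_per_request_gib = 1.6       # worst case: full 2048-token sequence, pre-allocated

available_gib = gpu_memory_gib - weights_gib
max_concurrent = int(available_gib // kv_per_request_gib)
print(f"Roughly {max_concurrent} concurrent requests at most")   # a few tens
```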

Computational Speed vs. Memory Capacity

  • GPU computation speed is growing faster than memory capacity. For example, peak FLOPS (floating-point operations per second) more than doubled from NVIDIA's A100 to the H100, while the maximum GPU memory has remained at around 80 GB.

  • This disparity suggests that memory, particularly for the KV Cache in large language models, will become an increasingly significant bottleneck in LLM serving.

In summary, the KV Cache is a critical component for the functioning of large language models, but its substantial size and the way it is managed pose significant challenges in terms of memory usage and efficient model serving.

The growth in model complexity and size further exacerbates these challenges, making efficient memory management a key area of focus for LLM optimization.
