Calculating GPU memory for serving LLMs

How many GPUs do I need to be able to serve the Llama 3 70 billion parameter model?

In order to answer that, you need to know how much GPU memory will be required by the model. The formula is:

M = \left(\frac{P \times 4\text{B}}{32/Q}\right) \times 1.2
| Symbol | Description |
| --- | --- |
| $M$ | GPU memory, expressed in gigabytes (GB) |
| $P$ | The number of parameters in the model |
| $4B$ | 4 bytes, the bytes used for each parameter |
| $32$ | There are 32 bits in 4 bytes |
| $Q$ | The number of bits used for loading the model: 16, 8, or 4 bits |
| $1.2$ | Represents a 20% overhead of loading additional things in GPU memory |
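As a quick sanity check, this formula is easy to express in code. The sketch below is a minimal Python helper (the function and argument names are illustrative, not from any library) that follows the formula exactly, including its convention of counting 10^9 bytes as 1 GB:

```python
def estimate_serving_memory_gb(params: float, q_bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to serve a model.

    params:   number of model parameters, e.g. 70e9 for a 70B model
    q_bits:   precision the model is loaded in (16, 8, or 4 bits)
    overhead: multiplier for extra GPU memory (1.2 = 20% overhead)
    """
    fp32_bytes = 4  # 4 bytes per parameter at 32-bit precision
    m_bytes = (params * fp32_bytes) / (32 / q_bits) * overhead
    return m_bytes / 1e9  # the formula counts 10^9 bytes as 1 GB
```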

GPU memory required for serving Llama 70B

Let's try it out for the Llama 70 billion parameter model, which we will load in 16-bit format.

M = \left(\frac{70 \times 10^9 \times 4 \text{ bytes}}{32/16}\right) \times 1.2 = \left(140 \times 10^9 \text{ bytes}\right) \times 1.2 = 168 \text{ GB}
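Plugging the numbers into the helper sketched above reproduces this result:

```python
memory_gb = estimate_serving_memory_gb(params=70e9, q_bits=16)
print(f"{memory_gb:.0f} GB")  # prints "168 GB"
```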

To run this model, you would require two NVIDIA A100 GPUs with 80 GB of memory each.

A single A100 80GB wouldn't be enough, although 2x A100 80GB should be enough to serve the Llama 3 70B model: the weights themselves occupy 140 GB of the combined 160 GB, leaving only modest headroom for the 20% overhead.

How to further reduce the GPU memory required for Llama 3 70B?

Using FP8 (8-bit floating-point)

To calculate the GPU memory requirements for serving a model like Llama 3 with 70 billion parameters at a lower precision such as FP8 (8-bit floating-point), we need to adjust our formula to work in bytes per parameter.

Let's define a general formula first, and then apply it specifically for FP8:

M = P \times b \times \text{overhead}

Since $b = Q/8$ bytes per parameter, this is equivalent to the original formula: dividing 4 bytes by $32/Q$ gives $Q/8$ bytes per parameter.

For the FP8 precision:

  • $P = 70 \times 10^9$ (70 billion parameters)

  • $Q$ = 8 bits

  • $b = 1$ byte per parameter (since 8 bits = 1 byte)

  • $\text{overhead} = 1.2$ (representing a 20% overhead)

Substitute the values into the formula:

M = \left(\frac{70 \times 10^9 \times 4 \text{ bytes}}{32/8}\right) \times 1.2

Next, simplify the denominator (32/8 = 4):

M = \left(\frac{70 \times 10^9 \times 4 \text{ bytes}}{4}\right) \times 1.2

Next, cancel the factor of 4, leaving 1 byte per parameter:

M = \left(70 \times 10^9 \text{ bytes}\right) \times 1.2

Work out the memory requirement:

M = 84 \times 10^9 \text{ bytes} = 84 \text{ GB}

Therefore, the memory requirement for serving the Llama 3 model with 70 billion parameters using FP8 precision is a much lower 84 GB, half the 168 GB needed at 16-bit precision.
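Evaluating the helper sketched earlier at each supported precision makes the effect of quantization easy to compare:

```python
for q_bits in (16, 8, 4):
    memory_gb = estimate_serving_memory_gb(params=70e9, q_bits=q_bits)
    print(f"{q_bits:>2}-bit: {memory_gb:.0f} GB")
# Output:
# 16-bit: 168 GB
#  8-bit: 84 GB
#  4-bit: 42 GB
```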
