Calculating GPU memory for serving LLMs

How many GPUs do I need to serve the Llama3 70-billion-parameter model?

In order to answer that, you need to know how much GPU memory will be required by the model. The formula is:

M = \left(\frac{P \times 4\text{B}}{32/Q}\right) \times 1.2


  • M - the GPU memory, expressed in gigabytes (GB)

  • P - the number of parameters in the model

  • 4B - 4 bytes, the bytes used for each parameter

  • 32 - there are 32 bits in 4 bytes

  • Q - the number of bits that should be used for loading the model: 16, 8, or 4 bits

  • 1.2 - represents a 20% overhead of loading additional things in GPU memory
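As a sketch, the formula can be wrapped in a small helper (the function name and signature are our own, not from any library):

```python
def serving_memory_gb(params: float, q_bits: int, overhead: float = 1.2) -> float:
    """GPU memory (in GB) required to serve a model, per the formula above.

    params   -- P, the number of model parameters
    q_bits   -- Q, the bit width used to load the model (16, 8, or 4)
    overhead -- the 1.2 factor, i.e. 20% extra GPU memory
    """
    bytes_required = (params * 4) / (32 / q_bits)  # (P x 4B) / (32 / Q)
    return bytes_required * overhead / 1e9         # bytes -> gigabytes

# Llama 70B loaded in 16-bit precision:
print(round(serving_memory_gb(70e9, 16)))  # 168
```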

GPU memory required for serving Llama 70B

Let's try it out for the Llama 70-billion-parameter model, which we will load in 16-bit precision.

M = \left(\frac{70 \times 10^9 \times 4\ \text{bytes}}{32/16}\right) \times 1.2 = \left(140 \times 10^9\ \text{bytes}\right) \times 1.2 = 168\ \text{GB}

To run this model, you would require two NVIDIA A100 GPUs with 80GB of memory each.

A single A100 80GB wouldn't be enough, although 2x A100 80GB should be enough to serve the model.

How to further reduce the GPU memory required for Llama 70B?
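As a rough sanity check (our own sketch, which considers only the 140 GB of 16-bit weights and ignores how the 20% overhead is distributed across cards), the weights alone already require two 80GB cards:

```python
import math

WEIGHTS_GB = 140  # 70e9 parameters x 2 bytes each in 16-bit precision
A100_GB = 80      # memory per NVIDIA A100 80GB card

gpus_needed = math.ceil(WEIGHTS_GB / A100_GB)
print(gpus_needed)  # 2
```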

Using FP8 (8-bit floating-point)

To calculate the GPU memory requirements for serving a model like Llama3 with 70 billion parameters at a lower precision such as FP8 (8-bit floating-point), we adjust the formula to the new bit width.

Let's define a general formula first, and then apply it specifically for FP8. Since 4 bytes divided by (32/Q) equals Q/8 bytes, i.e. the number of bytes per parameter b, the formula simplifies to:

M = P \times b \times \text{overhead}

For the FP8 precision:

  • P = 70 \times 10^9 - 70 billion parameters

  • Q = 8 bits

  • b = 1 - byte per parameter (since 8 bits = 1 byte)

  • overhead = 1.2 - representing a 20% overhead

Substitute the values into the formula:

M = \left(\frac{70 \times 10^9 \times 4\ \text{bytes}}{32/8}\right) \times 1.2 = \left(70 \times 10^9\ \text{bytes}\right) \times 1.2 = 84\ \text{GB}

Therefore, the memory requirement for serving the Llama3 model with 70 billion parameters using FP8 precision is a much lower 84 GB.
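The same formula, applied across the three bit widths mentioned above, shows how memory scales with precision (a sketch; the helper name is our own):

```python
def serving_memory_gb(params, q_bits, overhead=1.2):
    # (P x 4 bytes) / (32 / Q) x overhead, converted to gigabytes
    return (params * 4) / (32 / q_bits) * overhead / 1e9

# Memory needed for a 70B-parameter model at 16-, 8-, and 4-bit precision:
for q in (16, 8, 4):
    print(f"{q:>2}-bit: {round(serving_memory_gb(70e9, q))} GB")
```

For 70 billion parameters this yields 168, 84, and 42 GB respectively, so each halving of precision halves the memory requirement.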


Copyright Continuum Labs - 2023