Calculating GPU memory for serving LLMs
How many GPUs do I need to serve the Llama 3 70 billion parameter model?
In order to answer that, you need to know how much GPU memory will be required by the model. The formula is:
M = (P * 4B) / (32 / Q) * 1.2

- M — the GPU memory, expressed in gigabytes (GB)
- P — the number of parameters in the model (in billions, so that M comes out in GB)
- 4B — 4 bytes, the size used for each parameter at full precision
- 32 — there are 32 bits in 4 bytes
- Q — the number of bits used for loading the model: 16 bits, 8 bits or 4 bits
- 1.2 — represents a 20% overhead for loading additional things into GPU memory
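In code, the formula might look like this minimal Python sketch (the function name and defaults are mine, not from any particular library):

```python
def serving_memory_gb(params_billion, load_bits, overhead=1.2):
    """M = (P * 4 bytes) / (32 / Q) * 1.2, with P in billions and M in GB."""
    return (params_billion * 4) / (32 / load_bits) * overhead

# A 7B model loaded in 16-bit precision:
print(round(serving_memory_gb(7, 16), 1))   # 16.8 GB
```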
Let's try it out for the Llama 3 70 billion parameter model, loaded in 16-bit precision:

M = (70 * 4 bytes) / (32 / 16) * 1.2 = 140 * 1.2 = 168 GB

To run this model, you would need two NVIDIA A100 GPUs with 80 GB of memory each. A single A100 80GB wouldn't be enough, but 2x A100 80GB covers the 168 GB needed to serve the Llama 3 70B model.

How to further reduce the GPU memory required for Llama 3 70B?
To calculate the GPU memory required to serve a model like Llama 3 70B at a lower precision such as FP8 (8-bit floating point), we can apply the same formula with the values adjusted to the new number format.

Let's restate the general formula first, and then apply it specifically to FP8:

M = (P * 4B) / (32 / Q) * 1.2

For FP8 precision:

- Bytes per parameter (b) is 1 byte
- Bit width used (Q) is 8 bits

The values to plug in for Llama 3 70B are:

- P = 70 billion parameters
- b = 1 byte per parameter (since 8 bits = 1 byte)
- 1.2, representing a 20% overhead
- Q = 8 bits
Substitute the values into the formula:

M = (70 * 4) / (32 / 8) * 1.2

Next:

M = 280 / 4 * 1.2

Next:

M = 70 * 1.2

Work out the memory requirement:

M = 84 GB

Therefore, the memory required to serve the Llama 3 70B model at FP8 precision is 84 GB, half of the 168 GB needed at 16-bit.
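To see how much precision alone buys you, the same formula can be evaluated at each bit width. A small illustrative loop (keeping the same 20% overhead):

```python
# Memory needed to load 70B parameters at different precisions
for bits in (16, 8, 4):
    gb = (70 * 4) / (32 / bits) * 1.2
    print(f"{bits}-bit: {gb:.0f} GB")
# 16-bit: 168 GB, 8-bit: 84 GB, 4-bit: 42 GB
```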
More generally, you can estimate memory requirements from the model size and the data type of its parameters. The model size is often expressed in billions (B) of parameters.
For example, a 7B model has 7 billion parameters.
Common data types include:
float (32-bit floating point): 4 bytes per parameter
half/BF16 (16-bit floating point): 2 bytes per parameter
int8 (8-bit integer): 1 byte per parameter
int4 (4-bit integer): 0.5 bytes per parameter
Multiply the number of parameters (P) by the size of the data type.
For example, a 7B model using BF16 would have a storage size of: S = 7 billion * 2 bytes = 14 billion bytes ≈ 14 GB
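As a sketch, this storage-size step is a lookup plus a multiplication (the dictionary keys are mine; only the bytes-per-parameter values come from the list above):

```python
BYTES_PER_PARAM = {"float32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def storage_gb(params_billion, dtype):
    # S = P * bytes per parameter (with P in billions, S comes out in GB)
    return params_billion * BYTES_PER_PARAM[dtype]

print(storage_gb(7, "bf16"))   # 14 -> about 14 GB, matching the example above
```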
The memory required for inference is approximately equal to the storage size (S).
M_inf ≈ S
Training typically requires 3 to 4 times the memory needed for inference.
A conservative estimate is to multiply the inference memory (M_inf) by a factor of 4.
M_train ≈ M_inf * 4
For example, training a 7B model using float parameters would require: M_train ≈ 7 billion * 4 bytes * 4 = 112 GB
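Continuing the sketch, inference memory is taken as the storage size and training memory as roughly four times that (the factor of 4 is the conservative rule of thumb above, not a measured value):

```python
def inference_memory_gb(storage_gb):
    # M_inf is roughly the storage size of the weights
    return storage_gb

def training_memory_gb(inference_gb, factor=4):
    # conservative rule of thumb: training needs ~3-4x inference memory
    return inference_gb * factor

weights = 7 * 4                                           # 7B model in float32 -> 28 GB
print(training_memory_gb(inference_memory_gb(weights)))   # 112 GB
```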
During training, additional memory is needed for gradients and optimizer states.
The memory required for gradients equals the number of parameters (P) multiplied by the bytes per parameter, since one gradient value is stored per parameter.
The memory required for optimizer states also scales with the number of parameters (times the bytes per parameter) and depends on the optimizer used:
AdamW optimizer: 2 * P
SGD optimizer: P
Adjust for additional memory overhead
Training may require additional memory for intermediate computations and data storage.
Add a safety margin of 10-20% to the estimated training memory (M_train).
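Putting these rules together, a rough end-to-end estimator might look like the following sketch (function and argument names are mine; it encodes the rules of thumb above, not measured figures):

```python
def full_training_estimate_gb(params_billion, dtype_bytes,
                              optimizer="adamw", margin=0.2):
    """Rough training-memory estimate: ~4x inference, plus gradients,
    plus optimizer states, plus a 10-20% safety margin."""
    storage = params_billion * dtype_bytes        # S
    train = storage * 4                           # M_train ~ 4 * M_inf
    gradients = params_billion * dtype_bytes      # one gradient value per parameter
    opt_states = {"adamw": 2, "sgd": 1}[optimizer] * params_billion * dtype_bytes
    return (train + gradients + opt_states) * (1 + margin)

print(round(full_training_estimate_gb(7, 4)))   # 235 -> ~235 GB for a 7B float32 model with AdamW
```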
Techniques like LoRA (Low-Rank Adaptation) and QLoRA can reduce memory requirements.
These techniques train a much smaller set of additional parameters (low-rank adapters) while the original model is kept frozen and only used for inference.
The total memory used is the sum of the memory required for inference on the original model and the memory needed for training the smaller model.
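For contrast with the full fine-tuning example below, a rough LoRA-style estimate could look like this; the adapter size and precisions are purely illustrative assumptions, and activation memory for the frozen base model is ignored:

```python
# Illustrative LoRA-style estimate (all numbers below are assumptions, not measurements)
base_inference = 13 * 2                        # 13B base model held in BF16 -> 26 GB
adapter_params = 0.05                          # assume ~50M trainable adapter parameters
s = adapter_params * 2                         # adapter storage in BF16 -> 0.1 GB
adapter_training = (s * 4 + s + 2 * s) * 1.2   # 4x rule + gradients + AdamW states + 20% margin
print(round(base_inference + adapter_training, 1))   # ~26.8 GB, versus ~437 GB for a full fine-tune
```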
Here's an example calculation for training a 13B model using float parameters:
Number of parameters (P) = 13 billion
Data type: float (4 bytes per parameter)
Storage size (S) = 13 billion * 4 bytes ≈ 52 GB
Inference memory (M_inf) ≈ 52 GB
Training memory (M_train) ≈ 52 GB * 4 = 208 GB
Additional memory for gradients = 13 billion * 4 bytes ≈ 52 GB
Additional memory for AdamW optimizer states = 2 * 13 billion * 4 bytes ≈ 104 GB
Total estimated memory for training = 208 GB + 52 GB + 104 GB = 364 GB
With a 20% safety margin, the final estimate would be: 364 GB * 1.2 ≈ 437 GB
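The same numbers fall out of a few lines of arithmetic, as a quick check of the example above (all values in GB):

```python
S = 13 * 4                   # storage: 13 billion parameters * 4 bytes -> 52 GB
m_train = S * 4              # training estimate (4x inference)         -> 208 GB
gradients = 13 * 4           # gradients                                -> 52 GB
adamw_states = 2 * 13 * 4    # AdamW optimizer states                   -> 104 GB
total = m_train + gradients + adamw_states
print(total, round(total * 1.2))   # 364 437 (total, then with a 20% safety margin)
```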
This framework provides a rough estimate of the memory requirements for training LLMs based on storage usage. However, actual memory usage may vary depending on the specific model architecture, implementation details, and hardware characteristics. It's always a good idea to have some extra memory available to accommodate any additional overhead or unexpected memory usage during training.