Model Requirements
Determine the number of parameters in the model (P)
The model size is often expressed in billions (B) of parameters.
For example, a 7B model has 7 billion parameters.
Identify the data type used for the model parameters
Common data types include:
float (32-bit floating point): 4 bytes per parameter
half/BF16 (16-bit floating point): 2 bytes per parameter
int8 (8-bit integer): 1 byte per parameter
int4 (4-bit integer): 0.5 bytes per parameter
Calculate the storage size of the model (S)
Multiply the number of parameters (P) by the size of the data type.
For example, a 7B model using BF16 would have a storage size of: S = 7 billion * 2 bytes = 14 billion bytes ≈ 14 GB
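As a quick sanity check, here is a minimal Python sketch of this step. The byte sizes simply mirror the list above; the dictionary and function names are illustrative placeholders, not part of any library.

```python
# Bytes per parameter for the common data types listed above.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bf16": 2.0,   # also applies to fp16/half
    "int8": 1.0,
    "int4": 0.5,
}

def storage_size_gb(num_params: float, dtype: str) -> float:
    """Rough model storage size in GB (1 GB taken as 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# 7B model in BF16: 7e9 params * 2 bytes ≈ 14 GB
print(storage_size_gb(7e9, "bf16"))  # -> 14.0
```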
Estimate the memory required for inference (M_inf)
The memory required to load the model for inference is approximately equal to the storage size (S); activations and the KV cache add some memory on top, but S is a reasonable baseline.
M_inf ≈ S
Estimate the memory required for training (M_train)
Training typically requires 3 to 4 times the memory needed for inference.
A conservative estimate is to multiply the inference memory (M_inf) by a factor of 4.
M_train ≈ M_inf * 4
For example, training a 7B model using float parameters would require: M_train ≈ 7 billion * 4 bytes * 4 = 112 GB
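A small sketch of the inference and training estimates follows; the factor of 4 is the conservative multiplier described above, and the function names are placeholders of my own.

```python
def inference_memory_gb(storage_gb: float) -> float:
    """Inference memory is taken to be roughly the storage size (M_inf ≈ S)."""
    return storage_gb

def training_memory_gb(inference_gb: float, factor: float = 4.0) -> float:
    """Training memory estimated as a multiple (3-4x) of inference memory."""
    return inference_gb * factor

# 7B model stored as float32: 7e9 * 4 bytes = 28 GB, times 4 = 112 GB
s = 7e9 * 4 / 1e9
print(training_memory_gb(inference_memory_gb(s)))  # -> 112.0
```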
Consider memory requirements for gradients and optimizer states
During training, additional memory is needed for gradients and optimizer states.
Gradients take roughly the same amount of memory as the parameters themselves: P × bytes per parameter.
The memory required for optimizer states depends on the optimizer used:
AdamW optimizer: 2 × P × bytes per parameter (it stores first- and second-moment estimates for every parameter)
SGD with momentum: P × bytes per parameter (a single momentum buffer per parameter)
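The same bookkeeping in code: the optimizer multipliers (2 for AdamW, 1 for SGD with momentum) follow the list above, and the dictionary and helper names are assumptions made for illustration.

```python
# Optimizer state size expressed as a multiple of the parameter memory.
OPTIMIZER_STATE_MULTIPLIER = {
    "adamw": 2.0,  # first and second moments per parameter
    "sgd": 1.0,    # one momentum buffer per parameter
}

def gradient_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Gradients take about the same memory as the parameters themselves."""
    return num_params * bytes_per_param / 1e9

def optimizer_memory_gb(num_params: float, bytes_per_param: float, optimizer: str) -> float:
    """Optimizer state memory as a multiple of the parameter memory."""
    return gradient_memory_gb(num_params, bytes_per_param) * OPTIMIZER_STATE_MULTIPLIER[optimizer]

# 7B model in float32 with AdamW: 28 GB of gradients + 56 GB of optimizer states
print(gradient_memory_gb(7e9, 4), optimizer_memory_gb(7e9, 4, "adamw"))
```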
Adjust for additional memory overhead
Training may require additional memory for intermediate computations and data storage.
Add a safety margin of 10-20% to the estimated training memory (M_train).
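Applying the overhead margin is a one-liner; the 20% default here is simply the upper end of the suggested range.

```python
def with_overhead_gb(memory_gb: float, margin: float = 0.20) -> float:
    """Add a 10-20% safety margin for intermediate computations and buffers."""
    return memory_gb * (1.0 + margin)

# 112 GB of estimated training memory with a 20% margin
print(with_overhead_gb(112.0))  # -> 134.4
```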
Consider memory-efficient training techniques
Techniques like LoRA (Low-Rank Adaptation) and QLoRA can reduce memory requirements.
These techniques freeze the original model and train only a small set of additional parameters (low-rank adapter matrices); QLoRA additionally quantizes the frozen base model.
The total memory used is roughly the memory required for inference on the frozen original model plus the memory needed to train the much smaller set of adapter parameters, as sketched below.
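The sketch below is a rough illustration of that split: it treats the total as base-model inference memory plus full training overhead for the adapter parameters. The adapter parameter count and the example numbers are assumptions (they depend on the LoRA rank and model architecture), and the function is purely hypothetical.

```python
def lora_training_memory_gb(
    base_params: float,
    base_bytes_per_param: float,
    adapter_params: float,
    adapter_bytes_per_param: float = 4.0,  # adapters are typically kept in higher precision
    training_factor: float = 4.0,
) -> float:
    """Frozen base model contributes inference memory only; training overhead applies to the adapters."""
    base_inference = base_params * base_bytes_per_param / 1e9
    adapter_training = adapter_params * adapter_bytes_per_param / 1e9 * training_factor
    return base_inference + adapter_training

# Hypothetical QLoRA-style setup: 7B base model quantized to int4 (0.5 bytes/param)
# with ~40M trainable adapter parameters -> roughly 3.5 GB + 0.64 GB
print(lora_training_memory_gb(7e9, 0.5, 40e6))  # -> ~4.14
```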
Here's an example calculation for training a 13B model using float parameters:
Number of parameters (P) = 13 billion
Data type: float (4 bytes per parameter)
Storage size (S) = 13 billion * 4 bytes ≈ 52 GB
Inference memory (M_inf) ≈ 52 GB
Training memory (M_train) ≈ 52 GB * 4 = 208 GB
Additional memory for gradients = 13 billion * 4 bytes ≈ 52 GB
Additional memory for AdamW optimizer states = 2 * 13 billion * 4 bytes ≈ 104 GB
Total estimated memory for training = 208 GB + 52 GB + 104 GB = 364 GB
With a 20% safety margin, the final estimate would be: 364 GB * 1.2 ≈ 437 GB
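Putting the pieces together, this self-contained sketch reproduces the 13B float32 example above, step by step.

```python
P = 13e9        # number of parameters
BYTES = 4.0     # float32: 4 bytes per parameter
GB = 1e9

storage = P * BYTES / GB              # S ≈ 52 GB
inference = storage                   # M_inf ≈ S
training = inference * 4              # M_train ≈ 208 GB
gradients = P * BYTES / GB            # ≈ 52 GB
adamw_states = 2 * P * BYTES / GB     # ≈ 104 GB
total = training + gradients + adamw_states   # ≈ 364 GB
with_margin = total * 1.2             # ≈ 437 GB

print(storage, training, gradients, adamw_states, total, with_margin)
```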
This framework provides a rough estimate of the memory requirements for training LLMs based on storage usage. However, actual memory usage may vary depending on the specific model architecture, implementation details, and hardware characteristics. It's always a good idea to have some extra memory available to accommodate any additional overhead or unexpected memory usage during training.