Model Requirements

Determine the number of parameters in the model (P)

  • The model size is often expressed in billions (B) of parameters.

  • For example, a 7B model has 7 billion parameters.

Identify the data type used for the model parameters

  • Common data types include:

    • float (32-bit floating point): 4 bytes per parameter

    • half/BF16 (16-bit floating point): 2 bytes per parameter

    • int8 (8-bit integer): 1 byte per parameter

    • int4 (4-bit integer): 0.5 bytes per parameter

Calculate the storage size of the model (S)

  • Multiply the number of parameters (P) by the size of the data type.

  • For example, a 7B model using BF16 would have a storage size of: S = 7 billion * 2 bytes = 14 billion bytes ≈ 14 GB
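The storage calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: the names `BYTES_PER_PARAM` and `storage_size_gb` are made up for this example, and 1 GB is taken to be 10^9 bytes, matching the rounding used in the text.

```python
# Bytes per parameter for the data types listed above.
# The dictionary keys are illustrative labels, not library identifiers.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def storage_size_gb(num_params: float, dtype: str) -> float:
    """Return the storage size S in GB, assuming 1 GB = 10**9 bytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# A 7B model stored in BF16, as in the example above.
print(storage_size_gb(7e9, "bf16"))  # 14.0
```

Swapping the data type shows how quantization shrinks the footprint: the same 7B model at int4 comes out at 3.5 GB.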

Estimate the memory required for inference (M_inf)

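  • The model weights dominate inference memory, so it is approximately equal to the storage size (S). Activations and the key-value cache add to this, and they grow with batch size and context length.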

  • M_inf ≈ S

Estimate the memory required for training (M_train)

  • Training typically requires 3 to 4 times the memory needed for inference.

  • A conservative estimate is to multiply the inference memory (M_inf) by a factor of 4.

  • M_train ≈ M_inf * 4

  • For example, training a 7B model using float parameters would require: M_train ≈ 7 billion * 4 bytes * 4 = 112 GB
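As a rough sketch of this rule of thumb, the function below multiplies the inference memory by a configurable factor. The name `training_memory_gb` and its parameters are hypothetical, chosen only for this illustration, and 1 GB is again assumed to be 10^9 bytes.

```python
def training_memory_gb(num_params: float, bytes_per_param: float,
                       factor: float = 4.0) -> float:
    """Estimate M_train as M_inf * factor, in GB (1 GB = 10**9 bytes)."""
    inference_gb = num_params * bytes_per_param / 1e9  # M_inf is roughly S
    return inference_gb * factor


# A 7B model with 32-bit float parameters (4 bytes each).
print(training_memory_gb(7e9, 4))              # 112.0 (conservative factor of 4)
print(training_memory_gb(7e9, 4, factor=3.0))  # 84.0  (lower end of the range)
```

Keeping the factor as a parameter makes it easy to report the whole 3x to 4x range rather than a single number.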

Consider memory requirements for gradients and optimizer states

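  • During training, additional memory is needed for gradients and optimizer states.

  • Gradients hold one value per parameter, so they need P multiplied by the size of the gradient data type.

  • The memory required for optimizer states depends on the optimizer used:

    • AdamW optimizer: 2 * P values (a first and a second moment estimate per parameter)

    • SGD optimizer with momentum: P values (plain SGD without momentum keeps no extra state)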
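The gradient and optimizer-state terms can be sketched as follows. This is an illustrative estimate only: the function name, the multiplier table, and the assumption that gradients and optimizer states use the same number of bytes per value as the parameters are all choices made for this example.

```python
# Optimizer state values kept per parameter, following the list above.
OPTIMIZER_STATE_MULTIPLIER = {"adamw": 2, "sgd": 1}


def gradient_and_optimizer_gb(num_params: float, bytes_per_value: float,
                              optimizer: str) -> tuple:
    """Return (gradient_gb, optimizer_state_gb), with 1 GB = 10**9 bytes."""
    gradient_gb = num_params * bytes_per_value / 1e9
    optimizer_gb = (OPTIMIZER_STATE_MULTIPLIER[optimizer]
                    * num_params * bytes_per_value / 1e9)
    return gradient_gb, optimizer_gb


# A 7B model with 4-byte values, trained with AdamW.
print(gradient_and_optimizer_gb(7e9, 4, "adamw"))  # (28.0, 56.0)
```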

Adjust for additional memory overhead

  • Training may require additional memory for intermediate computations and data storage.

  • Add a safety margin of 10-20% to the estimated training memory (M_train).

Consider memory-efficient training techniques

  • Techniques like LoRA (Low-Rank Adaptation) and QLoRA can reduce memory requirements.

  • These techniques freeze the original model's weights and train only a small set of added low-rank adapter weights. QLoRA additionally keeps the frozen base model in a 4-bit quantized form.

  • The total memory used is the sum of the memory required for inference on the original model and the memory needed for training the adapter weights.
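To make the LoRA estimate concrete, here is a minimal sketch. A LoRA adapter on a weight matrix with shape (d_out, d_in) adds rank * (d_in + d_out) trainable parameters. Everything else in the example is an assumption made for illustration: the function names, the layer count, the hidden size, the number of adapted matrices, and the choice to apply the 4x training factor to the adapter weights only. Activation memory is not included.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (two low-rank matrices)."""
    return rank * (d_in + d_out)


def lora_memory_gb(base_params: float, base_bytes: float,
                   adapter_params: float, adapter_bytes: float = 4.0,
                   train_factor: float = 4.0) -> float:
    """Frozen base model (inference only) plus adapter training memory, in GB."""
    base_gb = base_params * base_bytes / 1e9
    adapter_gb = adapter_params * adapter_bytes * train_factor / 1e9
    return base_gb + adapter_gb


# Hypothetical 7B model: 32 layers, two 4096 x 4096 matrices adapted per layer,
# rank 8, base weights held in BF16 (2 bytes per parameter).
per_matrix = lora_trainable_params(4096, 4096, rank=8)
adapter_params = per_matrix * 2 * 32
print(per_matrix)                                       # 65536
print(adapter_params)                                   # 4194304
print(round(lora_memory_gb(7e9, 2, adapter_params), 2)) # 14.07
```

Under these assumptions the adapter adds well under 0.1 GB on top of the 14 GB base model, which is why the frozen weights dominate the estimate.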

Here's an example calculation for training a 13B model using float parameters:

  • Number of parameters (P) = 13 billion

  • Data type: float (4 bytes per parameter)

  • Storage size (S) = 13 billion * 4 bytes ≈ 52 GB

  • Inference memory (M_inf) ≈ 52 GB

  • Training memory (M_train) ≈ 52 GB * 4 = 208 GB

  • Additional memory for gradients = 13 billion * 4 bytes ≈ 52 GB

  • Additional memory for AdamW optimizer states = 2 * 13 billion * 4 bytes ≈ 104 GB

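  • Total estimated memory for training = 208 GB + 52 GB + 104 GB = 364 GB. Treat this as a conservative upper bound: the 4x factor is often taken to already cover gradients and optimizer states.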

  • With a 20% safety margin, the final estimate would be: 364 GB * 1.2 ≈ 437 GB
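The worked example can be reproduced with a short script. This sketch follows the steps exactly as listed above, including adding the gradient and optimizer terms on top of the 4x training estimate. The function name and its parameters are hypothetical, and 1 GB is assumed to be 10^9 bytes.

```python
def estimate_training_memory_gb(num_params: float, bytes_per_param: float,
                                train_factor: float = 4.0,
                                optimizer_states_per_param: int = 2,
                                margin: float = 0.20) -> dict:
    """Follow the steps above and return each intermediate figure in GB."""
    inference = num_params * bytes_per_param / 1e9      # M_inf, roughly S
    training = inference * train_factor                 # M_train
    gradients = inference                               # one value per parameter
    optimizer = optimizer_states_per_param * inference  # AdamW keeps 2 per parameter
    total = training + gradients + optimizer
    return {
        "inference": inference,
        "training": training,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": total,
        "with_margin": total * (1 + margin),
    }


# 13B model, 32-bit float parameters, AdamW, 20% safety margin.
result = estimate_training_memory_gb(13e9, 4)
print(result["total"])                  # 364.0
print(round(result["with_margin"], 1))  # 436.8
```

Setting `margin=0.10` gives the lower end of the suggested safety margin, and `optimizer_states_per_param=1` models SGD with momentum.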

This framework provides a rough estimate of the memory requirements for training LLMs based on storage usage. However, actual memory usage may vary depending on the specific model architecture, implementation details, and hardware characteristics. It's always a good idea to have some extra memory available to accommodate any additional overhead or unexpected memory usage during training.


Copyright Continuum Labs - 2023