
FP8 Formats for Deep Learning

This September 2022 paper is a collaborative work by researchers from NVIDIA, Arm, and Intel.

The authors propose an 8-bit floating-point (FP8) binary interchange format for deep learning training and inference, aiming to reduce computational requirements while maintaining result quality.

They demonstrate the effectiveness of the FP8 format on a wide range of image and language tasks, covering modern neural network architectures such as CNNs, RNNs, and Transformer-based models, and show that FP8 training matches the result quality of 16-bit training sessions without changing any hyperparameters.

The study includes large language models with up to 175 billion parameters.

The paper also examines FP8 post-training quantization of language models trained using 16-bit formats that resisted fixed-point int8 quantization.

FP8 is a natural progression for accelerating deep learning (DL) training beyond the 16-bit formats common in modern processors. DL applications use two 8-bit floating-point (FP8) binary interchange formats, both supported by the Hopper and Ada GPU architectures: E4M3 and E5M2. These types double math throughput and halve bandwidth pressure relative to 16-bit formats; however, their use requires some care because of their narrower range and lower precision.
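The two encodings trade range for precision differently: E4M3 (4 exponent bits, 3 mantissa bits) gives finer precision over a smaller range, while E5M2 (5 exponent bits, 2 mantissa bits) covers a wider range with coarser steps. The minimal sketch below, assuming PyTorch 2.1 or later (which exposes these encodings as torch.float8_e4m3fn and torch.float8_e5m2), prints their numeric properties.

```python
# Minimal sketch: inspect the numeric properties of the two FP8 encodings.
# Assumes PyTorch >= 2.1, which exposes them as torch.float8_e4m3fn / float8_e5m2.
import torch

for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# Expected output (values fixed by the formats themselves):
#   E4M3: max=448.0,   smallest normal=0.015625,        eps=0.125
#   E5M2: max=57344.0, smallest normal=6.103515625e-05, eps=0.25
```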

The presentation below is from NVIDIA's March 2023 GTC session:

NVIDIA Documentation on FP8.pdf (PDF, 1 MB)

Key points from the paper

  1. Reduced precision representation of numbers has been important for deep learning training and inference acceleration.

  2. Common floating-point types for training include IEEE single precision, TF32, IEEE half precision, and bfloat16.

  3. For inference, fixed-point int8 representation is popular, but it can encounter challenges in maintaining the required accuracy for some applications.

  4. The authors propose an FP8 binary format with two encodings: E4M3 and E5M2.

  5. The effectiveness of the FP8 format is demonstrated on various image and language tasks, covering modern neural network architectures.

  6. FP8 training matches FP16 or bfloat16 training results without changing any model or optimizer hyperparameters.

  7. The study includes the training of very large language models, up to 175B parameters.

  8. FP8 post-training quantization is examined for language models trained using 16-bit formats that resisted fixed-point int8 quantization.

The paper discusses several technical aspects of using FP8 formats in deep learning.

Precision of mathematical operations

  • When performing mathematical operations on FP8 inputs, the outputs are usually produced in a higher precision format, such as single-precision floating-point (FP32).

  • This is similar to how operations on 16-bit floating-point formats (FP16 and bfloat16) are handled in current CPUs, GPUs, and TPUs.

  • For example, matrix multiplication or dot-product instructions produce FP32 outputs, while simpler operations like nonlinearities or normalizations are performed after casting the FP8 inputs to FP32, as illustrated in the sketch below.
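The following is a simplified sketch of this compute flow, assuming PyTorch 2.1 or later for the FP8 dtypes. Real FP8 hardware (e.g. Hopper tensor cores) consumes the FP8 operands directly and accumulates in higher precision; here the explicit casts to FP32 stand in for that step.

```python
# Simplified emulation of the FP8 compute flow (assumes PyTorch >= 2.1).
import torch

a_fp8 = torch.randn(16, 32).to(torch.float8_e4m3fn)   # FP8 activations
w_fp8 = torch.randn(32, 64).to(torch.float8_e4m3fn)   # FP8 weights

# Matrix multiply: FP8 inputs, higher-precision (FP32) output.
# The casts emulate the tensor cores' internal higher-precision accumulation.
out_fp32 = a_fp8.to(torch.float32) @ w_fp8.to(torch.float32)

# Simpler operations such as nonlinearities run after casting to FP32.
act_fp32 = torch.nn.functional.gelu(out_fp32)
print(out_fp32.dtype, act_fp32.dtype)   # torch.float32 torch.float32
```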

Scaling factors

  • To better utilise the limited range of FP8 formats, higher-precision values need to be multiplied by a scaling factor before being cast to FP8.

  • This process is similar to the loss-scaling technique used in mixed-precision training with FP16, where gradients are scaled to fit within the FP16 range.

  • Some networks may require per-tensor scaling factors because the FP8 dynamic range is not sufficient to cover the entire range of important values across all tensors.

  • The general idea is to choose a scaling factor that brings the maximum magnitude in the tensor close to the maximum representable magnitude in the corresponding FP8 format.

  • Values that overflow are then saturated to the maximum representable value (see the sketch after this list).
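A minimal sketch of this scaled, saturating cast is given below, assuming PyTorch 2.1 or later; the helper name scale_and_cast_to_fp8 is hypothetical and introduced only for illustration. The scale maps the tensor's maximum magnitude (amax) to the maximum representable E4M3 value, and anything that would still overflow is clamped to that maximum.

```python
# Minimal sketch: per-tensor scaling with a saturating cast to FP8.
# Assumes PyTorch >= 2.1; scale_and_cast_to_fp8 is a hypothetical helper.
import torch

def scale_and_cast_to_fp8(x: torch.Tensor, dtype=torch.float8_e4m3fn):
    fp8_max = torch.finfo(dtype).max                 # 448.0 for E4M3
    amax = x.abs().max().clamp(min=1e-12)            # per-tensor max magnitude
    scale = fp8_max / amax                           # maps amax to fp8_max
    x_scaled = (x * scale).clamp(-fp8_max, fp8_max)  # saturate any overflow
    return x_scaled.to(dtype), scale                 # keep scale for unscaling

x = torch.randn(1024) * 1e-3            # values far below the FP8 range
x_fp8, scale = scale_and_cast_to_fp8(x)
print(x_fp8.dtype, scale)               # torch.float8_e4m3fn, large scale
```

The scale returned alongside the FP8 tensor is kept so that higher-precision outputs can later be unscaled, as described in the next subsection.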

Unscaling

  • After converting FP8 values back to a higher precision or after performing arithmetic instructions that produce a higher-precision output, the values need to be unscaled by multiplying them with the inverse of the scaling factor.

  • This requires only a minimal amount of additional arithmetic and is amortized over many multiply-accumulate operations with FP8 inputs, as shown in the sketch below.
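The sketch below continues the scaling example, assuming PyTorch 2.1 or later; the helper and scales are illustrative, and the explicit FP32 casts again stand in for the hardware's higher-precision accumulation.

```python
# Minimal sketch: unscaling a higher-precision matmul output (PyTorch >= 2.1).
import torch

fp8_max = torch.finfo(torch.float8_e4m3fn).max       # 448.0

def to_fp8(x):
    # Per-tensor scale chosen as in the previous sketch.
    scale = fp8_max / x.abs().max().clamp(min=1e-12)
    return (x * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn), scale

a, scale_a = to_fp8(torch.randn(16, 32) * 1e-3)
w, scale_w = to_fp8(torch.randn(32, 64) * 1e-3)

# FP8 operands, FP32 output (casts emulate higher-precision accumulation).
out_fp32 = a.to(torch.float32) @ w.to(torch.float32)

# Unscale: one multiply by the inverse of the combined scale, amortized
# over all the multiply-accumulates that produced out_fp32.
out = out_fp32 * (1.0 / (scale_a * scale_w))
print(out.dtype)   # torch.float32
```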

In summary, the technical aspects discussed in the paper focus on the precision of mathematical operations, the use of scaling factors, unscaling, type conversion, and the specific details of the FP8 formats.

These considerations are crucial for effectively using FP8 in deep learning while maintaining accuracy and performance.
