Mixed precision training
This February 2018 paper introduces a methodology for training deep neural networks using half-precision (FP16) floating point numbers without sacrificing model accuracy or requiring hyperparameter modifications.
The authors propose three key techniques to overcome challenges associated with the reduced precision format:
Maintaining an FP32 master copy of the weights that is updated with the weight gradients at each optimizer step. An FP16 copy, rounded from this master copy, is used for the forward and backward passes.
Loss scaling to preserve small gradient values that would otherwise be lost due to the limited range of FP16. Scaling the loss value prior to backpropagation shifts the gradients into the representable range; they are unscaled before the weight update so the update itself is unchanged.
Accumulating FP16 products into FP32 for arithmetic such as dot products and large reductions, while point-wise operations remain in FP16. This preserves fidelity in the network's most precision-sensitive calculations.
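Taken together, the three techniques change only a few lines of an ordinary training loop. The following is a minimal PyTorch sketch of such a loop written out by hand; the toy model, data, variable names and the constant loss scale of 1024 are illustrative placeholders rather than values from the paper, and a CUDA GPU is assumed since FP16 arithmetic is aimed at GPUs.

```python
import torch

# Toy setup (illustrative, not from the paper); FP16 math here assumes a CUDA GPU.
device = "cuda"
model = torch.nn.Linear(64, 1).to(device)

# 1) FP32 master copy of the weights: the optimizer updates these in full precision.
master_params = [p.detach().clone().float().requires_grad_(True) for p in model.parameters()]
optimizer = torch.optim.SGD(master_params, lr=1e-3)

# The working copy of the model is rounded to FP16 for the forward/backward passes.
model.half()

loss_scale = 1024.0  # 2) constant loss scale (an arbitrary example value)

for step in range(100):
    x = torch.randn(32, 64, device=device, dtype=torch.float16)
    y = torch.randn(32, 1, device=device, dtype=torch.float16)

    # Forward pass in FP16; 3) compute the loss reduction in FP32.
    loss = torch.nn.functional.mse_loss(model(x).float(), y.float())

    # Scale the loss before backprop so small gradients stay representable in FP16.
    (loss * loss_scale).backward()

    # Copy the FP16 gradients onto the FP32 master weights and unscale them.
    for mp, p in zip(master_params, model.parameters()):
        mp.grad = p.grad.float() / loss_scale
        p.grad = None

    # The weight update happens entirely in FP32 on the master copy.
    optimizer.step()
    optimizer.zero_grad()

    # Round the updated master weights back to FP16 for the next iteration.
    with torch.no_grad():
        for mp, p in zip(master_params, model.parameters()):
            p.copy_(mp.half())
```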
FP16 training matches FP32 accuracy with no hyperparameter tuning in most cases. Loss scaling is needed for some workloads, such as SSD object detection, machine translation, and language modelling, to preserve small gradient values (a sketch of choosing this scale dynamically follows these findings).
FP16 reduces memory consumption and arithmetic time compared to FP32, enabling 2-6x speedups on bandwidth-limited operations on Volta GPUs with Tensor Cores (a short example of how frameworks expose this today also appears below).
The FP32 master copy of the weights is crucial for convergence: without it, models such as DeepSpeech 2 Mandarin suffer an 80% relative accuracy loss.
The speech recognition experiments use the largest models trained in the paper, with up to 215M parameters. Interestingly, FP16 slightly outperforms FP32 (by 5-10%) on these tasks, possibly due to a regularization effect.
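Choosing the loss scale is left to the practitioner. A common way to avoid hand-tuning it, and the approach most mixed precision frameworks later adopted, is dynamic loss scaling: start with a large scale, skip the update and shrink the scale whenever gradients overflow to inf/NaN, and cautiously grow it again after a long run of clean steps. The sketch below illustrates that policy; the function name and the growth/backoff constants are arbitrary examples, not values from the paper.

```python
import math

def dynamic_loss_scale_step(grads, scale, good_steps,
                            growth=2.0, backoff=0.5, growth_interval=2000):
    """Decide whether to apply this step's update and how to adjust the loss scale.

    grads: the already-unscaled gradient values for this step (a flat list here,
    purely for illustration). Returns (apply_update, new_scale, new_good_steps).
    """
    overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
    if overflow:
        # FP16 overflowed somewhere: throw this step away and back off the scale.
        return False, scale * backoff, 0
    good_steps += 1
    if good_steps >= growth_interval:
        # A long run of clean steps: try a larger scale so small gradients survive.
        return True, scale * growth, 0
    return True, scale, good_steps

# Example: no overflow, so the update is applied and the scale is left unchanged.
apply_update, scale, good_steps = dynamic_loss_scale_step([1e-4, -3e-2], 1024.0, 0)
```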
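For context (not part of the paper), the same recipe is what modern automatic mixed precision (AMP) support in frameworks implements. A simplified sketch with PyTorch's torch.cuda.amp: autocast runs eligible operations in FP16 so matrix multiplies can hit Tensor Cores while the parameters and optimizer state stay in FP32, and GradScaler applies dynamic loss scaling. The model and data are again toy placeholders.

```python
import torch

# Toy model and data (illustrative); requires a CUDA GPU, ideally with Tensor Cores.
model = torch.nn.Linear(64, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling

for step in range(100):
    x = torch.randn(32, 64, device="cuda")
    y = torch.randn(32, 1, device="cuda")
    optimizer.zero_grad()

    # Eligible ops (e.g. the matmul inside Linear) run in FP16; parameters stay FP32.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)

    scaler.scale(loss).backward()  # scale the loss, backprop scaled gradients
    scaler.step(optimizer)         # unscale gradients; skip the update on overflow
    scaler.update()                # grow or shrink the loss scale dynamically
```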
Overall, this work demonstrates that reduced precision is a viable approach for accelerating DNN training across a variety of domains without compromising model quality.
It overcomes many of the pitfalls of previous attempts at FP16 training.
The techniques are straightforward to implement and exhibit promising results on modern tensor core hardware.
This sets the stage for wider adoption of reduced-precision training, making more efficient use of computational resources and potentially opening up new frontiers in deep learning research.