# Mixed precision training

This <mark style="color:blue;">February 2018</mark> paper introduces a methodology for training deep neural networks using half-precision (FP16) floating point numbers without sacrificing model accuracy or requiring hyperparameter modifications.&#x20;

{% embed url="https://arxiv.org/abs/1710.03740" %}
Mixed precision training
{% endembed %}

The authors propose three key techniques to overcome challenges associated with the reduced precision format:

<mark style="color:green;">Maintaining an FP32 master copy of weights</mark> that accumulates gradients after each optimizer step. This master copy is rounded to FP16 for forward and backward passes.

<mark style="color:green;">**Loss scaling to preserve small gradient value**</mark><mark style="color:green;">s</mark> that would otherwise be lost due to the limited range of FP16. Scaling the loss value prior to backpropagation shifts relevant gradients into the representable range.

<mark style="color:green;">**Accumulating FP16 products into FP32 for certain arithmetic operations**</mark> like dot products, while performing others in FP16. This maintains fidelity in crucial network calculations.

### <mark style="color:purple;">Key observations and results</mark>

* Mixed precision training matches FP32 accuracy with no hyperparameter tuning in most cases. Loss scaling is needed to preserve small gradients in some models, such as SSD detection and certain machine translation and language modelling networks.
* FP16 reduces memory consumption and arithmetic time compared to FP32, enabling 2-6x speedups on bandwidth-limited operations on Volta GPUs with tensor cores.
* The FP32 master copy of weights is crucial: without it, models such as the DeepSpeech 2 Mandarin system suffer an 80% relative accuracy loss.
* The speech recognition experiments involve the largest models trained, with up to 215M parameters. Interestingly, mixed precision slightly outperforms FP32 (5-10%) on these tasks, possibly due to a regularization effect.

Overall, this work demonstrates that reduced precision is a viable approach for accelerating DNN training across a variety of domains without compromising model quality.&#x20;

It overcomes many of the pitfalls of previous attempts at FP16 training.&#x20;

The techniques are straightforward to implement and exhibit promising results on modern tensor core hardware.&#x20;

This sets the stage for wider adoption of reduced precision training to make more efficient use of computational resources and potentially open up new frontiers in deep learning research.
