# Mixed precision training

This <mark style="color:blue;">February 2018</mark> paper introduces a methodology for training deep neural networks with half-precision (FP16) floating-point numbers without sacrificing model accuracy or requiring hyperparameter modifications.

{% embed url="<https://arxiv.org/abs/1710.03740>" %}
Mixed precision training
{% endembed %}

The authors propose three key techniques to overcome challenges associated with the reduced precision format:

<mark style="color:green;">**Maintaining an FP32 master copy of weights**</mark> that accumulates the gradient updates at each optimizer step. This master copy is rounded to FP16 for the forward and backward passes.
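A minimal sketch of why the master copy matters, assuming a single weight of 1.0 and a constant per-step update of 1e-4 (hypothetical values chosen for illustration, not taken from the paper). FP16 and FP32 rounding are emulated with Python's `struct` module rather than real GPU arithmetic, and the helper names `to_fp16`, `to_fp32`, and `train` are illustrative.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def train(steps: int, update: float, use_master_copy: bool) -> float:
    """Apply the same small update `steps` times and return the FP16 weight."""
    weight_fp16 = to_fp16(1.0)
    master_fp32 = to_fp32(1.0)
    for _ in range(steps):
        if use_master_copy:
            # Accumulate the update in FP32, then round to FP16 for compute.
            master_fp32 = to_fp32(master_fp32 + update)
            weight_fp16 = to_fp16(master_fp32)
        else:
            # Update the FP16 weight directly: the sum is rounded to FP16.
            weight_fp16 = to_fp16(weight_fp16 + update)
    return weight_fp16


# Assumed values: 1000 steps, update of 1e-4 per step.
fp16_only = train(1000, 1e-4, use_master_copy=False)
with_master = train(1000, 1e-4, use_master_copy=True)
print(fp16_only, with_master)
```

In this toy setting the FP16-only weight never moves, because 1e-4 is smaller than half the gap between 1.0 and the next representable FP16 value (2^-10). The weight derived from the FP32 master copy ends close to 1.1.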

<mark style="color:green;">**Loss scaling to preserve small gradient values**</mark> that would otherwise be lost due to the limited range of FP16. Scaling the loss value prior to backpropagation shifts relevant gradients into the representable range.
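The effect of loss scaling can be sketched with a single gradient value. The example below assumes a true gradient of 1e-8 and a loss scale of 1024; both numbers are hypothetical and chosen only to make the underflow visible. FP16 is emulated with Python's `struct` module, and `to_fp16` and `to_fp32` are illustrative helpers.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


LOSS_SCALE = 1024.0  # assumed scale factor, for illustration only
true_grad = 1e-8     # assumed gradient magnitude, below the FP16 range

# Without scaling, the gradient underflows to zero when stored in FP16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# so the FP16 backward pass would see true_grad * LOSS_SCALE instead.
scaled_fp16 = to_fp16(true_grad * LOSS_SCALE)

# Unscale in FP32 before the weight update.
recovered = to_fp32(to_fp32(scaled_fp16) / LOSS_SCALE)

print(unscaled_fp16, scaled_fp16, recovered)
```

The unscaled gradient becomes exactly zero in FP16, while the scaled gradient survives and is recovered to within about one percent after unscaling in FP32.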

<mark style="color:green;">**Accumulating FP16 products into FP32 for certain arithmetic operations**</mark> like dot products, while performing others in FP16. This maintains fidelity in crucial network calculations.
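A small sketch of why the accumulator precision matters for a dot product, assuming two hypothetical vectors of 4096 elements that all equal 0.1 (rounded to FP16). Rounding is emulated with Python's `struct` module instead of tensor core hardware, and the `dot` helper with its `round_acc` parameter is an illustrative name rather than any library's API.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(a, b, round_acc):
    """Dot product whose running sum is rounded by `round_acc` after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + x * y)
    return acc


# Assumed inputs: FP16 vectors of 4096 elements, all approximately 0.1.
n = 4096
a = [to_fp16(0.1)] * n
b = [to_fp16(0.1)] * n

dot_fp16_acc = dot(a, b, to_fp16)  # FP16 products, FP16 accumulator
dot_fp32_acc = dot(a, b, to_fp32)  # FP16 products, FP32 accumulator
reference = sum(x * y for x, y in zip(a, b))  # double-precision reference

print(dot_fp16_acc, dot_fp32_acc, reference)
```

With an FP16 accumulator the running sum stops growing once each product falls below half the spacing between neighbouring FP16 values, so the result ends well short of the reference. The FP32 accumulator stays close to it.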

### <mark style="color:purple;">Key observations and results</mark>

* FP16 training matches FP32 accuracy with no hyperparameter tuning in most cases. Loss scaling is needed for some models like SSD, machine translation, and language modelling to preserve small gradients.
* FP16 reduces memory consumption and arithmetic time compared to FP32, enabling 2-6x speedups on Volta GPUs with tensor cores for operations limited by memory or arithmetic bandwidth.
* The FP32 master copy of weights is crucial for convergence: without it, models like DeepSpeech 2 Mandarin suffer an 80% relative accuracy loss.
* The speech recognition models are the largest trained with this technique, with up to 215M parameters. Interestingly, FP16 slightly outperforms FP32 (by 5-10%) on these tasks, possibly due to a regularization effect.

Overall, this work demonstrates that reduced precision is a viable approach for accelerating DNN training across a variety of domains without compromising model quality.

It overcomes many of the pitfalls of previous attempts at FP16 training.

The techniques are straightforward to implement and exhibit promising results on modern tensor core hardware.

This sets the stage for wider adoption of reduced precision training to make more efficient use of computational resources and potentially open up new frontiers in deep learning research.


