# Floating Point Numbers

Floating point numbers are a way for computers to approximately represent real numbers.

They allow a wide range of values to be stored, from very small to very large numbers, but with limited precision.

This is a tradeoff - floating point sacrifices exactness for speed, efficiency, and the ability to handle numbers across many orders of magnitude.

**The basic idea is similar to scientific notation.**

Just like how you can write very big or small numbers like 6.022 x 10^23 or 1.67 x 10^-27, floating point represents numbers as a mantissa multiplied by 2 raised to an exponent.

The mantissa holds the significant digits while the exponent indicates where the binary point should be placed relative to those digits.

The mantissa is also known as the significand.

In the IEEE 754 standard, the stored mantissa bits are the fractional part that comes after the implied leading 1. So if we have a binary number like 1.01011, the mantissa bits would be 01011.
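To make the scientific-notation analogy concrete, here is a small sketch that splits a number into a significand in the range [1, 2) and a power of two. It relies on Python's `math.frexp`, which returns a mantissa in [0.5, 1), so the result is rescaled by one bit. The helper name `to_binary_scientific` is an illustrative choice, not something from a library.

```python
import math

def to_binary_scientific(x: float) -> tuple[float, int]:
    """Return (significand, exponent) with x == significand * 2**exponent
    and 1 <= |significand| < 2 (for non-zero, finite x)."""
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1    # shift by one bit so the leading binary digit is 1

significand, exponent = to_binary_scientific(10.0)
print(significand, exponent)  # 1.25 3, i.e. 1.01 (binary) x 2^3
```

Here 1.25 is 1.01 in binary, so the bits after the leading 1 would be 01 followed by zeros.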

In the standard IEEE 754 32-bit floating point format:

The first bit is the sign bit (0 for positive, 1 for negative)

The next 8 bits are the exponent

The final 23 bits are the mantissa

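The exponent is stored with a bias of 127: the stored value minus 127 gives the actual power of 2. This allows it to represent both positive and negative powers of 2, from -126 to 127 for normal numbers (the all-0s and all-1s exponent patterns are reserved for the special values below). The mantissa bits are the fractional part after an implied leading 1 bit, so the significand has the form 1.xxx...x, where the x's are the stored 23 bits.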
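Putting the bias and the implied leading 1 together, a normal FP32 value equals (-1)^sign x 1.mantissa x 2^(exponent - 127). The following is a minimal sketch of that formula, assuming the 1/8/23 bit layout described above. The function name `decode_normal_float32` is hypothetical, and it deliberately rejects the special exponent patterns.

```python
import struct

def decode_normal_float32(bits: int) -> float:
    """Evaluate (-1)^sign * 1.fraction * 2^(exponent - 127) for a normal FP32 bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 stored (biased) exponent bits
    fraction = bits & 0x7FFFFF       # 23 stored mantissa bits
    if exponent in (0, 0xFF):
        raise ValueError("zero, subnormal, infinity and NaN are not handled here")
    significand = 1 + fraction / 2**23           # restore the implied leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

# Reinterpret the FP32 encoding of 0.15625 as a 32-bit unsigned integer.
bits = struct.unpack(">I", struct.pack(">f", 0.15625))[0]
print(decode_normal_float32(bits))  # 0.15625, i.e. 1.25 * 2**-3
```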

Some special values:

If the exponent is all 0s, and mantissa is 0, the number is 0

If the exponent is all 0s but mantissa is non-zero, it's a subnormal number very close to 0

If the exponent is all 1s and mantissa is 0, the value is infinity (positive or negative)

If the exponent is all 1s and mantissa is non-zero, the value is NaN (Not a Number)
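The four rules above translate almost directly into code. This is a sketch only: `classify_float32` and `float32_bits` are hypothetical helper names, and the bit patterns are obtained with the standard `struct` module.

```python
import struct

def float32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def classify_float32(bits: int) -> str:
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:                 # exponent bits all 0s
        return "zero" if mantissa == 0 else "subnormal"
    if exponent == 0xFF:              # exponent bits all 1s
        return "infinity" if mantissa == 0 else "NaN"
    return "normal"

for x in (0.0, -0.0, 1e-40, 1.5, float("inf"), float("nan")):
    print(x, classify_float32(float32_bits(x)))
```

Note that the sign bit plays no role in the classification, which is why both 0.0 and -0.0 come out as zero.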

Floating point allows large dynamic range but not infinite precision.

Adding more bits to the format increases precision, but there are always some numbers that can't be exactly represented, like how 1/3 stored in decimal is always an approximation.

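So in binary floating point, 0.1 + 0.2 does not exactly equal 0.3: none of 1/10, 2/10 and 3/10 has a finite binary expansion, so each is rounded when stored, and the rounding errors do not cancel out.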

The results are very close to the true value but not always exact, limited by the precision of the format. This can lead to surprising behaviours in calculations sometimes.
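One way to see the approximation directly is to ask Python for the exact value it stores for 0.1. Python's `float` is a 64-bit double rather than the 32-bit format described above, but the principle is the same. The sketch below uses only the standard `fractions` and `decimal` modules.

```python
from decimal import Decimal
from fractions import Fraction

stored = Fraction(0.1)              # exact value of the double nearest to 0.1
print(stored == Fraction(1, 10))    # False: the stored value is not exactly 1/10
print(stored.denominator == 2**55)  # True: it is an integer divided by a power of two
print(Decimal(0.1))                 # 0.1000000000000000055511151231257827021181583404541015625
```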

Overall, floating point is a clever way to balance dynamic range, precision, speed and efficiency in storing real numbers. The vast majority of the time it works great, but it's an approximation - floating point numbers aren't exactly the same as mathematical real numbers. Understanding the tradeoffs and limitations is important for using them effectively.

Here's a fun little Python script that demonstrates how a floating-point number is constructed from its parts:
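A minimal sketch of such a script, assuming the FP32 layout described earlier. The helper name `build_float32` and the example field values are illustrative choices.

```python
import struct

def build_float32(sign: int, exponent: int, mantissa: int) -> float:
    """Pack the three FP32 fields into 32 bits and reinterpret them as a float."""
    bits = (sign << 31) | (exponent << 23) | mantissa
    return struct.unpack(">f", struct.pack(">I", bits))[0]

sign = 0            # positive
exponent = 129      # stored (biased) exponent: 129 - 127 = 2
mantissa = 1 << 21  # fraction bits 0100...0, i.e. 0.25, so the significand is 1.25

value = build_float32(sign, exponent, mantissa)
print(value)        # 5.0, because 1.25 * 2**2 == 5.0
# Binary representation, printed as sign, exponent and mantissa fields.
print(f"{sign:01b} {exponent:08b} {mantissa:023b}")
```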

In this script, we manually construct a floating-point number by specifying the sign, exponent, and mantissa.

We then convert this to an actual float and print out its value and binary representation.

Now let's have some fun with floating-point precision!
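A short sketch of the usual demonstration, using Python's built-in 64-bit floats. The `math.isclose` comparison at the end is one common way to test for approximate equality and is an addition for illustration.

```python
import math

total = 0.1 + 0.2
print(total)                     # 0.30000000000000004
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True: equal within a small relative tolerance
```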

This classic example demonstrates how floating-point numbers can sometimes yield surprising results due to their limited precision.

Here's another fun one:
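A sketch of this kind of loop, assuming it repeatedly adds 0.1 to `i` and stops only when `i` compares equal to the target. The safety cap `max_steps` and the function name `steps_until_equal` are additions so that the example terminates when run; without the cap, a loop whose target is never hit would not stop.

```python
def steps_until_equal(target: float, step: float = 0.1, max_steps: int = 1000) -> int:
    """Add `step` to a running total until it equals `target`.

    Returns the number of additions performed, or `max_steps` if the total
    never compared equal to `target` within that many additions.
    """
    i = 0.0
    steps = 0
    while i != target and steps < max_steps:
        i += step
        steps += 1
    return steps

print(steps_until_equal(0.5))   # 5: the running total happens to land on 0.5 exactly
print(steps_until_equal(1.0))   # 1000: the total passes 0.9999999999999999 and skips 1.0
print(steps_until_equal(10.0))  # the case discussed in the text
```

A more robust loop condition would be `i < target`, or counting with integers and multiplying by the step size.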

You might expect this loop, which keeps adding 0.1 to `i` until `i` equals 10, to stop after 100 iterations, but it actually runs forever!

This is because 0.1 cannot be exactly represented as a floating-point number, so each addition introduces a tiny error.

These errors accumulate, so the value of `i` never exactly equals 10.

Floating-point numbers are a fascinating topic, and there's a lot more to explore! I hope these examples have given you a fun introduction to how they work under the hood.

### Floating Point Numbers in Deep Learning

In deep learning, floating-point numbers are ubiquitous.

They're used to represent weights, biases, inputs, outputs, and intermediate values in neural networks.

The most common floating-point formats in deep learning are 32-bit single-precision (FP32) and 16-bit half-precision (FP16).

When a floating-point number is stored in memory or in a register, the mantissa is stored in the least significant bits.

For example, in FP32, the mantissa is stored in bits 0-22, while the exponent is stored in bits 23-30, and the sign bit is stored in bit 31.

#### Here's a visual representation
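One way to visualise the layout is to print the 32 bits of a value split into its three fields. This is a sketch using the standard `struct` module; the helper name `float32_layout` is hypothetical.

```python
import struct

def float32_layout(x: float) -> str:
    """Return the FP32 bits of x as 'sign | exponent | mantissa'."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    b = f"{bits:032b}"                     # b[0] is bit 31, b[31] is bit 0
    return f"{b[0]} | {b[1:9]} | {b[9:]}"

print("sign (bit 31) | exponent (bits 30-23) | mantissa (bits 22-0)")
print(float32_layout(1.0))   # 0 | 01111111 | 00000000000000000000000
print(float32_layout(-2.5))
```

For 1.0 the stored exponent is 127 (actual exponent 0) and every mantissa bit is zero, because the leading 1 is implied rather than stored.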

In practice, deep learning frameworks and hardware accelerators (like GPUs and TPUs) handle the storage and manipulation of these floating-point numbers behind the scenes.

As a deep learning practitioner, you typically work with higher-level abstractions like tensors, which are multi-dimensional arrays of floating-point numbers.

However, understanding how floating-point numbers are represented can be important for certain aspects of deep learning, such as:

**Model quantization:** This is a technique where the weights and activations of a neural network are converted from FP32 to a lower-precision format like FP16 or INT8 to reduce memory usage and computational cost. Knowing how the mantissa and exponent are stored can help you understand the tradeoffs involved.

**Gradient scaling:** During training, the gradients can sometimes become very small, leading to underflow in FP16. To combat this, techniques like gradient scaling are used, which involve multiplying the loss (and therefore the gradients) by a scale factor so that the gradient values stay within the range FP16's exponent can represent. The gradients are divided by the same factor again before the weights are updated.

**Mixed precision training:** This is a technique where certain parts of the model (like the master weights) are kept in FP32, while other parts (like the activations and gradients) are computed in FP16.

Understanding how the mantissa and exponent are stored can help you decide which parts of the model can be safely computed in lower precision.
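The underflow problem behind gradient scaling can be sketched without a deep learning framework. The example below emulates FP16 rounding with the standard `struct` module's half-precision format code `"e"`. The gradient value and the scale factor of 1024 are made-up numbers chosen for illustration, and real training code would normally rely on the framework's own mixed-precision utilities.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8      # hypothetical tiny gradient, below FP16's smallest subnormal
scale = 1024.0   # hypothetical scale factor (a power of two)

print(to_fp16(grad))            # 0.0: the gradient underflows in FP16

scaled = to_fp16(grad * scale)  # scaling first keeps it representable
print(scaled > 0)               # True
print(scaled / scale)           # unscaled in higher precision, roughly 1e-8
```

The recovered value is only approximately 1e-8, because the scaled gradient is still rounded to FP16 precision, but it is no longer flushed to zero.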

Here's a simple example in PyTorch that demonstrates the precision loss when converting from FP32 to FP16:
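A minimal sketch of such an example, assuming PyTorch is installed. The specific input values are illustrative: they are chosen so that each one is a distinct FP32 number just above 1.0.

```python
import torch

# Four values that differ only in the last few FP32 mantissa bits.
fp32 = torch.tensor([1.0, 1.0000001, 1.0000002, 1.0000003], dtype=torch.float32)
fp16 = fp32.to(torch.float16)

print(fp32.tolist())  # four distinct values
print(fp16.tolist())  # [1.0, 1.0, 1.0, 1.0]

# Spacing between 1.0 and the next representable value in each format.
print(torch.finfo(torch.float32).eps)
print(torch.finfo(torch.float16).eps)  # 0.0009765625, i.e. 2**-10
```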

As you can see, the small differences between 1.0 and 1.0000001, etc., are lost when converting to FP16: with only 10 mantissa bits, adjacent FP16 values near 1.0 are about 0.001 apart, which is far too coarse to represent these tiny differences.

This is a common issue in deep learning when using lower-precision formats.
