Mistybeach Floating Point Encoding Visualizer
Introduction
This page is intended to illustrate (and explain a little bit of)
how floating point numbers are encoded. The primary focus is on
IEEE-754 floating point numbers. A good understanding
of these provides an excellent start toward understanding other
floating point formats.
IEEE-754 floating point numbers can represent the following
values:
- Infinities
IEEE-754 floating point numbers can
represent both positive and negative infinity. Infinities
happen when results get too large in either the positive or
negative direction, and also when a non-zero number is divided by zero.
- Values that are Not A Number (NaN)
NaNs happen when
one tries to perform math operations with no defined result.
Canonical examples are dividing zero by zero, subtracting
infinity from infinity, and taking the square root of a negative
number. (Dividing a non-zero number by zero yields an infinity,
not a NaN.)
- Numeric values
These values are represented in three distinct categories:
- "Normal" values (e.g. 2½): Normal values
can be positive or negative but do not include zero (0)!
- Sub-normal values: These are values of exceptionally
small magnitude (either positive or negative).
- Zero: Zero values can be positive or negative. This is
strange, but IEEE-754 floating point formats really do contain
both a positive and a negative zero. (The short sketch after
this list demonstrates each of these categories.)
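To make these categories concrete, here is a small Python sketch (CPython floats are IEEE-754 64-bit doubles, so the same categories apply):

```python
# Touring the IEEE-754 value categories with ordinary Python floats.
import math
import sys

pos_inf = float("inf")               # positive infinity
neg_inf = float("-inf")              # negative infinity
print(pos_inf, neg_inf)              # inf -inf

nan = pos_inf - pos_inf              # an undefined operation yields NaN
print(nan == nan)                    # False: NaN compares unequal, even to itself

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)          # True: +0.0 and -0.0 compare equal...
print(math.copysign(1.0, neg_zero))  # -1.0: ...but the sign bit is really there

tiny = sys.float_info.min / 2        # half the smallest normal double
print(tiny > 0.0)                    # True: a sub-normal, non-zero value
```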
IEEE-754 floating point numbers encode these values with a sign
bit, an exponent, and a mantissa for each floating point value.
For IEEE-754 16 bit floating point values, we get a single bit
for the sign (with the bit being set meaning the value is negative),
five bits of exponent and 10 bits of mantissa. 32-bit IEEE-754 floating
point values have one sign bit, 8 bits of exponent and 23 bits of
mantissa. A table later in this document shows some other floating point formats.
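To see those fields directly, here is a minimal Python sketch that splits a 32-bit float into its sign, exponent, and mantissa (the helper name fp32_fields is just for illustration):

```python
# Splitting a 32-bit IEEE-754 float into its three fields:
# 1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits.
import struct

def fp32_fields(x: float):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))  # raw 32-bit pattern
    sign = bits >> 31                  # top bit
    exponent = (bits >> 23) & 0xFF     # next 8 bits
    mantissa = bits & 0x7FFFFF         # low 23 bits
    return sign, exponent, mantissa

print(fp32_fields(2.5))   # (0, 128, 2097152): 2.5 is +1.01 (binary) * 2**1
print(fp32_fields(-2.5))  # (1, 128, 2097152): only the sign bit differs
```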
A non-obvious key insight into representing floating point values
is that, written in binary, every non-zero value has a leading one
bit. The implication of this is that the leading one bit can be
implied rather than explicitly stored. For all normal values the
mantissa thus contains a hidden leading one bit, which buys an
extra bit of precision for free! (Sub-normal values and zero are
flagged by an all-zeros exponent field and have no hidden bit.)
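Here is the decoding direction as a sketch, using the fp32 field widths from above and making the hidden bit explicit (again, the helper name is illustrative):

```python
# Reconstructing a 32-bit float's value from its fields.
def fp32_value(sign: int, exponent: int, mantissa: int) -> float:
    if exponent == 0:
        # Sub-normal (or zero): no hidden bit, fixed exponent of -126.
        significand = mantissa / 2**23
        scale = 2.0 ** -126
    else:
        # Normal: prepend the implicit leading one bit.
        significand = 1.0 + mantissa / 2**23
        scale = 2.0 ** (exponent - 127)
    return (-1) ** sign * significand * scale

print(fp32_value(0, 128, 2097152))  # 2.5, matching the fields shown earlier
```

(This sketch ignores the all-ones exponent field, which is reserved for infinities and NaNs.)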
A good way to get a feel for how floating point number encodings work
is to play with the floating-point playground here. The playground
shows how 16-bit IEEE-754 floating point numbers create explicit
binary number bit patterns (complete with a decimal point). These
16-bit floating point numbers are similar to 32-bit and 64-bit
floating point numbers, but smaller and thus easier to visualize.
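As a rough sketch of what such a playground computes, the following function (an illustration, not the playground's actual code) turns 16 raw bits into an explicit binary numeral; for brevity it only handles normal values:

```python
# Rendering fp16 bits (1 sign, 5 exponent, 10 mantissa) as a binary numeral.
# Normal values only; sub-normals, infinities, and NaNs are not handled here.
def fp16_binary(bits: int) -> str:
    sign = "-" if bits >> 15 else "+"
    exponent = ((bits >> 10) & 0x1F) - 15    # remove the bias of 15
    mantissa = bits & 0x3FF
    digits = "1" + format(mantissa, "010b")  # hidden bit + 10 explicit bits
    point = 1 + exponent                     # where the radix point lands
    if point <= 0:
        digits = "0" * (1 - point) + digits  # pad with leading zeros
        point = 1
    elif point > len(digits):
        digits += "0" * (point - len(digits))
    return f"{sign}{digits[:point]}.{digits[point:]}"

print(fp16_binary(0x4100))  # +10.100000000, the binary form of 2.5
```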
Notice a few key aspects of floating point numbers:
- The range of numbers that can be expressed is
large. A 16-bit floating point number can specify numbers
spanning 40 explicit bit positions (plus the sign bit, making 41).
- Only some of those explicit bits can actually be set, though:
10 or 11 of the 40 (not counting the sign bit) for fp16 numbers,
and those bits must form a contiguous block.
- With the mantissa set to all zeros we still get non-zero
numbers because of that implicit mantissa bit. In fact,
every normal power-of-two value (e.g. 1, 2, 4, ½) has
all of its mantissa bits set to zero, as the quick check
after this list shows!
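Python's struct module can produce fp16 bit patterns directly (the "e" format character), which makes this easy to verify:

```python
# Quick check: fp16 powers of two have an all-zeros mantissa field.
import struct

for x in (0.5, 1.0, 2.0, 4.0):
    (bits,) = struct.unpack("<H", struct.pack("<e", x))  # "e" = IEEE fp16
    mantissa = bits & 0x3FF
    print(f"{x}: bits={bits:016b} mantissa={mantissa:010b}")
```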
fp16 format
Play with this fp16 playground. Set the sign bit, move the slider to change
the exponent bits, set some of the mantissa bits. See what happens!
fp8 (5e2m) format
And to see the tradeoffs as one loses bits and has to decide
how to allocate the remaining bits between exponent and mantissa,
we have an 8-bit floating point format: five bits of exponent and
two bits of mantissa.
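Decoding this format takes only a few lines. The sketch below assumes the same exponent bias of 15 as fp16, which is a common choice for e5m2-style formats but not the only possible one:

```python
# Decoding an 8-bit 5e2m value: 1 sign bit, 5 exponent bits, 2 mantissa bits.
# Assumes an fp16-style bias of 15; a sketch, not a spec-exact implementation.
def fp8_5e2m_value(bits: int) -> float:
    sign = bits >> 7
    exponent = (bits >> 2) & 0x1F
    mantissa = bits & 0x3
    if exponent == 0x1F:  # all-ones exponent: infinity or NaN
        return float("nan") if mantissa else (-1) ** sign * float("inf")
    if exponent == 0:     # sub-normal (or zero): no hidden bit
        return (-1) ** sign * (mantissa / 4) * 2.0 ** -14
    return (-1) ** sign * (1 + mantissa / 4) * 2.0 ** (exponent - 15)

print(fp8_5e2m_value(0x41))  # 2.5: with two mantissa bits, values are sparse
```

Then try the playground itself: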