Mistybeach Floating Point Encoding Visualizer
Introduction
This page is intended to illustrate (and explain a little bit of)
how floating point numbers are encoded. The primary focus is on
IEEE-754 floating point numbers. A good understanding
of these provides an excellent start toward understanding other
floating point formats.
IEEE-754 floating point numbers can represent the following
values:
- Infinities
IEEE-754 floating point numbers can
represent both positive and negative infinity. Infinities
happen when results get too large in either the positive or
negative direction, and also when a non-zero number is divided by zero.
- Values that are Not A Number (NaN)
NaNs happen when
one tries to perform math operations with no defined result.
Canonical examples are dividing zero by zero, subtracting
infinity from infinity, and taking the square root of a negative
number. (Dividing a non-zero number by zero yields an infinity,
not a NaN.)
- Numeric values
These values are represented in three distinct categories:
- "Normal" values (e.g. 2½): Normal values
can be positive or negative but do not include zero (0)!
- Sub-normal values: These are values of exceptionally
small magnitude (either positive or negative).
- Zero: Zero values can be positive or negative. This is
strange, but IEEE-754 floating point formats really do contain
both a positive and a negative zero. (The short sketch after
this list demonstrates each of these categories.)
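To make these categories concrete, here is a small Python sketch (CPython floats are IEEE-754 64-bit doubles, so the same categories apply):

```python
# Touring the IEEE-754 value categories with ordinary Python floats.
import math
import sys

pos_inf = float("inf")               # positive infinity
neg_inf = float("-inf")              # negative infinity
print(pos_inf, neg_inf)              # inf -inf

nan = pos_inf - pos_inf              # an undefined operation yields NaN
print(nan == nan)                    # False: NaN compares unequal, even to itself

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)          # True: +0.0 and -0.0 compare equal...
print(math.copysign(1.0, neg_zero))  # -1.0: ...but the sign bit is really there

tiny = sys.float_info.min / 2        # half the smallest normal double
print(tiny > 0.0)                    # True: a sub-normal, non-zero value
```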
IEEE-754 floating point numbers encode these values with a sign
bit, an exponent, and a mantissa for each floating point value.
For IEEE-754 16 bit floating point values, we get a single bit
for the sign (with the bit being set meaning the value is negative),
five bits of exponent and 10 bits of mantissa. 32-bit IEEE-754 floating
point values have one sign bit, 8 bits of exponent and 23 bits of
mantissa. A table later in this document shows some other floating point formats.
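To see those fields directly, here is a minimal Python sketch that splits a 32-bit float into its sign, exponent, and mantissa (the helper name fp32_fields is just for illustration):

```python
# Splitting a 32-bit IEEE-754 float into its three fields:
# 1 sign bit, 8 exponent bits (biased by 127), 23 mantissa bits.
import struct

def fp32_fields(x: float):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))  # raw 32-bit pattern
    sign = bits >> 31                  # top bit
    exponent = (bits >> 23) & 0xFF     # next 8 bits
    mantissa = bits & 0x7FFFFF         # low 23 bits
    return sign, exponent, mantissa

print(fp32_fields(2.5))   # (0, 128, 2097152): 2.5 is +1.01 (binary) * 2**1
print(fp32_fields(-2.5))  # (1, 128, 2097152): only the sign bit differs
```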
A non-obvious key insight into representing floating point values
is that, written in binary, every non-zero value has a leading one
bit. The implication of this is that the leading one bit can be
implied rather than explicitly stored. For all normal values the
mantissa thus contains a hidden leading one bit, which buys an
extra bit of precision for free! (Sub-normal values and zero are
flagged by an all-zeros exponent field and have no hidden bit.)
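Here is the decoding direction as a sketch, using the fp32 field widths from above and making the hidden bit explicit (again, the helper name is illustrative):

```python
# Reconstructing a 32-bit float's value from its fields.
def fp32_value(sign: int, exponent: int, mantissa: int) -> float:
    if exponent == 0:
        # Sub-normal (or zero): no hidden bit, fixed exponent of -126.
        significand = mantissa / 2**23
        scale = 2.0 ** -126
    else:
        # Normal: prepend the implicit leading one bit.
        significand = 1.0 + mantissa / 2**23
        scale = 2.0 ** (exponent - 127)
    return (-1) ** sign * significand * scale

print(fp32_value(0, 128, 2097152))  # 2.5, matching the fields shown earlier
```

(This sketch ignores the all-ones exponent field, which is reserved for infinities and NaNs.)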
A good way to get a feel for how floating point number encodings work
is to play with the floating-point playground here. The playground
shows how 16-bit IEEE-754 floating point numbers create explicit
binary number bit patterns (complete with a decimal point). These
16-bit floating point numbers are similar to 32-bit and 64-bit
floating point numbers, but smaller and thus easier to visualize.
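As a rough sketch of what such a playground computes, the following function (an illustration, not the playground's actual code) turns 16 raw bits into an explicit binary numeral; for brevity it only handles normal values:

```python
# Rendering fp16 bits (1 sign, 5 exponent, 10 mantissa) as a binary numeral.
# Normal values only; sub-normals, infinities, and NaNs are not handled here.
def fp16_binary(bits: int) -> str:
    sign = "-" if bits >> 15 else "+"
    exponent = ((bits >> 10) & 0x1F) - 15    # remove the bias of 15
    mantissa = bits & 0x3FF
    digits = "1" + format(mantissa, "010b")  # hidden bit + 10 explicit bits
    point = 1 + exponent                     # where the radix point lands
    if point <= 0:
        digits = "0" * (1 - point) + digits  # pad with leading zeros
        point = 1
    elif point > len(digits):
        digits += "0" * (point - len(digits))
    return f"{sign}{digits[:point]}.{digits[point:]}"

print(fp16_binary(0x4100))  # +10.100000000, the binary form of 2.5
```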
Notice a few key aspects of floating point numbers:
- The range of numbers that can be expressed is
large. A 16-bit floating point number can specify numbers
spanning 40 explicit bit positions (plus the sign bit, making 41).
- Only some of those explicit bits can actually be set, though:
10 or 11 of the 40 (not counting the sign bit) for fp16 numbers,
and those bits must form a contiguous block.
- With the mantissa set to all zeros we still get non-zero
numbers because of that implicit mantissa bit. In fact,
every normal power-of-two value (e.g. 1, 2, 4, ½) has
all of its mantissa bits set to zero, as the quick check
after this list shows!
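Python's struct module can produce fp16 bit patterns directly (the "e" format character), which makes this easy to verify:

```python
# Quick check: fp16 powers of two have an all-zeros mantissa field.
import struct

for x in (0.5, 1.0, 2.0, 4.0):
    (bits,) = struct.unpack("<H", struct.pack("<e", x))  # "e" = IEEE fp16
    mantissa = bits & 0x3FF
    print(f"{x}: bits={bits:016b} mantissa={mantissa:010b}")
```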
fp16 format
Play with this fp16 playground. Set the sign bit, move the slider to change
the exponent bits, set some of the mantissa bits. See what happens!
fp8 (5e2m) format
And to see the tradeoffs as one loses bits and has to decide
how to allocate the remaining bits between exponent and mantissa,
we have an 8-bit floating point format: five bits of exponent and
two bits of mantissa.
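Decoding this format takes only a few lines. The sketch below assumes the same exponent bias of 15 as fp16, which is a common choice for e5m2-style formats but not the only possible one:

```python
# Decoding an 8-bit 5e2m value: 1 sign bit, 5 exponent bits, 2 mantissa bits.
# Assumes an fp16-style bias of 15; a sketch, not a spec-exact implementation.
def fp8_5e2m_value(bits: int) -> float:
    sign = bits >> 7
    exponent = (bits >> 2) & 0x1F
    mantissa = bits & 0x3
    if exponent == 0x1F:  # all-ones exponent: infinity or NaN
        return float("nan") if mantissa else (-1) ** sign * float("inf")
    if exponent == 0:     # sub-normal (or zero): no hidden bit
        return (-1) ** sign * (mantissa / 4) * 2.0 ** -14
    return (-1) ** sign * (1 + mantissa / 4) * 2.0 ** (exponent - 15)

print(fp8_5e2m_value(0x41))  # 2.5: with two mantissa bits, values are sparse
```

Then try the playground itself: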