Sunday, September 22, 2019 | Toby Opferman

## IEEE Floating Point

```
Toby Opferman
http://www.opferman.net
programming@opferman.net

IEEE Floating Point

In this simple tutorial we will learn the IEEE floating point formats for
single, double and extended precision, and how to convert to and from
them.  Before you read this, I assume you can convert whole binary
numbers to decimal.  This tutorial will teach you how to convert real numbers
to floating point, but the only new part is the digits after the decimal
point.  The whole number part uses the same conversion as before, so you
should read the number base tutorial first if you do not already know how.

Single Precision is 32 bits (4 Bytes)
Double Precision is 64 bits (8 Bytes)
Extended Precision is 80 bits (10 Bytes)

Single:   [ 1 Sign Bit | 8 Bit Exponent  | 23 Bit Mantissa ]
Double:   [ 1 Sign Bit | 11 Bit Exponent | 52 Bit Mantissa ]
Extended: [ 1 Sign Bit | 15 Bit Exponent | 64 Bit Mantissa ]

Sign Bit is 1 = Negative, 0 = Positive

The following are 5 different numbers encoded in each of the 3 IEEE
formats, shown as hex bytes:

Number    Single Precision
1.0       3F 80 00 00
2.0       40 00 00 00
0.0       00 00 00 00
1.08      3F 8A 3D 71
10.333    41 25 53 F8

Number    Double Precision
1.0       3F F0 00 00 00 00 00 00
2.0       40 00 00 00 00 00 00 00
0.0       00 00 00 00 00 00 00 00
1.08      3F F1 47 AE 14 7A E1 48
10.333    40 24 AA 7E F9 DB 22 D1

Number    Extended Precision
1.0       3F FF 80 00 00 00 00 00 00 00
2.0       40 00 80 00 00 00 00 00 00 00
0.0       00 00 00 00 00 00 00 00 00 00
1.08      3F FF 8A 3D 70 A3 D7 0A 3D 71
10.333    40 02 A5 53 F7 CE D9 16 87 2B

Single Precision

The exponent is stored in excess 127 (the stored value is the real exponent
plus 127) and the mantissa is 1.xxxx, where the leading 1 is implied and not stored.

3F 80 00 00

Sign Bit      Exp       1.Mantissa
0          01111111  00000000000000000000000

127 - 127 = 0     1.0 bitshift 0 places

Exponent = 0, so the number is 1.0

Double Precision

The exponent is stored in excess 1023 and the mantissa is 1.xxxx

3F F0 00 00 00 00 00 00

Sign Bit    Exp                        1.Mantissa
0        01111111111  0000000000000000000000000000000000000000000000000000

Exponent stored Excess 1023

1023 - 1023 = 0    1.0 bitshift 0 places

1.0 is the answer.

Extended Precision

3F FF 80 00 00 00 00 00 00 00

Sign Bit         Exp                         Mantissa
0          011111111111111 1000000000000000000000000000000000000000000000000000000000000000

Exponent stored Excess 16383

16383 - 16383 = 0    1.0 bitshift 0 places

1.0 is the answer.

Single Precision:

40 00 00 00
Sign Bit     Exp        1.Mantissa
0         10000000  00000000000000000000000

128 - 127 = 1

1.0 bitshift 1 place gives 10.0 in binary, so the answer is 2.0

Now, you can see the others work the same way, and the next one is obviously 0.
(Zero is a special case: every bit, including the exponent, is 0.)
But, now it's time to take the Mantissa out and find out what it is.

3F 8A 3D 71

Sign Bit    Exp               1.Mantissa
0         01111111      00010100011110101110001

Well, we know the exponent is 0, since the exponent bits are the same as in
the first example (127 - 127 = 0).
Now, to get the number, it's almost the same as when you convert regular
binary to decimal, with a small difference.

Instead of each bit representing positive powers of 2, they represent
negative powers of 2 (starting left to right):
0  0  0  1  0  1  0  0  0   1   1   1   1   0   1   0   1   1   1   0   0   0   1
-1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21 -22 -23

So, you add up the powers of 2 that aren't 0.  (You multiply it with the bit,
if the bit is 0, you will get 0 so only add up the ones with a set bit)

1  1   1   1   1   1   1   1   1   1   1
-4 -6 -10 -11 -12 -13 -15 -17 -18 -19 -23

2^-4 + 2^-6 + 2^-10 + 2^-11 + 2^-12 + 2^-13 + 2^-15 + 2^-17 + 2^-18 + 2^-19 + 2^-23

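.080000042915   (approximately)

Add the implied 1 and apply the exponent:
1.080000042915 * 2^0 = 1.080000042915

You are going to have trailing digits, because .08 cannot be written exactly
in a finite number of binary places.  The stored value is only the closest fit.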

To convert TO IEEE you do the following:

You divide the fraction by 2^-1 (the same as multiplying by 2), and the whole
number part of each result is the next bit.  Then you take off the whole
number and divide what is left again.  The last column is the bit position.

.08/2^-1 = 0.16   1
.16/2^-1 = 0.32   2
.32/2^-1 = 0.64   3
.64/2^-1 = 1.28   4
.28/2^-1 = 0.56   5
.56/2^-1 = 1.12   6
.12/2^-1 = 0.24   7
.24/2^-1 = 0.48   8
.48/2^-1 = 0.96   9
.96/2^-1 = 1.92  10
.92/2^-1 = 1.84  11
.84/2^-1 = 1.68  12
.68/2^-1 = 1.36  13
.36/2^-1 = 0.72  14
.72/2^-1 = 1.44  15
.44/2^-1 = 0.88  16
.88/2^-1 = 1.76  17
.76/2^-1 = 1.52  18
.52/2^-1 = 1.04  19
.04/2^-1 = 0.08  20
.08/2^-1 = 0.16  21
.16/2^-1 = 0.32  22
.32/2^-1 = 0.64  23
.64/2^-1 = 1.28  24

Result Bit Position
0.16   1
0.32   2
0.64   3
1.28   4
0.56   5
1.12   6
0.24   7
0.48   8
0.96   9
1.92  10
1.84  11
1.68  12
1.36  13
0.72  14
1.44  15
0.88  16
1.76  17
1.52  18
1.04  19
0.08  20
0.16  21
0.32  22
0.64  23
1.28  24

Notice that the whole numbers spell out the binary for the positions, with 1 exception.
We have a 0 in the 23rd bit place, where the binary above has a 1.  This is
because the value was taken out to 24 places, like we did above, and rounded.  Since
there is a 1 in the 24th place, we round up to a 1 in the 23rd bit place.
Therefore, we have gotten the same result.

Now, we do the same to the whole number part (1 in decimal is 1 in binary) and we have:

1.00010100011110101110001

Now, we know we need to get it into 1.xxxx * 2^n form.  But, it looks like it's already
there.  So, we knock off the 1 and keep the 00010100011110101110001,
and we just put down 127 for the exponent, since 127 - 127 = 0 shifts.  The sign bit is 0 as well.

10.333

Next we will decode this number from both the double precision and the
extended precision encodings.

----------------------------------------------
Double Precision

40 24 AA 7E F9 DB 22 D1
01000000 00100100 10101010 01111110 11111001 11011011 00100010 11010001

0 10000000010     0100101010100111111011111001110110110010001011010001

10000000010 = 1026

1026 - 1023 = 3

Remember, all exponents are stored in EXCESS, so you subtract the excess
FROM your stored exponent to get the shift.  Remember also, a negative shift means
shift the binary point to the left and a positive shift means shift it
to the right.  Only after the shift do you start counting mantissa positions.

Insert implied 1.

1.0100101010100111111011111001110110110010001011010001

Shift 3 places

1010.0101010100111111011111001110110110010001011010001

The whole number is 10.  (1010b = Ah = 10)

The mantissa.
0101010100111111011111001110110110010001011010001

Find the bit positions with 1

2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 25, 26, 27, 29, 30, 32, 33, 36, 40, 42, 43, 45, 49

2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 +
2^-14 + 2^-15 + 2^-16 + 2^-18 + 2^-19 + 2^-20 +
2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 + 2^-29 +
2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 +
2^-43 + 2^-45 + 2^-49 = .333 (approximately)

-------------------------------------------------------
Extended Precision

40 02 A5 53 F7 CE D9 16 87 2B
0100 0000 0000 0010 1010 0101 0101 0011 1111 0111 1100 1110 1101 1001 0001 0110 1000 0111 0010 1011

0 100000000000010 1010010101010011111101111100111011011001000101101000011100101011

100000000000010 = 16386

16386 - 16383 = 3

So, you have 1.010010101010011111101111100111011011001000101101000011100101011

Move the decimal 3 places

1010.010101010011111101111100111011011001000101101000011100101011

Now, comparing this with the double precision example, you will notice that in
extended precision the first bit in the Mantissa is actually the whole number
of 1.xxxxx.  So, the mantissa is 63 fraction bits plus 1 bit for the whole
number, which makes 64 bits.  Whereas in the other forms, single and double, the 1
isn't written into the mantissa, it's implied to be there.

Now, if we look at the part above the decimal point, we see it's 10.

10.xxxx  Now, we need to multiply out the powers of 2^-n and add.

mbit(n) = mantissa bit #n, counting from left to right.

You can say the mantissa is the Summation of mbit(n) * 2^-n for n = 1 up to
the number of mantissa bits.

Mantissa:

010101010011111101111100111011011001000101101000011100101011

The 1 is in bit positions:
2, 4, 6, 8, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 25, 26, 27, 29, 30, 32, 33, 36, 40, 42, 43, 45, 50, 51, 52, 55, 57, 59, 60

So,

2^-2 + 2^-4 + 2^-6 + 2^-8 + 2^-11 + 2^-12 + 2^-13 +
2^-14 + 2^-15 + 2^-16 + 2^-18 + 2^-19 + 2^-20 +
2^-21 + 2^-22 + 2^-25 + 2^-26 + 2^-27 + 2^-29 +
2^-30 + 2^-32 + 2^-33 + 2^-36 + 2^-40 + 2^-42 +
2^-43 + 2^-45 + 2^-50 + 2^-51 + 2^-52 + 2^-55 + 2^-57 + 2^-59 + 2^-60 = .333 (approximately)

Now, you see how the IEEE floating point format works in single precision,
double precision and extended precision.  The only difference, besides
the size of the exponent and mantissa, between single/double and extended
is that single and double precision have a leading 1 bit (1.Mantissa) that is
not stored in the format itself, whereas in the extended format the 1 bit is actually
IN the mantissa as the first bit and the binary point is implied to follow it.

And you notice again that rounding happened: the double precision mantissa stops
at bit 49, so the extended bits 50, 51 and 52 were rounded up into a 1 at bit 49.

Single precision and double precision math done on the FPU should
be decently accurate, since the FPU of the PC works internally in 80 bits.

Extended precision math does NOT have extra bits to spill into like the other two.
It goes to bit 80 and there is nothing beyond that to round from.  So, extended
floating point numbers aren't always extremely accurate to long decimal places;
they may only be as accurate as the double precision.  Then again, you do have
more places and it may help to even have an approximation of the end.  But, just
remember, the FPU carries single and double precision results out to 80 bits, so
they have good rounding approximations.

That is the end of the tutorial.  You have seen the formats, we have decoded them,
and we even converted to the format on one occasion.  So, you should understand
how to convert numbers to and from the IEEE single/double/extended floating
point formats.

```
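
The decoding steps in the tutorial above (split off the sign, subtract the excess from the stored exponent, add up the mantissa bits) can be sketched in a few lines of Python. This is only a sketch under some assumptions: `decode` and the `FORMATS` table are names made up for this example, and it only handles zero and normal numbers (no denormals, infinities or NaNs). The extended result is squeezed into a Python float, which is a double, so the last few extended bits get rounded away.

```python
import struct

# Per format: (exponent bits, stored mantissa bits, excess, leading 1 stored?)
# This table and the function name are illustrative, not part of the tutorial.
FORMATS = {
    "single":   (8, 23, 127, False),
    "double":   (11, 52, 1023, False),
    "extended": (15, 64, 16383, True),
}

def decode(hex_bytes, fmt):
    """Decode hex bytes such as "3F 80 00 00" by hand (zero and normal numbers only)."""
    exp_bits, man_bits, excess, explicit_one = FORMATS[fmt]
    raw = int(hex_bytes.replace(" ", ""), 16)

    # Split the raw bits into the three fields.
    sign = raw >> (exp_bits + man_bits)
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = raw & ((1 << man_bits) - 1)

    if exponent == 0 and mantissa == 0:
        return -0.0 if sign else 0.0

    if explicit_one:
        # Extended: the top mantissa bit is the whole number, 63 fraction bits follow.
        significand = mantissa / 2 ** (man_bits - 1)
    else:
        # Single/double: the leading 1 is implied, so add it back.
        significand = 1 + mantissa / 2 ** man_bits

    # Subtract the excess from the stored exponent to get the shift.
    value = significand * 2.0 ** (exponent - excess)
    return -value if sign else value

print(decode("3F 80 00 00", "single"))   # 1.0
print(decode("3F 8A 3D 71", "single"))   # close to 1.08, but not exact
# Cross-check the double precision bytes against Python's own decoder.
print(decode("40 24 AA 7E F9 DB 22 D1", "double")
      == struct.unpack(">d", bytes.fromhex("4024AA7EF9DB22D1"))[0])  # True
```

Dividing the mantissa by a power of 2 is the same thing as adding up 2^-n for every set bit, just done in one step instead of by hand.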

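Going the other way, the divide-by-2^-1 table from the tutorial can also be sketched in Python. Again, this is a rough sketch and not a full encoder: `fraction_bits` and `encode_single` are hypothetical names, only positive non-zero normal numbers are handled, and the rounding is the simple rule used in the tutorial (round up when the 24th bit is 1), which differs from the usual round-to-nearest-even only on exact ties. `Fraction` is used so the decimal steps stay exact, like doing them on paper.

```python
from fractions import Fraction

def fraction_bits(frac, count):
    """Repeatedly double frac (0 <= frac < 1); each whole part is the next bit."""
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)   # the whole-number part, 0 or 1
        bits.append(bit)
        frac -= bit       # take off the whole number and go again
    return bits

def encode_single(text):
    """Encode a positive decimal string such as "1.08" as 4 single precision bytes."""
    value = Fraction(text)
    if value <= 0:
        raise ValueError("this sketch only handles positive numbers")

    # Shift until the number has the form 1.xxxx, counting the shifts.
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1

    # 23 stored bits plus a 24th that decides the rounding.
    bits = fraction_bits(value - 1, 24)
    mantissa = int("".join(str(b) for b in bits[:23]), 2) + bits[23]

    # Adding (not OR-ing) lets a rounding carry spill into the exponent.
    word = ((exponent + 127) << 23) + mantissa
    return word.to_bytes(4, "big")

print(fraction_bits(Fraction("0.08"), 24))
print(encode_single("1.08").hex())     # 3f8a3d71
print(encode_single("10.333").hex())   # 412553f8
```

The first print shows the same 24 bits as the whole numbers in the table, and the two encodings match the single precision bytes listed at the top of the tutorial.
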
Professional software engineer with over 15 years...