Revision as of 09:30, 15 February 2021

An introduction to floating point numbers

You probably already know, that strange things can happen, when using floating point numbers.

An example:

import std.stdio;

void main()
{
    float a = 1000;
    float b = 1/a;
    float c = 1/b;
    writeln("Is ",a," == ",c,"? ",a==c?"Yes!":"No!");
}

Did you guess the answer?

Is 1000 == 1000? No!

If we want to understand this strange behavior we have to have a deeper look into floating point numbers. Doing so involves understanding the bit representation of the numbers involved. Unfortunately, floats have already 32 bits (the code for a above is 01000100011110100000000000000000) and with that many 0s and 1s it can easily happen that one can't see the forest for the trees.

For that reason I'll start with smaller floating point numbers, that I call nano floats. Nano floats have only 6 bits.

Nano floats

Floating point numbers consist of three parts: A sign bit, which is always exactly one bit, an exponent and a mantissa. Nano floats use 3 bits for the exponent and 2 bits for the mantissa. For example 1 100 01 is the bit representation of a nano float. Which number does this bit pattern represent?

You can think of floating point numbers as numbers written in scientific notation, known from physics: For example the speed of light is about +2.9979 * 10^^8 m/s. Here we've got the sign bit (+), the exponent (8) and the mantissa (2.9979). Putting this together, we could write that number as + 8 2.9979. This looks already a little bit like our number 1 100 01.

What we still need, is to know how the parts of that number are decoded. Let's start with the sign bit, which is easy. A 0 is + and a 1 is −. We now know, that our number is negative.

Next the exponent: 100 is the binary representation of 4. So our exponent is 4? No, it's not that easy. Exponents can also be negative. To achieve this, we have to subtract the so called bias. The bias can be calculated from the number of bits of the exponent. If r is the number of bits of the exponent, the bias is 2^^(r−1)−1. Here, we've got r=3, and therefore the bias is 2^^2−1=3, and finally we get our exponent, it's 4−3=1.

Now the mantissa. We've seen in the speed of light example above, that the mantissa was 2.9979. Note, that it is usual for scientific notation, that there is always exactly one integral digit in the mantissa, in this case 2. Additionally there are four fractional digits: 9979. Now, floating point numbers use binary code instead of decimal code. This implies, that the integral digit is (almost, see below) always 1. It would be a waste to save this 1 in our number. Therefore it's omitted. Adding it to our mantissa, we've got 1.01, which is a binary number. As decimal number it is 1.25. (I assume you are familiar with the conversion from binary numbers to decimal numbers. If not, take a look at the article in wikipedia)

Putting all together we have: 1 100 01 = − 1.25 * 2 ^^ 1 = −2.5.

Exercise

I'll add exercises throughout this document. I recommend to do them — you'll acquire a much better feeling for floating point numbers, when you do this on your own, instead of peeking at the answers. But of course, it's up to you.

Exercise 1: Write down all 64 bit patterns of nano floats in a table and calculate the value, which is represented by that value:

Bit pattern	Value
0 000 00
0 000 01
0 000 10
0 000 11
0 001 00
...
1 100 01	−2.5
...
1 111 11

Zero and denormalized numbers

The table from the exercise above can be visualized on a number line — all numbers, a nano float can hold, are marked by a vertical stroke:

One clearly sees, that on the outside, the numbers are sparse. The count of numbers increases, while approaching 0 from both sides. But then, suddenly they stop and leave a gap at zero. Let's zoom in:

Now we can clearly see the gap: There is no 0. The reason for this is, that 0 is the only number, where the integral bit of the mantissa has to be 0, because there is no 1 bit available.

To get around this, numbers with exponent 000, which are called denormalized numbers or subnormal numbers, are treated special: These numbers are considered to have an integral 0 bit implied to the mantissa (instead of the usuas 1 bit) and the exponent is increased by one (to get a uniform distribution of the numbers in the vicinity of 0, see diagrams below). So 1 000 10 would be decoded as - 0.5 * 2^(-2) = -0.125.

And when the mantissa is 00 we've got our zero: 0 000 00 = +0. Unfortunately, there is a second zero: 1 000 00 = −0.

Infinity and Not a Number

There is an other exponent, which is treated special: 111. This time, if the mantissa is 00 it is considered to be infinity and if it is 11 it denotes a number, that is not a number, a so called NaN. Other values for the mantissa are also considered to be NaNs, but this time with some special meanings attached, which is beyond the scope of this article. And yes, there are also the minus versions of all of these NaNs.

With that, our number line looks like this:

And the zoomed in version looks like this:

It can now clearly be seen, that the center is equispaced and the outsides have been truncated somewhat, with the benefit of having infinity and NaN (which obviously cannot be displayed on that numberline).

Exercise

Exercise 2: Add a third column to the table from exercise 1 and write the special values in that column.

Back to our example from the beginning

Now, that we know how floating point numbers look like, we can go back to the example given at the beginning of this article. We start with a slightly changed version of the example, using nano floats this time. 1000 cannot be represented as a nano float. It would be infinity. But with 1.75 and nano floats we can have a similar effect, like the 1000 with floats:

nanofloat a = 1.75;
nanofloat b = 1/a;
nanofloat c = 1/a;

This time we can use our tables from the exercises to look up the results:

1/1.75 = 0.57142857..., which cannot exactly be coded with a nanofloat. We have to choose between 0.5 and 0.625 as an approximation. Normally, floating point units are supposed to round to the nearest possibility in such cases. (There are other rounding modes, but I do not want to go deeper into this at this place.) Here 0.625 is closer to 0.57142857... than 0.5; that is, we finally arrive at b = 0.625.

1/0.625 = 1.6. Again a value, that cannot exactly be coded as nanofloat. We've got 1.5 and 1.75 as a choice. 1.5 is closer to 1.6 than 1.75 and therefore c = 1.5. We end up with 1.75 == a != c == 1.5.

For the sake of understanding the real floating point numbers, we repeat this with 1000 and float: The exponents of floats have 8 bits (and therefore a bias of 127) and a mantissa of 23 bits (plus one implied bit).

1000 can be exactly represented with a float: a = 0 10001000 11110100000000000000000. The exponent is 136, reduced by the bias we arrive at 9. The mantissa is 1.111101, which is 1.953125 in decimal notation. And 1.953125 * 2^^9 = 1000.

The reciprocal of 1000 is 0.001. This is a number, that cannot be represented exactly with a float. The best approximation is b = 0 01110101 00000110001001001101111, which is 0.001000000047497451305389404296875 in decimal notation.

The reciprocal of that last number is 999.999952502550950618369056343..., which cannot be represented with a final number of fractional digits in decimal notation, nor can it be represented exactly with a float. This time c = 0 10001000 11110011111111111111111 is the best approximation, which is 999.99993896484375 in decimal notation.

Now it should be clear, why the program above answers the question, whether a and c are the same with "No!": They are not equal.

... to be continued

Solutions

Exercise 1 and 2:

Bit pattern	Value	Special value
0 000 00	0.125	0.0
0 000 01	0.15625	0.0625
0 000 10	0.1875	0.125
0 000 11	0.21875	0.1875
0 001 00	0.25
0 001 01	0.3125
0 001 10	0.375
0 001 11	0.4375
0 010 00	0.5
0 010 01	0.625
0 010 10	0.75
0 010 11	0.875
0 011 00	1
0 011 01	1.25
0 011 10	1.5
0 011 11	1.75
0 100 00	2
0 100 01	2.5
0 100 10	3
0 100 11	3.5
0 101 00	4
0 101 01	5
0 101 10	6
0 101 11	7
0 110 00	8
0 110 01	10
0 110 10	12
0 110 11	14
0 111 00	16	infinity
0 111 01	20	special NaN
0 111 10	24	special NaN
0 111 11	28	NaN
1 000 00	−0.125	−0,0
1 000 01	−0.15625	−0.0625
1 000 10	−0.1875	−0.125
1 000 11	−0.21875	−0.1875
1 001 00	−0.25
1 001 01	−0.3125
1 001 10	−0.375
1 001 11	−0.4375
1 010 00	−0.5
1 010 01	−0.625
1 010 10	−0.75
1 010 11	−0.875
1 011 00	−1
1 011 01	−1.25
1 011 10	−1.5
1 011 11	−1.75
1 100 00	−2
1 100 01	−2.5
1 100 10	−3
1 100 11	−3.5
1 101 00	−4
1 101 01	−5
1 101 10	−6
1 101 11	−7
1 110 00	−8
1 110 01	−10
1 110 10	−12
1 110 11	−14
1 111 00	−16	−infinity
1 111 01	−20	−special NaN
1 111 10	−24	−special NaN
1 111 11	−28	−NaN

Revision as of 09:27, 15 February 2021 (view source) Berni44 (talk \| contribs) ← Older edit		Revision as of 09:30, 15 February 2021 (view source) Berni44 (talk \| contribs) m (→‎Back to our example from the beginning) Newer edit →
Line 126:		Line 126:
	The reciprocal of that last number is <code>999.999952502550950618369056343...</code>, which cannot be represented with a final number of fractional digits in decimal notation, nor can it be represented exactly with a float. This time <code>c = 0 10001000 11110011111111111111111</code> is the best approximation, which is <code>999.99993896484375</code> in decimal notation.		The reciprocal of that last number is <code>999.999952502550950618369056343...</code>, which cannot be represented with a final number of fractional digits in decimal notation, nor can it be represented exactly with a float. This time <code>c = 0 10001000 11110011111111111111111</code> is the best approximation, which is <code>999.99993896484375</code> in decimal notation.

−	~~No,~~ it should be clear, why the program above answers the question, whether <code>a</code> and <code>c</code> are the same with "No!": They are not equal.	+	Now it should be clear, why the program above answers the question, whether <code>a</code> and <code>c</code> are the same with "No!": They are not equal.

Difference between revisions of "User:Berni44/Floatingpoint"

Revision as of 09:30, 15 February 2021

Contents

An introduction to floating point numbers

Nano floats

Exercise

Zero and denormalized numbers

Infinity and Not a Number

Exercise

Back to our example from the beginning

Solutions

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools