User:Berni44/Floatingpoint

An introduction to floating point numbers

You probably already know that strange things can happen when using floating point numbers.

An example:

import std.stdio;

void main()
{
    float a = 1000;
    float b = 1 / a; // the reciprocal of a
    float c = 1 / b; // the reciprocal of the reciprocal: a again?
    writeln("Is ", a, " == ", c, "? ", a == c ? "Yes!" : "No!");
}

Did you guess the answer?

Is 1000 == 1000? No!
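
Both numbers are printed as 1000 because writeln shows only a few significant digits by default. The following sketch prints more digits to make the difference visible. The helper name roundTrip is made up for this illustration, and the result assumes that the intermediate values are really stored as 32 bit floats, as in the example above.

```d
import std.stdio;

// Hypothetical helper: computes 1/(1/a), storing every
// intermediate result in a float, like the example above.
float roundTrip(float a)
{
    float b = 1 / a;
    float c = 1 / b;
    return c;
}

void main()
{
    float a = 1000;
    float c = roundTrip(a);
    // %.10f prints ten fractional digits instead of the short default form
    writefln("a = %.10f", a);
    writefln("c = %.10f", c);
    // the difference is tiny, but it is not zero
    writefln("a - c = %g", a - c);
}
```

With these extra digits, c should show up as a value slightly below 1000.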

Nano floats

To understand this strange behavior, we have to look at the bit representation of the numbers involved. Unfortunately, a float already has 32 bits, and with that many 0s and 1s it is easy to miss the forest for the trees.

For that reason I'll start with smaller floating point numbers, which I call nano floats. They have only 6 bits.

Floating point numbers consist of three parts: a sign, which always takes exactly one bit, an exponent and a mantissa. Nano floats use 3 bits for the exponent and 2 bits for the mantissa. For example, 1 100 01 is the bit representation of a nano float. Which number does this bit pattern represent?

You can think of floating point numbers as numbers written in scientific notation, as known from physics: For example, the speed of light is about +2.9979 * 10^^8 m/s. Here we've got the sign (+), the exponent (8) and the mantissa (2.9979). Putting this together, we could write that number as + 8 2.9979. This already looks a little bit like our number 1 100 01.

What remains is to decode the parts of that number. Let's start with the sign bit: That is easy. A 0 is + and a 1 is −. Next the exponent: 100 is the binary code of 4. So our exponent is 4? No, it's not that easy. Exponents can also be negative. To make this possible, we have to subtract the so-called bias from the stored value. The bias can be calculated from the number of bits of the exponent. If r is the number of bits of the exponent, the bias is 2^^(r-1)-1. Here, we've got r=3, and therefore the bias is 2^^2-1=3. Finally, we get our exponent: it's 4-3=1.
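
The bias calculation can be sketched in a few lines of D. The function names bias and decodeExponent are made up for this illustration. The last line relies on the fact that a float has 8 exponent bits, which is not part of the nano float discussion.

```d
import std.stdio;

// Bias for an exponent that is r bits wide: 2^^(r-1)-1.
// The shift 1 << (r - 1) computes 2^^(r-1) for integers.
int bias(int r)
{
    return (1 << (r - 1)) - 1;
}

// Decode the stored exponent bits by subtracting the bias.
int decodeExponent(int storedBits, int r)
{
    return storedBits - bias(r);
}

void main()
{
    writeln(bias(3));                  // nano float: 3
    writeln(decodeExponent(0b100, 3)); // 4 - 3 = 1
    writeln(bias(8));                  // 8 exponent bits, as in a float: 127
}
```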

Now the mantissa. In the speed of light example above, we've seen the mantissa 2.9979. It has exactly one integral digit (2) and four fractional digits (9979). In the binary system, the integral digit is (almost, see below) always 1. It would be a waste to store this 1 in our number, so it's omitted. Putting it back in front of the stored bits 01, our mantissa is 1.01 in binary, or 1.25 in decimal.
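
As a small sketch, the stored bits can be turned into the mantissa by treating them as a binary fraction and adding the omitted 1. The function name mantissa and its parameters are made up for this illustration, and it only covers the usual case where the integral digit is 1.

```d
import std.stdio;

// storedBits: the mantissa bits as an integer
// m:          how many mantissa bits are stored
double mantissa(uint storedBits, uint m)
{
    // storedBits / 2^^m is the fractional part; the leading 1 is added back
    return 1.0 + cast(double) storedBits / (1u << m);
}

void main()
{
    writeln(mantissa(0b01, 2)); // binary 1.01 = 1 + 1/4 = 1.25
}
```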

So, putting it all together, we have: 1 100 01 = − 1.25 * 2^^1 = −2.5.
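
The whole decoding can be put into one small function. This is only a sketch: the name decodeNano is made up, and it handles only the usual case with an omitted leading 1, not the exceptions hinted at above.

```d
import std.math : ldexp;
import std.stdio;

// Decode a 6 bit nano float: 1 sign bit, 3 exponent bits, 2 mantissa bits.
double decodeNano(uint bits)
{
    enum exponentBits = 3;
    enum mantissaBits = 2;
    enum bias = (1 << (exponentBits - 1)) - 1; // 2^^2-1 = 3

    // cut the bit pattern into its three parts
    uint sign = (bits >> (exponentBits + mantissaBits)) & 1;
    uint storedExponent = (bits >> mantissaBits) & ((1 << exponentBits) - 1);
    uint storedMantissa = bits & ((1 << mantissaBits) - 1);

    int exponent = cast(int) storedExponent - bias;
    double mantissa = 1.0 + cast(double) storedMantissa / (1 << mantissaBits);

    // ldexp(x, n) computes x * 2^^n
    double value = ldexp(mantissa, exponent);
    return sign ? -value : value;
}

void main()
{
    writeln(decodeNano(0b1_100_01)); // -1.25 * 2^^1 = -2.5
}
```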

... to be continued
