# Difference between revisions of "User:Berni44/Floatingpoint"


## Revision as of 08:15, 14 February 2021

## An introduction to floating point numbers

You probably already know that strange things can happen when using floating point numbers.

An example:

```
import std.stdio;

void main()
{
    float a = 1000;
    float b = 1 / a; // reciprocal of a
    float c = 1 / b; // reciprocal of the reciprocal: is this a again?
    writeln("Is ", a, " == ", c, "? ", a == c ? "Yes!" : "No!");
}
```

Did you guess the answer?

Is 1000 == 1000? No!
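One way to see that the two values really differ is to print them with more digits, together with their raw bits. The following is only a sketch: the helper name `bitsOf` is made up for this illustration, and the exact digits printed may depend on compiler and platform.

```d
import std.stdio;

// Hypothetical helper: reinterpret the 32 bits of a float as an unsigned integer.
uint bitsOf(float x)
{
    return *cast(uint*) &x;
}

void main()
{
    float a = 1000;
    float b = 1 / a;
    float c = 1 / b;

    // %.10f prints ten digits after the decimal point,
    // %032b prints the bit pattern, zero-padded to 32 binary digits.
    writefln("a = %.10f  bits: %032b", a, bitsOf(a));
    writefln("c = %.10f  bits: %032b", c, bitsOf(c));
}
```

The default formatting used by `writeln` rounds to a few significant digits, which is why both values showed up as `1000` above.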

To understand this strange behavior, we have to look at the bit representation of the numbers involved. Unfortunately, floats already have 32 bits, and with that many 0s and 1s it is easy to lose sight of the forest for the trees.

For that reason I'll start with smaller floating point numbers, which I call *nano floats*. Nano floats have only 6 bits.

## Nano floats

Floating point numbers consist of three parts: a sign bit, which is always exactly one bit, an exponent and a mantissa. Nano floats use 3 bits for the exponent and 2 bits for the mantissa. For example, `1 100 01` is the bit representation of a nano float. Which number does this bit pattern represent?
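The layout just described can be sketched in code. This is only an illustration: the struct `NanoFields` and the function `split` are made-up names, and the pattern is assumed to be stored in the lowest 6 bits of an integer, with the sign bit on top.

```d
import std.stdio;

// Hypothetical helper type: the three raw fields of a 6-bit nano float.
struct NanoFields
{
    uint sign;     // 1 bit
    uint exponent; // 3 bits, raw (not yet decoded)
    uint mantissa; // 2 bits, raw (not yet decoded)
}

// Cut a 6-bit pattern into its three fields with shifts and masks.
NanoFields split(uint bits)
{
    return NanoFields(
        (bits >> 5) & 0b1,   // topmost bit
        (bits >> 2) & 0b111, // next three bits
        bits & 0b11          // lowest two bits
    );
}

void main()
{
    auto f = split(0b1_100_01);
    // prints: sign=1 exponent=100 mantissa=01
    writefln("sign=%b exponent=%03b mantissa=%02b", f.sign, f.exponent, f.mantissa);
}
```

This only separates the fields; what they mean is a separate question.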

You can think of floating point numbers as numbers written in scientific notation, as known from physics. For example, the speed of light is about `+2.9979 * 10^^8 m/s`. Here we've got the sign (`+`), the exponent (`8`) and the mantissa (`2.9979`). Putting this together, we could write that number as `+ 8 2.9979`. This already looks a little bit like our number `1 100 01`.

What we still need to know is how the parts of that number are decoded. Let's start with the sign bit, which is easy: a `0` is `+` and a `1` is `−`. We now know that our number is negative.

Next the exponent: `100` is the binary code of `4`. So our exponent is `4`? No, it's not that easy. Exponents can also be negative. To achieve this, we have to subtract the so-called *bias*. The bias can be calculated from the number of bits of the exponent: if *r* is the number of exponent bits, the bias is `2^^(r−1)−1`. Here we've got *r*=3, so the bias is `2^^2−1 = 3`, and finally we get our exponent: `4−3 = 1`.
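The bias formula is easy to check in code. In this small sketch the function name `bias` is my own choice, and the power of two is computed with a shift instead of `^^`. The last line uses the fact that a 32-bit float has 8 exponent bits.

```d
import std.stdio;

// Bias for an exponent field of r bits: 2^^(r-1) - 1, computed with a shift.
int bias(int r)
{
    return (1 << (r - 1)) - 1;
}

void main()
{
    writeln(bias(3));         // nano float: 3
    writeln(0b100 - bias(3)); // decoded exponent of 1 100 01: 1
    writeln(bias(8));         // 8 exponent bits, as in a 32-bit float: 127
}
```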

Now the mantissa. In the speed of light example above, the mantissa was `2.9979`. Note that in scientific notation there is always exactly one integral digit in the mantissa, in this case `2`. Additionally, there are four fractional digits: `9979`. Now, floating point numbers use binary digits instead of decimal digits. Since the integral digit is nonzero and the only nonzero binary digit is `1`, the integral digit is (almost, see below) always `1`. It would be a waste to store this `1` in our number, so it's omitted. Our stored mantissa bits are `01`; putting the omitted `1` back in front, we get `1.01` in binary, which is `1.25` in decimal.
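As a quick check of the binary-to-decimal step, here is a minimal sketch that puts the omitted `1` back in front of the stored bits. The function name `significand` and its parameters are assumptions made for this example.

```d
import std.stdio;

// Value of a mantissa with the omitted leading 1 added back.
// `fraction` holds the stored bits, `width` says how many bits there are.
double significand(uint fraction, uint width)
{
    // The stored bits are the digits after the binary point,
    // so they count in units of 1 / 2^^width.
    return 1.0 + cast(double) fraction / (1u << width);
}

void main()
{
    writeln(significand(0b01, 2)); // 1.01 in binary, prints 1.25
}
```

With two mantissa bits, the only possible results are 1, 1.25, 1.5 and 1.75.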

Putting it all together, we have: `1 100 01 = − 1.25 * 2 ^^ 1 = −2.5`.
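The whole decoding can be sketched as one small function. This is only an illustration under the rules described so far (sign bit, 3-bit exponent with bias 3, 2-bit mantissa with an omitted leading `1`); the name `decodeNano` is made up, and the pattern is assumed to sit in the lowest 6 bits of an integer.

```d
import std.math : ldexp;
import std.stdio;

// Decode a 6-bit nano float using only the rules described so far.
double decodeNano(uint bits)
{
    immutable sign     = (bits >> 5) & 0b1;
    immutable exponent = cast(int) ((bits >> 2) & 0b111) - 3; // subtract the bias
    immutable mantissa = 1.0 + (bits & 0b11) / 4.0;           // add the omitted 1

    // ldexp(m, e) computes m * 2^^e.
    immutable value = ldexp(mantissa, exponent);
    return sign ? -value : value;
}

void main()
{
    writeln(decodeNano(0b1_100_01)); // prints -2.5
}
```

Using `ldexp` avoids a general power function; scaling by a power of two is all that is needed here.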

### Exercise

I'll add exercises throughout this document. I recommend doing them: you'll get a much better feeling for floating point numbers when you work them out on your own instead of peeking at the answers. But of course, it's up to you.

*Exercise 1: Write down all 64 bit patterns of nano floats in a table and calculate the value represented by each pattern:*

Bit pattern | Value |
---|---|
0 000 00 | |
0 000 01 | |
0 000 10 | |
0 000 11 | |
0 001 00 | |
... | |
1 100 01 | −2.5 |
... | |
1 111 11 | |

... to be continued

## Solutions

*Exercise 1:*

Bit pattern | Value |
---|---|
0 000 00 | 0.125 |
0 000 01 | 0.15625 |
0 000 10 | 0.1875 |
0 000 11 | 0.21875 |
0 001 00 | 0.25 |
0 001 01 | 0.3125 |
0 001 10 | 0.375 |
0 001 11 | 0.4375 |
0 010 00 | 0.5 |
0 010 01 | 0.625 |
0 010 10 | 0.75 |
0 010 11 | 0.875 |
0 011 00 | 1 |
0 011 01 | 1.25 |
0 011 10 | 1.5 |
0 011 11 | 1.75 |
0 100 00 | 2 |
0 100 01 | 2.5 |
0 100 10 | 3 |
0 100 11 | 3.5 |
0 101 00 | 4 |
0 101 01 | 5 |
0 101 10 | 6 |
0 101 11 | 7 |
0 110 00 | 8 |
0 110 01 | 10 |
0 110 10 | 12 |
0 110 11 | 14 |
0 111 00 | 16 |
0 111 01 | 20 |
0 111 10 | 24 |
0 111 11 | 28 |
1 000 00 | −0.125 |
1 000 01 | −0.15625 |
1 000 10 | −0.1875 |
1 000 11 | −0.21875 |
1 001 00 | −0.25 |
1 001 01 | −0.3125 |
1 001 10 | −0.375 |
1 001 11 | −0.4375 |
1 010 00 | −0.5 |
1 010 01 | −0.625 |
1 010 10 | −0.75 |
1 010 11 | −0.875 |
1 011 00 | −1 |
1 011 01 | −1.25 |
1 011 10 | −1.5 |
1 011 11 | −1.75 |
1 100 00 | −2 |
1 100 01 | −2.5 |
1 100 10 | −3 |
1 100 11 | −3.5 |
1 101 00 | −4 |
1 101 01 | −5 |
1 101 10 | −6 |
1 101 11 | −7 |
1 110 00 | −8 |
1 110 01 | −10 |
1 110 10 | −12 |
1 110 11 | −14 |
1 111 00 | −16 |
1 111 01 | −20 |
1 111 10 | −24 |
1 111 11 | −28 |
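If you want to check your own table against the one above, a short loop can print all 64 rows. This is a sketch, not part of the original exercise: `decodeNano` is a made-up name, it applies the same decoding rules used for the table (bias 3, omitted leading `1`), and it assumes the pattern is stored in the lowest 6 bits of an integer.

```d
import std.math : ldexp;
import std.stdio;

// Decode a 6-bit nano float: sign bit, 3-bit exponent with bias 3,
// 2-bit mantissa with an omitted leading 1.
double decodeNano(uint bits)
{
    immutable value = ldexp(1.0 + (bits & 0b11) / 4.0,
                            cast(int) ((bits >> 2) & 0b111) - 3);
    return (bits >> 5) & 1 ? -value : value;
}

void main()
{
    writeln("Bit pattern | Value |");
    foreach (bits; 0 .. 64)
    {
        // Print sign, exponent and mantissa bits separately, then the value.
        writefln("%b %03b %02b | %s |",
                 (bits >> 5) & 1, (bits >> 2) & 0b111, bits & 0b11,
                 decodeNano(bits));
    }
}
```

The output uses an ASCII minus sign and the default number formatting of `writefln`, so it may look slightly different from the table above.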