## How It Works
The gist of the basic idea is this: we have digits in memory and in addition we have a position of the radix point among these digits, i.e. both digits and position of the radix point can change. The fact that the radix point can move is reflected in the name *floating point*. In the end any number stored in float can be written with a finite number of digits with a radix point, e.g. 12.34. Notice that any such number can also always be written as a simple fraction of two integers (e.g. 12.34 = 1 * 10 + 2 * 1 + 3 * 1/10 + 4 * 1/100 = 617/50), i.e. any such number is always a rational number. This is why we say that floats represent fractional numbers and not true real numbers (real numbers such as [pi](pi.md), [e](e.md) or square root of 2 can only be approximated).
More precisely floats represent numbers by storing two main parts: the *base* -- actual encoded digits, called **mantissa** (or significand etc.) -- and the position of the radix point. The position of the radix point is called the **exponent** because mathematically the floating point works similarly to the scientific notation of extreme numbers that uses exponentiation. For example instead of writing 0.0000123 scientists write 123 * 10^-7 -- here 123 would be the mantissa and -7 the exponent.
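To make this concrete, here is a tiny C sketch (purely illustrative, not any real format) that keeps the two parts separately and reconstructs the value of the number:

```
#include <stdio.h>
#include <math.h>

typedef struct
{
  int mantissa; // the encoded digits, e.g. 123
  int exponent; // the position of the radix point, e.g. -7
} DecFloat;

double decFloatValue(DecFloat f)
{
  return f.mantissa * pow(10,f.exponent);
}

int main(void)
{
  DecFloat f = {123,-7};                 // 123 * 10^-7
  printf("%.7f\n",decFloatValue(f));     // prints 0.0000123
  return 0;
}
```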
Though various numeric bases can be used, in [computers](computer.md) we almost exclusively use [base 2](binary.md), so let's stick with base 2 from now on. Our numbers will therefore be of the format:
*mantissa * 2^exponent*
So for example the binary representation `110011` stores mantissa `110` (6) and exponent `011` (3, i.e. -1 after subtracting the offset 4), so the number it represents is 6 * 2^-1 = 3.
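As a sketch, decoding this toy format might look like the following in C (assuming the layout just described: upper 3 bits mantissa, lower 3 bits exponent stored with offset -4):

```
#include <stdio.h>

/* decode our toy 6 bit float: upper 3 bits = mantissa, lower 3 bits
   = exponent stored with offset -4 (stored 0..7 means -4..3) */
double toyFloatValue(unsigned char bits)
{
  int mantissa = (bits >> 3) & 0x07;
  int exponent = (bits & 0x07) - 4;

  return exponent >= 0 ?
    mantissa * (double) (1 << exponent) :
    mantissa / (double) (1 << -exponent);
}

int main(void)
{
  printf("%f\n",toyFloatValue(0x33)); // 110011: 6 * 2^-1 = 3
  printf("%f\n",toyFloatValue(0x3f)); // 111111: 7 * 2^3 = 56
  return 0;
}
```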
Note a few things: firstly our format is [shit](shit.md) because some numbers have multiple representations, e.g. 0 can be represented as `000000`, `000001`, `000010`, `000011` etc., in fact we have 8 zeros! That's unforgivable and formats used in practice address this (usually by prepending an implicit 1 to mantissa).
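For illustration, this is how the implicit 1 plays out in the common IEEE 754 32 bit float -- a small C sketch (assuming `float` is the usual binary32 type) that pulls out the three stored fields:

```
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
  float f = 12.34f;
  uint32_t bits;

  memcpy(&bits,&f,sizeof(bits)); // reinterpret the raw bits

  unsigned sign     = bits >> 31;
  int      exponent = (int) ((bits >> 23) & 0xff) - 127; // bias 127
  unsigned mantissa = bits & 0x7fffff; // 23 bits, leading 1 implied

  // for normal numbers the value is (-1)^sign * 1.mantissa * 2^exponent;
  // the leading 1 is never stored, so each value has one representation
  printf("sign: %u, exponent: %d, mantissa bits: 0x%06x\n",
    sign,exponent,mantissa);

  return 0;
}
```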
Secondly observe the non-uniform distribution of our numbers: whilst we have good resolution close to 0 (we can represent 1/16, 2/16, 3/16, ...), the resolution in high numbers drops (the highest number we can represent is 56 but the second highest is 48, we can NOT represent e.g. 50 exactly). Realize that obviously with 6 bits we can still represent only 64 numbers at most! So float is NOT a magical way to get more numbers: with integers on 6 bits we can represent numbers from 0 to 63 spaced exactly by 1, while with our floating point we can represent numbers spaced as closely as 1/16th, but only in the region near 0 -- we pay the price of having big gaps in higher numbers.
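This gap isn't just theory, we can brute force check it -- a short C test (reusing the toy decoder from above) that walks all 64 bit patterns:

```
#include <stdio.h>

double toyFloatValue(unsigned char bits) // same decoder as above
{
  int mantissa = (bits >> 3) & 0x07;
  int exponent = (bits & 0x07) - 4;

  return exponent >= 0 ?
    mantissa * (double) (1 << exponent) :
    mantissa / (double) (1 << -exponent);
}

int main(void)
{
  int found = 0;

  for (int i = 0; i < 64; ++i) // try every possible 6 bit code
    if (toyFloatValue(i) == 50.0)
      found = 1;

  puts(found ? "50 is representable" : "50 is NOT representable");

  return 0;
}
```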
Also notice that things like simple addition of numbers become more difficult and time consuming: you have to include conversions and [rounding](rounding.md) -- while with fixed point addition is a single machine instruction, the same as integer addition, here with a software implementation we might end up with dozens of instructions (specialized hardware can perform the addition fast, but still, not all computers have such hardware).
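To get a feel for why, below is a naive sketch of addition in our toy format (ignoring signs, proper rounding and other details a real implementation has to handle) -- compare this to the single add instruction integers need:

```
#include <stdio.h>

/* naive addition in the toy format (3 bit mantissa, 3 bit exponent
   with offset -4): align radix points, add, renormalize */
unsigned char toyFloatAdd(unsigned char a, unsigned char b)
{
  int mA = (a >> 3) & 0x07, eA = a & 0x07;
  int mB = (b >> 3) & 0x07, eB = b & 0x07;

  if (eA < eB) // let A be the operand with the bigger exponent
  {
    int t;
    t = mA; mA = mB; mB = t;
    t = eA; eA = eB; eB = t;
  }

  mB >>= eA - eB; // align the radix points (truncates, i.e. rounds down)

  int m = mA + mB;

  while (m > 7) // renormalize: the mantissa overflowed its 3 bits
  {
    m >>= 1;
    eA++;
  }

  if (eA > 7) // exponent overflow: saturate to the biggest number
  {
    m = 7;
    eA = 7;
  }

  return (unsigned char) ((m << 3) | eA);
}

int main(void)
{
  unsigned char r = toyFloatAdd(0x33,0x33); // 3 + 3

  printf("mantissa: %d, exponent: %d\n", // 6 and 0, i.e. 6 * 2^0 = 6
    (r >> 3) & 0x07,(r & 0x07) - 4);

  return 0;
}
```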