less_retarded_wiki/randomness.md

142 lines
12 KiB
Markdown
Raw Normal View History

2023-04-02 20:57:34 +02:00
# Randomness
*Not to be confused with [pseudorandomess](pseudorandomness.md).*
TODO
2023-09-28 17:49:46 +02:00
## Randomness Tests
TODO
One of the most basic is the **[chi-squared ncept of our measure is following: for given N, we consider a sequence s more random if it gives a lower probability of us correctly predicting the next bit from the subsequencetest](chi_squared_test.md)** whose description can be found e.g. in the *Art of Computer Programming* book. TODO
{ The following is a method I wrote about here (includes some code): https://codeberg.org/drummyfish/my_writings/src/branch/master/randomness.md, I am almost certainly not the first to invent this, but I haven't found what this is called, so for now I'm calling it "my" test, not implying any ownership of course :) If you know what this method is called, please send me a mail. ~drummyfish }
**[Drummyfish's](drummyfish.md) randomness test**: this test tries to measure the unpredictability, the inability to predict what binary digit will follow. As an input to the test we suppose a binary sequence *S* of length *N* bits that's repeating forever (for example for *N = 2* a possible sequence is 10 meaning we are really considering an infinite sequence 1010101010...). We suppose an observer knows the sequence and that it's repeating (consider he has for example been watching us broadcast it for a long time and he noticed we are just repeating the same sequence over and over), then we ask: if the observer is given a random (and randomly long) subsequence *S2* of the main sequence *S*, what's the average probability he can correctly predict the bit that will follow? This average probability is our measured randomness *r* -- the lower the *r*, the "more random" the sequence *S* is according to this test. For different *N* there are different minimum possible values of *r*, it is for example not possible to achieve *r < 0.7* for *N = 3* etc. The following table shows this test's most random sequences for given *N*, along with their count and *r*.
| seq. len. | most random looking sequences |count| min. r |
| --------- | --------------------------------------------------------------------------------------------- | --- | ------ |
| 1 | 0, 1 | 2 | 1.00 |
| 2 | 01, 10 | 2 | 0.50 |
| 3 | 001, 010, 011, 100, 101, 110 | 6 | ~0.72 |
| 4 | 0011, 0110, 1001, 1100 | 4 | ~0.78 |
| 5 | 00101, 01001, 01010, 01011, 01101, 10010, 10100, 10101, 10110, 11010 | 10 | ~0.82 |
| 6 | 000101, 001010, 010001, 010100, 010111, 011101, 100010, 101000, 101011, 101110, 110101, 111010| 12 | ~0.86 |
| 7 | 0001001, 0010001, 0010010, 0100010, 0100100, 0110111, 0111011, 1000100, 1001000, 1011011, ... | 14 | ~0.88 |
| 8 | 00100101, 00101001, 01001001, 01001010, 01010010, 01011011, 01101011, 01101101, 10010010, ... | 16 | ~0.89 |
| 9 | 000010001, 000100001, 000100010, 001000010, 001000100, 010000100, 010001000, 011101111, ... | 18 | ~0.90 |
| 10 | 0010010101, 0010101001, 0100100101, 0100101010, 0101001001, 0101010010, 0101011011, ... | 20 | ~0.91 |
| 11 | 00010001001, 00010010001, 00100010001, 00100010010, 00100100010, 01000100010, 01000100100, ...| 22 | ~0.92 |
| 12 | 001010010101, 001010100101, 010010100101, 010010101001, 010100101001, 010100101010, ... | 24 | ~0.92 |
| 13 | 0010010100101, 0010100100101, 0010100101001, 0100100101001, 0100101001001, 0100101001010, ... | 26 | ~0.93 |
| ... | ... | ... | ... |
2023-04-02 20:57:34 +02:00
## Truly Random Sequence Example
WORK IN PROGRESS { Also I'm not too good at statistics lol. ~drummyfish }
2023-05-21 15:09:23 +02:00
Here is a sequence of 1000 bits which we most definitely could consider truly random as it was generated by physical coin tosses:
2023-04-02 20:57:34 +02:00
{ The method I used to generate this: I took a plastic bowl and 10 coins, then for each round I threw the coins into the bowl, shook them (without looking, just in case), then rapidly turned it around and smashed it against the ground. I took the bowl up and wrote the ten generated bits by reading the coins kind of from "top left to bottom right" (heads being 1, tails 0). ~drummyfish }
```
00001110011101000000100001011101111101010011100011
01001101110100010011000101101001000010111111101110
10110110100010011011010001000111011010100100010011
11111000111011110111100001000000001101001101010000
11111111001000111100100011010110001011000001001000
10001010111110100111110010010101001101010000101101
10110000001101001010111100100100000110000000011000
11000001001111000011011101111110101101111011110111
11010001100100100110001111000111111001101111010010
10001001001010111000010101000100000111010110011000
00001010011100000110011010110101011100101110110010
01010010101111101000000110100011011101100100101001
00101101100100100101101100111101001101001110111100
11001001100110001110000000110000010101000101000100
00110111000100001100111000111100011010111100011011
11101111100010111000111001010110011001000011101000
01001111100101001100011100001111100011111101110101
01000101101100010000010110110000001101001100100110
11101000010101101111100111011011010100110011110000
10111100010100000101111001111011010110111000010101
```
Let's now take a look at how random the sequence looks, i.e. basically how likely it is that by generating random numbers by tossing a coin will give us a sequence with statistical properties (such as the ratio of 1s and 0s) that our obtained sequence has.
There are **494 1s and 506 0s**, i.e. the ratio is approximately 0.976, deviating from 1.0 (the value that infinitely many coin tosses should converge to) by only 0.024. We can use the [binomial distribution](binomial_distribution.md) to calculate the "rarity" of getting this deviation or higher one; here we get about 0.728, i.e. a pretty high probability, meaning that if we perform 1000 coin tosses like the one we did, we may expect to get the deviation we got or higher in more than 70% of cases (if on the other hand we only got e.g. 460 1s, this probability would be only 0.005, suggesting the coins we used weren't fair). If we take a look at how the ratio (rounded to two fractional digits) evolves after each round of performing additional 10 coin tosses, we see it gets pretty close to 1 after only about 60 tosses and stabilizes quite nicely after about 100 tosses: 0.67, 0.54, 0.67, 0.90, 0.92, 1.00, 0.94, 0.90, 0.88, 1.00, 1.04, 1.03, 0.97, 1.00, 0.97, 1.03, 1.10, 1.02, 0.98, 0.96, 1.02, 1.02, 1.02, 1.00, 0.95, 0.95, 0.99, 0.99, 0.99, 0.97, 0.95, 0.95, 0.96, 0.93, 0.90, 0.88, 0.90, 0.93, 0.95, 0.98, 0.98, 0.97, 0.97, 0.99, 1.00, 0.98, 0.98, 0.98, 0.97, 0.96, 0.95, 0.94, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, 0.97, 0.97, 0.95, 0.94, 0.93, 0.93, 0.93, 0.94, 0.94, 0.94, 0.96, 0.95, 0.96, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96, 0.97, 0.97, 0.96, 0.96, 0.95, 0.95, 0.95, 0.96, 0.97, 0.97, 0.97, 0.97, 0.96, 0.97, 0.98, 0.98.
Let's try the [chi-squared test](chi_squared_test.md) (the kind of basic "randomness" test): *D = (494 - 500)^2 / 500 + (506 - 500)^2 / 500 = 0.144*; now in the table for the chi square distribution for 1 degree of freedom (i.e. two categories, 0 and 1, minus one) we see this value of *D* falls somewhere around 30%, which is not super low but not very high either, so we can see the test doesn't invalidate the hypothesis that we got numbers from a uniform random number generator. { I did this according to Knuth's *Art of Computer Programming* where he performed a test with dice and arrived at a number between 25% and 50% which he interpreted in the same way. For a scientific paper such confidence would of course be unacceptable because there we try to "prove" the validity of our hypothesis. Here we put much lower confidence level as we're only trying not fail the test. To get a better confidence we'd probably have to perform many more than 1000 tosses. ~drummyfish }
We can try to convert this to a sequence of integers of different binary sizes and just "intuitively" see if the sequences still looks random, i.e. if there are no patterns such as e.g. the numbers only being odd or the histograms of the sequences being too unbalanced, we could also possibly repeat the chi-squared test etc.
The sequence as 100 10 bit integers (numbers from 0 to 1023) is:
```
57 832 535 501 227 311 275 90 267 1006
730 155 273 874 275 995 759 528 52 848
1020 572 565 556 72 555 935 805 309 45
704 842 969 24 24 772 963 479 695 759
838 294 241 998 978 548 696 337 29 408
41 774 429 370 946 330 1000 104 886 297
182 293 719 308 956 806 398 12 84 324
220 268 911 107 795 958 184 917 612 232
318 332 451 911 885 278 784 364 52 806
929 367 630 851 240 753 261 926 859 533
```
As 200 5 bit integers (numbers from 0 to 31):
```
1 25 26 0 16 23 15 21 7 3 9 23 8 19 2 26 8 11 31 14
22 26 4 27 8 17 27 10 8 19 31 3 23 23 16 16 1 20 26 16
31 28 17 28 17 21 17 12 2 8 17 11 29 7 25 5 9 21 1 13
22 0 26 10 30 9 0 24 0 24 24 4 30 3 14 31 21 23 23 23
26 6 9 6 7 17 31 6 30 18 17 4 21 24 10 17 0 29 12 24
1 9 24 6 13 13 11 18 29 18 10 10 31 8 3 8 27 22 9 9
5 22 9 5 22 15 9 20 29 28 25 6 12 14 0 12 2 20 10 4
6 28 8 12 28 15 3 11 24 27 29 30 5 24 28 21 19 4 7 8
9 30 10 12 14 3 28 15 27 21 8 22 24 16 11 12 1 20 25 6
29 1 11 15 19 22 26 19 7 16 23 17 8 5 28 30 26 27 16 21
```
Which has the following histogram:
```
number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
count: 6 6 3 6 5 5 7 5 11 10 7 6 7 3 4 5
number: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
count: 7 9 3 5 4 8 7 8 9 4 8 6 8 6 6 6
```
And as 250 4 bit integers (numbers from 0 to 15):
```
0 14 7 4 0 8 5 13 15 5 3 8 13 3 7 4 4 12 5 10 4 2 15 14 14
11 6 8 9 11 4 4 7 6 10 4 4 15 14 3 11 13 14 1 0 0 13 3 5 0
15 15 2 3 12 8 13 6 2 12 1 2 2 2 11 14 9 15 2 5 4 13 4 2 13
11 0 3 4 10 15 2 4 1 8 0 6 3 0 4 15 0 13 13 15 10 13 14 15 7
13 1 9 2 6 3 12 7 14 6 15 4 10 2 4 10 14 1 5 1 0 7 5 9 8
0 10 7 0 6 6 11 5 7 2 14 12 9 4 10 15 10 0 6 8 13 13 9 2 9
2 13 9 2 5 11 3 13 3 4 14 15 3 2 6 6 3 8 0 12 1 5 1 4 4
3 7 1 0 12 14 3 12 6 11 12 6 15 11 14 2 14 3 9 5 9 9 0 14 8
4 15 9 4 12 7 0 15 8 15 13 13 5 1 6 12 4 1 6 12 0 13 3 2 6
14 8 5 6 15 9 13 11 5 3 3 12 2 15 1 4 1 7 9 14 13 6 14 1 5
```
This has the following histogram:
```
number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
count: 18 14 19 18 23 15 18 11 11 14 9 10 13 20 18 19
```
2023-05-21 15:09:23 +02:00
Another way to test data randomness may be by **trying to [compress](compression.md) it**, since compression is basically based on removing regularities, redundancy, leaving only randomness. A compression algorithm exploits [correlations](correlation.md) in input data and removes that which can later be reasoned out from what's left, but with a completely random data nothing should be correlated, it shouldn't be possible to reason out parts of such data from other parts of that data, hence compression can remove nothing and it shouldn't generally be possible to compress completely random data (though of course there exists a non-zero probability that in rare cases random data will have regular structure and we will be able to compress it). Let us try to perform this test with the `lz4` compression utility -- we convert our 1000 random bits to 125 random bytes and try to compress them. Then we will try to compress another sequence of 125 bytes, this time a non-random one -- a repeated alphabet in ASCII (`abcdefghijklmnopqrstuvwxyzabcdef...`). Here are the results:
| sequence (125 bytes) | compressed size |
| -------------------- | ---------------- |
| our random bits | 144 (115.20%) |
| `abcdef...` | 56 (44.80%) |
We see that while the algorithm was able to compress the non-random sequence to less than a half of the original size, it wasn't able to compress our data, it actually made it bigger! This suggests the data is truly random. Of course it would be good to test multiple compression algorithms and see if any one of them finds some regularity in the data, but the general idea has been presented.