# Randomness

*Not to be confused with [pseudorandomness](pseudorandomness.md).*

TODO

## Randomness Tests

TODO

One of the most basic is the **[chi-squared test](chi_squared_test.md)**, whose description can be found e.g. in the *Art of Computer Programming* book. TODO

{ The following is a method I wrote about here (includes some code): https://codeberg.org/drummyfish/my_writings/src/branch/master/randomness.md, I am almost certainly not the first to invent this, but I haven't found what this is called, so for now I'm calling it "my" test, not implying any ownership of course :) If you know what this method is called, please send me a mail. ~drummyfish }

**[Drummyfish's](drummyfish.md) randomness test**: this test tries to measure unpredictability, i.e. the inability to predict which binary digit will follow. As the input we suppose a binary sequence *S* of length *N* bits that's repeating forever (for example for *N = 2* a possible sequence is 10, meaning we are really considering the infinite sequence 1010101010...). We suppose an observer knows the sequence and knows that it's repeating (consider he has for example been watching us broadcast it for a long time and noticed we are just repeating the same sequence over and over). Then we ask: if the observer is given a random (and randomly long) subsequence *S2* of the main sequence *S*, what's the average probability that he can correctly predict the bit that will follow? This average probability is our measured randomness *r* -- the lower the *r*, the "more random" the sequence *S* is according to this test. For different *N* there are different minimum possible values of *r* -- it is for example not possible to achieve *r < 0.7* for *N = 3* etc. The following table shows this test's most random looking sequences for given *N*, along with their count and the minimum *r*.

| seq. len. | most random looking sequences                                                                   | count | min. r |
| --------- | ----------------------------------------------------------------------------------------------- | ----- | ------ |
| 1         | 0, 1                                                                                             | 2     | 1.00   |
| 2         | 01, 10                                                                                           | 2     | 0.50   |
| 3         | 001, 010, 011, 100, 101, 110                                                                     | 6     | ~0.72  |
| 4         | 0011, 0110, 1001, 1100                                                                           | 4     | ~0.78  |
| 5         | 00101, 01001, 01010, 01011, 01101, 10010, 10100, 10101, 10110, 11010                             | 10    | ~0.82  |
| 6         | 000101, 001010, 010001, 010100, 010111, 011101, 100010, 101000, 101011, 101110, 110101, 111010   | 12    | ~0.86  |
| 7         | 0001001, 0010001, 0010010, 0100010, 0100100, 0110111, 0111011, 1000100, 1001000, 1011011, ...    | 14    | ~0.88  |
| 8         | 00100101, 00101001, 01001001, 01001010, 01010010, 01011011, 01101011, 01101101, 10010010, ...    | 16    | ~0.89  |
| 9         | 000010001, 000100001, 000100010, 001000010, 001000100, 010000100, 010001000, 011101111, ...      | 18    | ~0.90  |
| 10        | 0010010101, 0010101001, 0100100101, 0100101010, 0101001001, 0101010010, 0101011011, ...          | 20    | ~0.91  |
| 11        | 00010001001, 00010010001, 00100010001, 00100010010, 00100100010, 01000100010, 01000100100, ...   | 22    | ~0.92  |
| 12        | 001010010101, 001010100101, 010010100101, 010010101001, 010100101001, 010100101010, ...          | 24    | ~0.92  |
| 13        | 0010010100101, 0010100100101, 0010100101001, 0100100101001, 0100101001001, 0100101001010, ...    | 26    | ~0.93  |
| ...       | ...                                                                                              | ...   | ...    |

## Truly Random Sequence Example

WORK IN PROGRESS { Also I'm not too good at statistics lol. ~drummyfish }

Here is a sequence of 1000 bits which we can most definitely consider truly random, as it was generated by physical coin tosses:

{ The method I used to generate this: I took a plastic bowl and 10 coins, then for each round I threw the coins into the bowl, shook them (without looking, just in case), then rapidly turned it upside down and smashed it against the ground. I lifted the bowl and wrote down the ten generated bits by reading the coins kind of from "top left to bottom right" (heads being 1, tails 0). ~drummyfish }

```
00001110011101000000100001011101111101010011100011
01001101110100010011000101101001000010111111101110
10110110100010011011010001000111011010100100010011
11111000111011110111100001000000001101001101010000
11111111001000111100100011010110001011000001001000
10001010111110100111110010010101001101010000101101
10110000001101001010111100100100000110000000011000
11000001001111000011011101111110101101111011110111
11010001100100100110001111000111111001101111010010
10001001001010111000010101000100000111010110011000
00001010011100000110011010110101011100101110110010
01010010101111101000000110100011011101100100101001
00101101100100100101101100111101001101001110111100
11001001100110001110000000110000010101000101000100
00110111000100001100111000111100011010111100011011
11101111100010111000111001010110011001000011101000
01001111100101001100011100001111100011111101110101
01000101101100010000010110110000001101001100100110
11101000010101101111100111011011010100110011110000
10111100010100000101111001111011010110111000010101
```

Let's now take a look at how random the sequence looks, i.e. basically how likely it is that generating random numbers by tossing a coin will give us a sequence with the statistical properties (such as the ratio of 1s and 0s) that our obtained sequence has.

There are **494 1s and 506 0s**, i.e. the ratio is approximately 0.976, deviating from 1.0 (the value that infinitely many coin tosses should converge to) by only 0.024. We can use the [binomial distribution](binomial_distribution.md) to calculate the "rarity" of getting this deviation or a higher one; here we get about 0.728, i.e. a pretty high probability, meaning that if we perform 1000 coin tosses like we did, we may expect to get our deviation or a higher one in more than 70% of cases (if on the other hand we only got e.g. 460 1s, this probability would be only 0.005, suggesting the coins we used weren't fair). If we take a look at how the ratio (rounded to two fractional digits) evolves after each round of 10 additional coin tosses, we see it gets pretty close to 1 after only about 60 tosses and stabilizes quite nicely after about 100 tosses: 0.67, 0.54, 0.67, 0.90, 0.92, 1.00, 0.94, 0.90, 0.88, 1.00, 1.04, 1.03, 0.97, 1.00, 0.97, 1.03, 1.10, 1.02, 0.98, 0.96, 1.02, 1.02, 1.02, 1.00, 0.95, 0.95, 0.99, 0.99, 0.99, 0.97, 0.95, 0.95, 0.96, 0.93, 0.90, 0.88, 0.90, 0.93, 0.95, 0.98, 0.98, 0.97, 0.97, 0.99, 1.00, 0.98, 0.98, 0.98, 0.97, 0.96, 0.95, 0.94, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, 0.97, 0.97, 0.95, 0.94, 0.93, 0.93, 0.93, 0.94, 0.94, 0.94, 0.96, 0.95, 0.96, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96, 0.97, 0.97, 0.96, 0.96, 0.95, 0.95, 0.95, 0.96, 0.97, 0.97, 0.97, 0.97, 0.96, 0.97, 0.98, 0.98.
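
To check the number, the binomial computation can be sketched in C for example like this (just an illustration, nothing standard; it exactly sums the probabilities of all outcomes whose deviation from the expected 500 1s is at least as big as ours; compile with `-lm`):

```
#include <stdio.h>
#include <math.h>

/* probability of getting exactly k 1s in n fair coin tosses,
   i.e. (n choose k) / 2^n, computed with lgamma to avoid overflow */
double binomialProb(int n, int k)
{
  return exp(lgamma(n + 1.0) - lgamma(k + 1.0) - lgamma(n - k + 1.0)
    - n * log(2.0));
}

int main(void)
{
  int n = 1000, ones = 494, k,
    d = n / 2 - ones; /* deviation from the expected 500 */
  double p = 0;

  if (d < 0)
    d = -d;

  for (k = 0; k <= n; ++k) /* sum all at least as extreme outcomes */
    if (k <= n / 2 - d || k >= n / 2 + d)
      p += binomialProb(n,k);

  printf("%f\n",p); /* prints approximately 0.728 */
  return 0;
}
```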

Let's try the [chi-squared test](chi_squared_test.md) (the kind of basic randomness test mentioned above): *D = (494 - 500)^2 / 500 + (506 - 500)^2 / 500 = 0.144*; now in the table of the chi-squared distribution for 1 degree of freedom (i.e. two categories, 0 and 1, minus one) we see this value of *D* falls somewhere around 30%, which is not super low but not very high either, so the test doesn't invalidate the hypothesis that we got the numbers from a uniform random number generator. { I did this according to Knuth's *Art of Computer Programming* where he performed a test with dice and arrived at a number between 25% and 50%, which he interpreted in the same way. For a scientific paper such confidence would of course be unacceptable, because there we try to "prove" the validity of our hypothesis; here we settle for a much lower confidence level as we're only trying not to fail the test. To get better confidence we'd probably have to perform many more than 1000 tosses. ~drummyfish }
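
Computing the statistic itself is simple, for example (again just an illustrative sketch):

```
#include <stdio.h>

int main(void)
{
  double observed[2] = {506, 494}; /* counts of 0s and 1s */
  double expected = 500;           /* 1000 tosses / 2 categories */
  double D = 0;
  int i;

  for (i = 0; i < 2; ++i)
    D += (observed[i] - expected) * (observed[i] - expected) / expected;

  printf("D = %f\n",D); /* prints D = 0.144000 */
  return 0;
}
```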

We can try to convert this to sequences of integers of different binary sizes and just "intuitively" see if those sequences still look random, i.e. check that there are no patterns such as the numbers only being odd or the histograms of the sequences being too unbalanced; we could also possibly repeat the chi-squared test etc.
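
The regrouping may be done for example with the following sketch (the bit string is shortened here -- paste in the whole sequence and set `BITWIDTH` to 10, 5 or 4 to obtain the sequences and histograms below):

```
#include <stdio.h>
#include <string.h>

#define BITWIDTH 10 /* size of the output integers: 10, 5 or 4 */

/* shortened, paste the whole 1000 bit sequence here: */
const char *bits = "00001110011101000000100001011101111101010011100011";

int main(void)
{
  int histogram[1 << BITWIDTH] = {0};
  int n = strlen(bits), i, j;

  for (i = 0; i + BITWIDTH <= n; i += BITWIDTH)
  {
    int value = 0;

    for (j = 0; j < BITWIDTH; ++j) /* most significant bit first */
      value = (value << 1) | (bits[i + j] == '1');

    histogram[value]++;
    printf("%d ",value);
  }

  printf("\n\nhistogram:\n");

  for (i = 0; i < (1 << BITWIDTH); ++i)
    printf("%d: %d\n",i,histogram[i]);

  return 0;
}
```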

The sequence as 100 10-bit integers (numbers from 0 to 1023) is:

```
57 832 535 501 227 311 275 90 267 1006
730 155 273 874 275 995 759 528 52 848
1020 572 565 556 72 555 935 805 309 45
704 842 969 24 24 772 963 479 695 759
838 294 241 998 978 548 696 337 29 408
41 774 429 370 946 330 1000 104 886 297
182 293 719 308 956 806 398 12 84 324
220 268 911 107 795 958 184 917 612 232
318 332 451 911 885 278 784 364 52 806
929 367 630 851 240 753 261 926 859 533
```

As 200 5-bit integers (numbers from 0 to 31):

```
1 25 26 0 16 23 15 21 7 3 9 23 8 19 2 26 8 11 31 14
22 26 4 27 8 17 27 10 8 19 31 3 23 23 16 16 1 20 26 16
31 28 17 28 17 21 17 12 2 8 17 11 29 7 25 5 9 21 1 13
22 0 26 10 30 9 0 24 0 24 24 4 30 3 14 31 21 23 23 23
26 6 9 6 7 17 31 6 30 18 17 4 21 24 10 17 0 29 12 24
1 9 24 6 13 13 11 18 29 18 10 10 31 8 3 8 27 22 9 9
5 22 9 5 22 15 9 20 29 28 25 6 12 14 0 12 2 20 10 4
6 28 8 12 28 15 3 11 24 27 29 30 5 24 28 21 19 4 7 8
9 30 10 12 14 3 28 15 27 21 8 22 24 16 11 12 1 20 25 6
29 1 11 15 19 22 26 19 7 16 23 17 8 5 28 30 26 27 16 21
```

Which has the following histogram:

```
number: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
count:  6  6  3  6  5  5  7  5  11 10 7  6  7  3  4  5

number: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
count:  7  9  3  5  4  8  7  8  9  4  8  6  8  6  6  6
```

And as 250 4-bit integers (numbers from 0 to 15):

```
0 14 7 4 0 8 5 13 15 5 3 8 13 3 7 4 4 12 5 10 4 2 15 14 14
11 6 8 9 11 4 4 7 6 10 4 4 15 14 3 11 13 14 1 0 0 13 3 5 0
15 15 2 3 12 8 13 6 2 12 1 2 2 2 11 14 9 15 2 5 4 13 4 2 13
11 0 3 4 10 15 2 4 1 8 0 6 3 0 4 15 0 13 13 15 10 13 14 15 7
13 1 9 2 6 3 12 7 14 6 15 4 10 2 4 10 14 1 5 1 0 7 5 9 8
0 10 7 0 6 6 11 5 7 2 14 12 9 4 10 15 10 0 6 8 13 13 9 2 9
2 13 9 2 5 11 3 13 3 4 14 15 3 2 6 6 3 8 0 12 1 5 1 4 4
3 7 1 0 12 14 3 12 6 11 12 6 15 11 14 2 14 3 9 5 9 9 0 14 8
4 15 9 4 12 7 0 15 8 15 13 13 5 1 6 12 4 1 6 12 0 13 3 2 6
14 8 5 6 15 9 13 11 5 3 3 12 2 15 1 4 1 7 9 14 13 6 14 1 5
```

This has the following histogram:

```
number: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
count:  18 14 19 18 23 15 18 11 11 14 9  10 13 20 18 19
```

Another way to test data randomness may be by **trying to [compress](compression.md) it**, since compression is basically based on removing regularities and redundancy, leaving only randomness. A compression algorithm exploits [correlations](correlation.md) in input data and removes that which can later be reasoned out from what's left, but in completely random data nothing should be correlated -- it shouldn't be possible to reason out parts of such data from other parts of the same data -- hence compression can remove nothing, and it shouldn't generally be possible to compress completely random data (though of course there is a non-zero probability that in rare cases random data will have a regular structure and we will be able to compress it). Let us try to perform this test with the `lz4` compression utility -- we convert our 1000 random bits to 125 random bytes and try to compress them. Then we try to compress another sequence of 125 bytes, this time a non-random one -- a repeated alphabet in ASCII (`abcdefghijklmnopqrstuvwxyzabcdef...`). Here are the results:

| sequence (125 bytes) | compressed size (bytes) |
| -------------------- | ----------------------- |
| our random bits      | 144 (115.20%)           |
| `abcdef...`          | 56 (44.80%)             |

We see that while the algorithm was able to compress the non-random sequence to less than half of its original size, it wasn't able to compress our data at all -- it actually made it bigger! This suggests the data is truly random. Of course it would be good to test multiple compression algorithms and see if any one of them finds some regularity in the data, but the general idea has been presented.
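
For completeness, here is a sketch of how the bits can be packed into the 125 bytes fed to the compressor (the most significant bit first order is an arbitrary choice here; the result shouldn't depend on it much):

```
#include <stdio.h>
#include <string.h>

/* shortened, paste the whole 1000 bit sequence here: */
const char *bits = "00001110011101000000100001011101111101010011100011";

int main(void)
{
  int n = strlen(bits), i, j;

  for (i = 0; i + 8 <= n; i += 8)
  {
    unsigned char byte = 0;

    for (j = 0; j < 8; ++j) /* most significant bit first */
      byte = (byte << 1) | (bits[i + j] == '1');

    putchar(byte); /* write raw bytes to stdout */
  }

  return 0;
}
```

Redirecting the program's output to a file and running e.g. `lz4` (or `gzip`, `xz`, ...) on it should reproduce the test; exact compressed sizes will of course differ between compressors and versions.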