less_retarded_wiki/randomness.md
2024-02-17 10:47:29 +01:00

162 lines
21 KiB
Markdown

# Randomness
*Not to be confused with [pseudorandomess](pseudorandomness.md).*
Randomness means unpredictability, lack of patterns and/or behavior without cause. Random events can only be predicted imperfectly using [probability](probability.md), because there is something present that's subject to chance, something we don't know; events may be random to us either because they are inherently random (i.e. they really have no cause, pattern etc.) or because we just lack knowledge or practical ability to perfectly predict the events. Randomness is one of the most basic, yet also one of the most difficult concepts to understand about our [Universe](universe.md) -- it's a phenomenon of immense practical importance, we encounter it every second of our daily lives, but it's also of no lesser interest to science, philosophy, art and religion. Whole libraries could be filled just with books about this topic, here we will be able to only scratch the surface of it by taking a look at the very basics of randomness, mostly as related to [programming](programming.md) and [math](math.md).
As with similarly wide terms, the word *randomness* and *random* may be defined in different ways and change meaning slightly depending on context, for example sometimes we have to distinguish between "true" randomness, such as that we encounter in [quantum mechanics](quantum.md) or that present in nondeterministic mathematical models, and [pseudorandomness](pseudorandomness.md) (what as a programmer you'll be probably dealing with), i.e. imitating this true randomness with [deterministic](determinism.md) ("non-randomly behaving") systems, e.g. sequences of numbers that are difficult to [compress](compression.md). Other times we call random anything at all that just deviates from usual order, as in "someone started randomly spamming me in chat". Let's briefly review a few terms related to this topic:
- **randomness**: The wide term meaning great unpredictability, which may be inherent or just apparent. We usually divide it to:
- **true randomness**: Randomness that is caused by inherently unpredictable behavior of a system, i.e. behavior that truly has no cause and is decided purely by chance, without ever being able to be perfectly predicted, even just theoretically; this is contrasted with pseudorandomness. A typical example given is [quantum physics](quantum.md) in which true randomness seems to be present in things such as some properties of elementary particles of the Universe -- though in fact this can never be proven with certainty, there is so much evidence of us not being able to predict quantum phenomena that we just mostly take it for the closest thing to true randomness in real world. However we can also see some purely mathematical models to have true randomness, simply because they define it so, e.g. a nondeterministic [Turing machine](turing_machine.md) is simply defined to sometimes make purely random decisions.
- **[pseudorandomness](pseudorandomness.md)**: Randomness that's at its basic level generated by a completely deterministic system, i.e. something (e.g. a sequence of numbers) that practically looks like something that would be generated by truly random system, which however stems from something completely non-random (e.g. a computer program). This is contrasted with pure randomness. Chaotic systems are mostly used to implement pseudorandomness. Pseudorandomness is used to imitate true randomness e.g. in computers, because it is mostly [good enough](good_enough.md) and true randomness is difficult to achieve.
- **non[determinism](determinism.md)**: Attribute of a [system](system.md), such as mathematical model or physics theory, of involving true randomness.
- **[chaos](chaos.md)**: Behavior that is deterministic (i.e. without true randomness) which however due to its mathematical properties is practically impossible to be predicted as there is no "nice" equation for it, resulting in practically having the same implications as true randomness. Chaotic behavior is predictable in theory but not in practice as it basically just requires "brute force" simulation, and so we often treat chaotic systems the same as completely random ones, with statistics and probability.
- **[probability](probability.md)**: Mathematical theory examining randomness, it formally models systems that include randomness and reasons about them, it gives us equations, for example it says how we infer the exact probability of something happening knowing probabilities of some individual events etc. It is a theoretical area and stresses deductive reasoning, i.e. it starts by defining a system and reasons about what such system will do.
- **[statistics](statistics.md)**: Applying probability theory to examining [data](data.md) -- like probability it is a mathematical discipline, however it is applied (rather than purely theoretical) and stresses inductive reasoning, i.e. it works "in the other direction" than probability theory; statistics starts with having some data and then tries to find a probabilistic model that would likely produce such data, potentially revealing what system really lies underneath the data.
- **[stochasticity](stochastic.md)**: Basically mathematics that deals with randomness and probability in some way, the term is often used as an attribute of a mathematical model, i.e. stochastic model is that which is somehow described in terms of probabilities.
- **[entropy](entropy.md)**: A measure related to randomness, saying how much [information](information.md) (in [bits](bit.md)) we can extract from given message -- the higher the randomness (unpredictability), the higher the entropy because this randomness may be used to carry information.
Keep in mind **there are different "amounts" of randomness**, i.e. you should consider that **[probability distributions](probability_distribution.md)** exist and that some processes may be random only a little. It is not like there are only completely predictable and completely unpredictable systems, oftentimes we just have some small elements of chance or can at least estimate which outcomes are more likely. We see absolute randomness (i.e. complete unpredictability) only with uniform probability distribution, i.e. in variables in which all outcomes are equally likely -- for example rolling a dice. However in real life variables some values are usually more likely than others -- e.g. with adult human male height values such as 175 cm will be much more common than 200 cm; great many real life values actually have [normal distribution](normal_distribution.md) -- the one in which values around some center value are most common.
**What do random numbers look like?** This is a tricky question. Let's now consider uniform probability distribution, i.e. "absolute randomness". When we see sequences of numbers such as `1, 2, 3, 4, 5, 6, 7`, `0, 0, 0, 0, 0, 0, 0, 0` or `9, 1, 4, 3, 9, 1, 5`, which are "random" and which not? Intuitively we would say the first two are not random because there is a clear pattern, while the third one looks pretty random. However consider that under our assumption of uniform probability distribution all of these sequences are equally likely to occur! It is just that there are only very few sequences in which we recognize a common pattern compared to those that look to have no pattern, so we much more commonly see these sequences without a pattern coming out of random number generators and therefore we think the first two patterns are very unlikely to have come from a random source. Indeed they are, but the third, "random looking" sequence is equally unlikely (if you bet the numbers in lottery, you are still very unlikely to win), it just has great many weird looking siblings. You have to be careful, things around probability are great many times very unintuitive and tricky (see e.g. the famous [Monty Hall problem](monty_hall.md)).
Of course we cannot say just from the sequence alone if it was generated randomly or not, the sequences above may have been generated by true randomness or by pseudorandom generator -- we even see this is kind of stupid to ask. We should rather think about what we actually mean by asking whether the sequence is "random" -- to get meaningful answers we have to specify this first. If we formulate the question precisely, we may get precise answers. Sometimes we are looking for lack of patterns -- this can be tested by programs that look for patterns, e.g. [compression](compression.md) programs; number sequences that have regularities in them can be compressed well. We may examine the sequences [entropy](entropy.md) to say something about its "randomness". Mathematicians often like to ask "how likely is it that a sequence with these properties was generated by this model?", i.e. for example listening to signals from space and capturing some numeric sequence, we may compute its properties such as distribution of values in it and then we ask how likely is it that such sequence was generated by some natural source such exploding star or black hole? If we conclude this is very unlikely, we may say the signal was probably not generated randomly and may e.g. come from intelligent lifeforms.
TODO: moar
## Randomness Tests
TODO
One of the most basic is the **[chi-squared test](chi_squared_test.md)** whose description can be found e.g. in the *Art of Computer Programming* book. TODO
{ The following is a method I came up with wrote about here (includes some code): https://codeberg.org/drummyfish/my_writings/src/branch/master/randomness.md, I haven't found what this is called, it probably already exists. If you know what this method is called, please send me a mail. ~drummyfish }
**Cool randomness test**: this test tries to measure the unpredictability, the inability to predict what binary digit will follow. As an input to the test we suppose a binary sequence *S* of length *N* bits that's repeating forever (for example for *N = 2* a possible sequence is 10 meaning we are really considering an infinite sequence 1010101010...). We suppose an observer knows the sequence and that it's repeating (consider he has for example been watching us broadcast it for a long time and he noticed we are just repeating the same sequence over and over), then we ask: if the observer is given a random (and randomly long) subsequence *S2* of the main sequence *S*, what's the average probability he can correctly predict the bit that will follow? This average probability is our measured randomness *r* -- the lower the *r*, the "more random" the sequence *S* is according to this test. For different *N* there are different minimum possible values of *r*, it is for example not possible to achieve *r < 0.7* for *N = 3* etc. The following table shows this test's most random sequences for given *N*, along with their count and *r*.
| seq. len. | most random looking sequences |count| min. r |
| --------- | --------------------------------------------------------------------------------------------- | --- | ------ |
| 1 | 0, 1 | 2 | 1.00 |
| 2 | 01, 10 | 2 | 0.50 |
| 3 | 001, 010, 011, 100, 101, 110 | 6 | ~0.72 |
| 4 | 0011, 0110, 1001, 1100 | 4 | ~0.78 |
| 5 | 00101, 01001, 01010, 01011, 01101, 10010, 10100, 10101, 10110, 11010 | 10 | ~0.82 |
| 6 | 000101, 001010, 010001, 010100, 010111, 011101, 100010, 101000, 101011, 101110, 110101, 111010| 12 | ~0.86 |
| 7 | 0001001, 0010001, 0010010, 0100010, 0100100, 0110111, 0111011, 1000100, 1001000, 1011011, ... | 14 | ~0.88 |
| 8 | 00100101, 00101001, 01001001, 01001010, 01010010, 01011011, 01101011, 01101101, 10010010, ... | 16 | ~0.89 |
| 9 | 000010001, 000100001, 000100010, 001000010, 001000100, 010000100, 010001000, 011101111, ... | 18 | ~0.90 |
| 10 | 0010010101, 0010101001, 0100100101, 0100101010, 0101001001, 0101010010, 0101011011, ... | 20 | ~0.91 |
| 11 | 00010001001, 00010010001, 00100010001, 00100010010, 00100100010, 01000100010, 01000100100, ...| 22 | ~0.92 |
| 12 | 001010010101, 001010100101, 010010100101, 010010101001, 010100101001, 010100101010, ... | 24 | ~0.92 |
| 13 | 0010010100101, 0010100100101, 0010100101001, 0100100101001, 0100101001001, 0100101001010, ... | 26 | ~0.93 |
| ... | ... | ... | ... |
## Truly Random Sequence Example
WORK IN PROGRESS { Also I'm not too good at statistics lol. ~drummyfish }
Here is a sequence of 1000 bits which we most definitely could consider truly random as it was generated by physical coin tosses:
{ The method I used to generate this: I took a plastic bowl and 10 coins, then for each round I threw the coins into the bowl, shook them (without looking, just in case), then rapidly turned it around and smashed it against the ground. I took the bowl up and wrote the ten generated bits by reading the coins kind of from "top left to bottom right" (heads being 1, tails 0). ~drummyfish }
```
00001110011101000000100001011101111101010011100011
01001101110100010011000101101001000010111111101110
10110110100010011011010001000111011010100100010011
11111000111011110111100001000000001101001101010000
11111111001000111100100011010110001011000001001000
10001010111110100111110010010101001101010000101101
10110000001101001010111100100100000110000000011000
11000001001111000011011101111110101101111011110111
11010001100100100110001111000111111001101111010010
10001001001010111000010101000100000111010110011000
00001010011100000110011010110101011100101110110010
01010010101111101000000110100011011101100100101001
00101101100100100101101100111101001101001110111100
11001001100110001110000000110000010101000101000100
00110111000100001100111000111100011010111100011011
11101111100010111000111001010110011001000011101000
01001111100101001100011100001111100011111101110101
01000101101100010000010110110000001101001100100110
11101000010101101111100111011011010100110011110000
10111100010100000101111001111011010110111000010101
```
Let's now take a look at how random the sequence looks, i.e. basically how likely it is that by generating random numbers by tossing a coin will give us a sequence with statistical properties (such as the ratio of 1s and 0s) that our obtained sequence has.
There are **494 1s and 506 0s**, i.e. the ratio is approximately 0.976, deviating from 1.0 (the value that infinitely many coin tosses should converge to) by only 0.024. We can use the [binomial distribution](binomial_distribution.md) to calculate the "rarity" of getting this deviation or higher one; here we get about 0.728, i.e. a pretty high probability, meaning that if we perform 1000 coin tosses like the one we did, we may expect to get the deviation we got or higher in more than 70% of cases (if on the other hand we only got e.g. 460 1s, this probability would be only 0.005, suggesting the coins we used weren't fair). If we take a look at how the ratio (rounded to two fractional digits) evolves after each round of performing additional 10 coin tosses, we see it gets pretty close to 1 after only about 60 tosses and stabilizes quite nicely after about 100 tosses: 0.67, 0.54, 0.67, 0.90, 0.92, 1.00, 0.94, 0.90, 0.88, 1.00, 1.04, 1.03, 0.97, 1.00, 0.97, 1.03, 1.10, 1.02, 0.98, 0.96, 1.02, 1.02, 1.02, 1.00, 0.95, 0.95, 0.99, 0.99, 0.99, 0.97, 0.95, 0.95, 0.96, 0.93, 0.90, 0.88, 0.90, 0.93, 0.95, 0.98, 0.98, 0.97, 0.97, 0.99, 1.00, 0.98, 0.98, 0.98, 0.97, 0.96, 0.95, 0.94, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, 0.97, 0.97, 0.95, 0.94, 0.93, 0.93, 0.93, 0.94, 0.94, 0.94, 0.96, 0.95, 0.96, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96, 0.97, 0.97, 0.96, 0.96, 0.95, 0.95, 0.95, 0.96, 0.97, 0.97, 0.97, 0.97, 0.96, 0.97, 0.98, 0.98.
Let's try the [chi-squared test](chi_squared_test.md) (the kind of basic "randomness" test): *D = (494 - 500)^2 / 500 + (506 - 500)^2 / 500 = 0.144*; now in the table for the chi square distribution for 1 degree of freedom (i.e. two categories, 0 and 1, minus one) we see this value of *D* falls somewhere around 30%, which is not super low but not very high either, so we can see the test doesn't invalidate the hypothesis that we got numbers from a uniform random number generator. { I did this according to Knuth's *Art of Computer Programming* where he performed a test with dice and arrived at a number between 25% and 50% which he interpreted in the same way. For a scientific paper such confidence would of course be unacceptable because there we try to "prove" the validity of our hypothesis. Here we put much lower confidence level as we're only trying not fail the test. To get a better confidence we'd probably have to perform many more than 1000 tosses. ~drummyfish }
We can try to convert this to a sequence of integers of different binary sizes and just "intuitively" see if the sequences still looks random, i.e. if there are no patterns such as e.g. the numbers only being odd or the histograms of the sequences being too unbalanced, we could also possibly repeat the chi-squared test etc.
The sequence as 100 10 bit integers (numbers from 0 to 1023) is:
```
57 832 535 501 227 311 275 90 267 1006
730 155 273 874 275 995 759 528 52 848
1020 572 565 556 72 555 935 805 309 45
704 842 969 24 24 772 963 479 695 759
838 294 241 998 978 548 696 337 29 408
41 774 429 370 946 330 1000 104 886 297
182 293 719 308 956 806 398 12 84 324
220 268 911 107 795 958 184 917 612 232
318 332 451 911 885 278 784 364 52 806
929 367 630 851 240 753 261 926 859 533
```
As 200 5 bit integers (numbers from 0 to 31):
```
1 25 26 0 16 23 15 21 7 3 9 23 8 19 2 26 8 11 31 14
22 26 4 27 8 17 27 10 8 19 31 3 23 23 16 16 1 20 26 16
31 28 17 28 17 21 17 12 2 8 17 11 29 7 25 5 9 21 1 13
22 0 26 10 30 9 0 24 0 24 24 4 30 3 14 31 21 23 23 23
26 6 9 6 7 17 31 6 30 18 17 4 21 24 10 17 0 29 12 24
1 9 24 6 13 13 11 18 29 18 10 10 31 8 3 8 27 22 9 9
5 22 9 5 22 15 9 20 29 28 25 6 12 14 0 12 2 20 10 4
6 28 8 12 28 15 3 11 24 27 29 30 5 24 28 21 19 4 7 8
9 30 10 12 14 3 28 15 27 21 8 22 24 16 11 12 1 20 25 6
29 1 11 15 19 22 26 19 7 16 23 17 8 5 28 30 26 27 16 21
```
Which has the following histogram:
```
number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
count: 6 6 3 6 5 5 7 5 11 10 7 6 7 3 4 5
number: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
count: 7 9 3 5 4 8 7 8 9 4 8 6 8 6 6 6
```
And as 250 4 bit integers (numbers from 0 to 15):
```
0 14 7 4 0 8 5 13 15 5 3 8 13 3 7 4 4 12 5 10 4 2 15 14 14
11 6 8 9 11 4 4 7 6 10 4 4 15 14 3 11 13 14 1 0 0 13 3 5 0
15 15 2 3 12 8 13 6 2 12 1 2 2 2 11 14 9 15 2 5 4 13 4 2 13
11 0 3 4 10 15 2 4 1 8 0 6 3 0 4 15 0 13 13 15 10 13 14 15 7
13 1 9 2 6 3 12 7 14 6 15 4 10 2 4 10 14 1 5 1 0 7 5 9 8
0 10 7 0 6 6 11 5 7 2 14 12 9 4 10 15 10 0 6 8 13 13 9 2 9
2 13 9 2 5 11 3 13 3 4 14 15 3 2 6 6 3 8 0 12 1 5 1 4 4
3 7 1 0 12 14 3 12 6 11 12 6 15 11 14 2 14 3 9 5 9 9 0 14 8
4 15 9 4 12 7 0 15 8 15 13 13 5 1 6 12 4 1 6 12 0 13 3 2 6
14 8 5 6 15 9 13 11 5 3 3 12 2 15 1 4 1 7 9 14 13 6 14 1 5
```
This has the following histogram:
```
number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
count: 18 14 19 18 23 15 18 11 11 14 9 10 13 20 18 19
```
Another way to test data randomness may be by **trying to [compress](compression.md) it**, since compression is basically based on removing regularities, redundancy, leaving only randomness. A compression algorithm exploits [correlations](correlation.md) in input data and removes that which can later be reasoned out from what's left, but with a completely random data nothing should be correlated, it shouldn't be possible to reason out parts of such data from other parts of that data, hence compression can remove nothing and it shouldn't generally be possible to compress completely random data (though of course there exists a non-zero probability that in rare cases random data will have regular structure and we will be able to compress it). Let us try to perform this test with the `lz4` compression utility -- we convert our 1000 random bits to 125 random bytes and try to compress them. Then we will try to compress another sequence of 125 bytes, this time a non-random one -- a repeated alphabet in ASCII (`abcdefghijklmnopqrstuvwxyzabcdef...`). Here are the results:
| sequence (125 bytes) | compressed size |
| -------------------- | ---------------- |
| our random bits | 144 (115.20%) |
| `abcdef...` | 56 (44.80%) |
We see that while the algorithm was able to compress the non-random sequence to less than a half of the original size, it wasn't able to compress our data, it actually made it bigger! This suggests the data is truly random. Of course it would be good to test multiple compression algorithms and see if any one of them finds some regularity in the data, but the general idea has been presented.