Keep in mind **there are different "amounts" of randomness** -- that is to say you should consider that **[probability distributions](probability_distribution.md)** exist and that some processes may be random only a little. It is not like there are only completely predictable and completely unpredictable systems, oftentimes we just have some small element of chance or can at least estimate which outcomes are more likely. We see absolute randomness (i.e. complete unpredictability) only with uniform probability distribution, i.e. in variables in which all outcomes are equally likely -- for example rolling a die. However in real life variables some values are usually more likely than others -- e.g. with adult human male height values such as 175 cm will be much more common than 200 cm; a great many real life values actually have [normal distribution](normal_distribution.md) -- the one in which values around some center value are most common.
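
To get a feel for this difference, consider the following small C sketch (purely illustrative, using the language's built-in *rand()*, which is itself only pseudorandom, and a made up bin count) that plots a histogram of a uniform variable next to one of an approximately normal variable, the latter created simply by averaging several uniform values (this works thanks to the central limit theorem):

```
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 100000
#define BINS 16

int main(void)
{
  int uniformHist[BINS] = {0}, normalHist[BINS] = {0};

  srand(123); // fixed seed so that the experiment is repeatable

  for (int i = 0; i < SAMPLES; ++i)
  {
    // uniform: every bin should end up roughly equally full
    uniformHist[rand() % BINS]++;

    // approximately normal: average of several uniform values
    // (central limit theorem), middle bins will dominate
    int sum = 0;

    for (int j = 0; j < 8; ++j)
      sum += rand() % BINS;

    normalHist[sum / 8]++;
  }

  for (int i = 0; i < BINS; ++i)
  {
    printf("bin %2d: uniform ", i);

    for (int j = 0; j < uniformHist[i] / 1000; ++j)
      putchar('*');

    printf(" | normal ");

    for (int j = 0; j < normalHist[i] / 1000; ++j)
      putchar('*');

    putchar('\n');
  }

  return 0;
}
```

The first histogram comes out roughly flat while the second one bulges around the middle bins.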
**What do random numbers look like?** This is a tricky question. Let's now consider uniform probability distribution, i.e. "absolute randomness". When we see sequences of numbers such as [1, 2, 3, 4, 5, 6, 7], [0, 0, 0, 0, 0, 0, 0, 0] or [9, 1, 4, 7, 8, 1, 5], which are "random" and which not? Intuitively we would say the first two are not random because there is a clear pattern, while the third one looks pretty random. However consider that under our assumption of uniform probability distribution all of these sequences are equally likely to occur! It is just that there are only very few sequences in which we recognize a common pattern compared to those that look to have no pattern, so we much more commonly see sequences without a pattern come out of random number generators and therefore we judge the first two sequences very unlikely to have come from a random source. Indeed they are, but the third, "random looking" sequence is equally unlikely (if you bet these numbers in a lottery, you are still very unlikely to win), it just has a great many weird looking siblings. You have to be careful, things around probability are very often unintuitive and tricky (see e.g. the famous [Monty Hall problem](monty_hall.md)). **Humans are bad at creating "random" sequences**, or perhaps better said: when you ask someone to come up with a sequence of "random" numbers, it will be a very predictable one; there are many famous demonstrations of this, for example humans tend to produce homogeneous sequences of bits without longer streaks of 1s and 0s (and such sequences are quite unlikely to appear randomly). So never try to create your own pseudorandom sequence by randomly pressing numbers on the keyboard. The thing confusing to humans is that randomness is actually NOT a complete absence of patterns, we will sometimes spot familiar patterns in random sequences (for example hearing voices in white noise), but these patterns themselves emerge randomly, there is no way to predict WHICH pattern familiar to our brain will appear.
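
The point about streaks can even be verified by brute force -- the following C sketch goes through all 65536 sequences of 16 bits and records the longest run of equal bits in each; only about 5% of all sequences avoid runs longer than 2 (the kind of overly homogeneous sequence humans tend to produce), so a typical random sequence does contain longer streaks:

```
#include <stdio.h>

#define BITS 16

int main(void)
{
  int runHist[BITS + 1] = {0};

  // go over every possible sequence of BITS bits
  for (unsigned int s = 0; s < (1u << BITS); ++s)
  {
    int longest = 1, current = 1;

    for (int i = 1; i < BITS; ++i)
    {
      if (((s >> i) & 1) == ((s >> (i - 1)) & 1))
        current++;
      else
        current = 1;

      if (current > longest)
        longest = current;
    }

    runHist[longest]++; // record the longest run of equal bits
  }

  for (int i = 1; i <= BITS; ++i)
    printf("longest run %2d: %d sequences\n", i, runHist[i]);

  return 0;
}
```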
Of course we cannot say just from the sequence alone whether it was generated randomly or not, the sequences above may have been generated by true randomness or by a pseudorandom generator -- we even see this is sort of a stupid thing to ask. We should rather think about what we actually mean by asking whether the sequence is "random" -- to get meaningful answers we have to specify this first. If we formulate the question precisely, we may get precise answers. Sometimes we are looking for lack of patterns -- this can be tested by programs that look for patterns, e.g. [compression](compression.md) programs; number sequences that have regularities in them can be compressed well. We may examine the sequence's [entropy](entropy.md) to say something about its "randomness". Mathematicians often like to ask "how likely is it that a sequence with these properties was generated by this model?", i.e. for example when listening to signals from space and capturing some numeric sequence, we may compute its properties such as the distribution of values in it and then ask how likely it is that such a sequence was generated by some natural source such as an exploding star or a black hole. If we conclude this is very unlikely, we may say the signal was probably not generated randomly and may e.g. come from intelligent lifeforms.
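
As an example of the entropy approach, the following C sketch estimates entropy from symbol frequencies (note this is only the zeroth order estimate -- it sees how balanced the symbol counts are but is blind to patterns in their ordering, so by itself it can never be the whole test):

```
#include <stdio.h>
#include <math.h> // for log2(), compile with -lm

// estimates Shannon entropy (bits per symbol) from symbol frequencies
double entropy(const unsigned char *data, int len)
{
  int counts[256] = {0};
  double e = 0;

  for (int i = 0; i < len; ++i)
    counts[data[i]]++;

  for (int i = 0; i < 256; ++i)
    if (counts[i])
    {
      double p = counts[i] / (double) len;
      e -= p * log2(p);
    }

  return e;
}

int main(void)
{
  printf("%f\n", entropy((const unsigned char *) "aaaaaaaa", 8)); // 0.0
  printf("%f\n", entropy((const unsigned char *) "abcdabcd", 8)); // 2.0
  return 0;
}
```

E.g. the clearly patterned string *abcdabcd* still scores the maximum 2 bits per symbol here, which is exactly why frequency based tests alone don't suffice.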

Let's now take a look at how random the sequence looks, i.e. basically how likely it is that it was generated by a truly random process.

There are **494 1s and 506 0s**, i.e. the ratio is approximately 0.976, deviating from 1.0 (the value that infinitely many coin tosses should converge to) by only 0.024. We can use the [binomial distribution](binomial_distribution.md) to calculate the "rarity" of getting this deviation or a higher one; here we get about 0.728, i.e. a pretty high probability, meaning that if we repeat runs of 1000 coin tosses like the one we did, we may expect to get this deviation or a higher one in more than 70% of cases (if on the other hand we only got e.g. 460 1s, this probability would be only 0.005, suggesting the coin we used wasn't fair). If we take a look at how the ratio (rounded to two fractional digits) evolves after each additional round of 10 coin tosses, we see it gets pretty close to 1 after only about 60 tosses and stabilizes quite nicely after about 100 tosses: 0.67, 0.54, 0.67, 0.90, 0.92, 1.00, 0.94, 0.90, 0.88, 1.00, 1.04, 1.03, 0.97, 1.00, 0.97, 1.03, 1.10, 1.02, 0.98, 0.96, 1.02, 1.02, 1.02, 1.00, 0.95, 0.95, 0.99, 0.99, 0.99, 0.97, 0.95, 0.95, 0.96, 0.93, 0.90, 0.88, 0.90, 0.93, 0.95, 0.98, 0.98, 0.97, 0.97, 0.99, 1.00, 0.98, 0.98, 0.98, 0.97, 0.96, 0.95, 0.94, 0.95, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.95, 0.96, 0.96, 0.97, 0.97, 0.97, 0.95, 0.94, 0.93, 0.93, 0.93, 0.94, 0.94, 0.94, 0.96, 0.95, 0.96, 0.96, 0.95, 0.96, 0.95, 0.95, 0.96, 0.97, 0.97, 0.96, 0.96, 0.95, 0.95, 0.95, 0.96, 0.97, 0.97, 0.97, 0.97, 0.96, 0.97, 0.98, 0.98.
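
The 0.728 figure can be reproduced for instance with this C sketch, which sums the exact binomial probabilities of all outcomes deviating less than ours (i.e. 495 to 505 1s) and takes the complement:

```
#include <stdio.h>
#include <math.h> // for lgamma(), compile with -lm

// log of the binomial coefficient C(n, k)
double logBinomial(int n, int k)
{
  return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1);
}

int main(void)
{
  int n = 1000;
  double pInside = 0; // probability of getting 495 to 505 1s

  for (int k = 495; k <= 505; ++k)
    pInside += exp(logBinomial(n, k) - n * log(2));

  printf("P(deviation >= 6) = %f\n", 1 - pInside); // about 0.728
  return 0;
}
```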
Next we'll take a look at **streaks**, i.e. uninterrupted runs of the same value (0 or 1). Here are the results (*0s*, *1s* and *total* are the counts of streaks of each length, *0s first* and *1s first* are the positions at which a streak of that length first occurs):

| streak len. | 0s  | 0s first | 1s  | 1s first | total |
| ----------- | --- | -------- | --- | -------- | ----- |
| 1           | 122 | 12       | 125 | 13       | 247   |
| 2           | 62  | 7        | 65  | 48       | 127   |
| 3           | 34  | 45       | 27  | 4        | 61    |
| 4           | 16  | 0        | 17  | 162      | 33    |
| 5           | 8   | 238      | 10  | 31       | 18    |
| 6           | 4   | 14       | 3   | 375      | 7     |
| 7           | 2   | 497      | 2   | 88       | 4     |
| 8           | 2   | 176      | 1   | 200      | 3     |

At first glance all looks fine, the streak counts are very similar for 1s and 0s and the counts smoothly decrease with streak length, we see no jumps or other red flags in the distribution. More rigorously we should of course calculate expected values and compare them with what we've got, but for now we'll make do with this simple check: supposedly the probabilities of seeing a streak of at least 8 and at least 9 1s in 1000 tosses are 0.86 and 0.62 respectively, which seems to check out (we've hit the very probable 0.86 case and then fell into the slightly less likely but still very plausible case of not producing a streak of 9, which has probability 0.38).
{ The total streak count seems suspiciously close to 2^(9-n), there's probably a formula but I didn't have time to check it now, TODO: investigate later. ~drummyfish }
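
For what it's worth, a simple estimate seems to explain it: a maximal streak of length *n* requires *n* equal tosses delimited by differing ones, which happens at any given position with probability 1/2^(n+1), so among N tosses we expect roughly N/2^(n+1) streaks of length *n* -- and since N = 1000 is close to 2^10 = 1024, this comes out near 2^(9-n). Counting the streaks themselves is simple, e.g. as follows (the input string here is only a short hypothetical placeholder, in practice we'd feed in the whole toss sequence):

```
#include <stdio.h>
#include <string.h>

#define MAX_LEN 16

int main(void)
{
  // hypothetical short input, in practice paste the whole toss sequence
  const char *bits = "0110100110010110100101100110100110010110011010010110";

  int n = strlen(bits), counts[MAX_LEN + 1] = {0}, run = 1;

  for (int i = 1; i <= n; ++i)
    if (i < n && bits[i] == bits[i - 1])
      run++; // current streak continues
    else
    {
      if (run <= MAX_LEN)
        counts[run]++; // streak ended, record its length

      run = 1;
    }

  for (int len = 1; len <= MAX_LEN; ++len)
    if (counts[len])
      printf("length %2d: %3d streaks (expecting about %.1f)\n",
        len, counts[len], n / (double) (1 << (len + 1)));

  return 0;
}
```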
Let's try the [chi-squared test](chi_squared_test.md) (the kind of basic "randomness" test): *D = (494 - 500)^2 / 500 + (506 - 500)^2 / 500 = 0.144*; now in the table of the chi-squared distribution for 1 degree of freedom (i.e. two categories, 0 and 1, minus one) we see this value of *D* falls somewhere around 30%, which is not super low but not very high either, so we can say the test doesn't invalidate the hypothesis that we got the numbers from a uniform random number generator. { I did this according to Knuth's *Art of Computer Programming* where he performed a test with dice and arrived at a number between 25% and 50%, which he interpreted in the same way. For a scientific paper such confidence would of course be unacceptable because there we try to "prove" the validity of our hypothesis; here we use a much lower confidence level as we're only trying not to fail the test. To get better confidence we'd probably have to perform many more than 1000 tosses. ~drummyfish }
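
The statistic itself is trivial to compute for any number of categories, e.g. with this minimal C sketch (here reusing our observed counts):

```
#include <stdio.h>

// chi-squared statistic for observed vs expected category counts
double chiSquared(const int *observed, const double *expected, int categories)
{
  double d = 0;

  for (int i = 0; i < categories; ++i)
  {
    double diff = observed[i] - expected[i];
    d += diff * diff / expected[i];
  }

  return d;
}

int main(void)
{
  int observed[2] = {494, 506};    // 1s and 0s we counted
  double expected[2] = {500, 500}; // fair coin expectation

  printf("D = %f\n", chiSquared(observed, expected, 2)); // prints 0.144
  return 0;
}
```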
We can also try to convert the bits to sequences of integers of different binary sizes and just "intuitively" check whether those sequences still look random, i.e. whether there are no patterns such as the numbers all being odd or the histograms of the sequences being too unbalanced; we could also possibly repeat the chi-squared test on them etc.
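
A possible C sketch of such a conversion and histogram (the input is again only a short hypothetical placeholder):

```
#include <stdio.h>
#include <string.h>

// splits a string of ASCII bits into groups of groupSize bits (at most 8)
// and prints the histogram of the resulting integers
void histogram(const char *bits, int groupSize)
{
  int counts[256] = {0}, n = strlen(bits);

  for (int i = 0; i + groupSize <= n; i += groupSize)
  {
    int value = 0;

    for (int j = 0; j < groupSize; ++j)
      value = (value << 1) | (bits[i + j] == '1');

    counts[value]++;
  }

  for (int i = 0; i < (1 << groupSize); ++i)
    printf("%d: %d\n", i, counts[i]);
}

int main(void)
{
  // hypothetical short input, in practice use the whole toss sequence
  histogram("0110100110010110100101100110100110010110", 2);
  return 0;
}
```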

Another way to test data randomness may be by **trying to [compress](compression.md) it**, since compression algorithms work precisely by finding and exploiting regularities in data:

| data            | compressed size in bytes (% of original) |
| --------------- | ---------------------------------------- |
| our random bits | 144 (115.20%)                             |
| `abcdef...`     | 56 (44.80%)                               |

We see that while the algorithm was able to compress the non-random sequence to less than a half of the original size, it wasn't able to compress our data, it actually made it bigger! This suggests the data is truly random. Of course it would be good to test multiple compression algorithms and see if any one of them finds some regularity in the data, but the general idea has been presented.
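
The particular compression algorithm isn't shown here, but the idea can be demonstrated even with something as primitive as run length encoding -- a crude C sketch (assuming the data comes as an ASCII string of bits and that no run is longer than 255, so each run fits in two output bytes):

```
#include <stdio.h>
#include <string.h>

// crude run length encoding: two output bytes (count plus symbol) per
// run; if the output isn't smaller than the input, the data has few
// runs to exploit, hinting at randomness
int rleSize(const char *data)
{
  int n = strlen(data), size = 0, run = 1;

  for (int i = 1; i <= n; ++i)
    if (i < n && data[i] == data[i - 1])
      run++;
    else
    {
      size += 2; // one byte for the run length, one for the symbol
      run = 1;
    }

  return size;
}

int main(void)
{
  printf("%d\n", rleSize("00000000001111111111")); // compresses well: 4
  printf("%d\n", rleSize("01101001100101101001")); // doesn't: 28
  return 0;
}
```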
## See Also
- [pseudorandomness](pseudorandomness.md)