Update

2024-05-27 22:48:10 +02:00 · 2024-05-27 22:48:10 +02:00 · 6b6fe66cd4
commit 6b6fe66cd4
parent 412b7489cc
13 changed files with 1849 additions and 1789 deletions
--- a/pseudorandomness.md
+++ b/pseudorandomness.md
@ -2,6 +2,8 @@

 Pseudorandom data is [data](data.md) that appears (for example by its statistical properties) to have been generated by a [random](random.md) process despite in fact having been generated by a [deterministic](determinism.md) (i.e. non-random) process. I.e. it's a kind of "fake" but mostly [good enough](good_enough.md) randomness that arises from [chaotic](chaos.md) systems -- systems that behave without randomness, by strict rules, but which scramble, shuffle, twist and transform the numbers in such a complicated way that they eliminate obvious patterns and leave the data looking very "random", though the numbers would be scrambled exactly the same way if the process was repeated with the same conditions, i.e. it is possible (if we know how the generator works) to exactly predict which numbers will fall out of a pseudorandom generator. This is in contrast to "true randomness" that (at least to what most physicists probably think) appears in some processes in nature (most notably in [quantum physics](quantum.md)) and which are absolutely unpredictable, even in theory. Pseudorandomness is typically used to emulate true randomness in [computers](computer.md) because for many things ([games](game.md), [graphics](graphics.md), audio,  random sampling, ...) it is absolutely sufficient, it is easy to do AND the repeatability of a pseudorandom sequence is actually an advantage to engineers, e.g. in [debugging](debugging.md) in which we have to replicate bugs we find, or in programs that simply have to behave deterministic (e.g. many network games). True randomness is basically only ever needed for [cryptography](cryptography.md)/[security](security.md) (or possibly for rare applications where we absolutely NEED to ensure lack of any patterns in the data), it is a bit harder to achieve because we need some unbiased source of real-world random data. Pseudorandom generators are so common that in most contexts in programming the word "random" silently actually means just "pseudorandom".

+A saying about psedorandom numbers states that "randomness is a task too important to be left to chance".
+
 ## How It Works

 Firstly let's mention that we can use [look up tables](lut.md), i.e. embed some high quality random data right into our program and then use that as our random numbers, taking one after another and getting back to start once we run out of them. This is for example how [Doom](doom.md)'s pseudorandom generator worked. This is easy to do and extremely fast, but will take up some memory and will offer only a quite limited sized sequence (your generator will have a short period), so ponder on the pros and cons for your specific needs. From now on we'll leave this behind and will focus on really GENERATING the pseudorandom values with some [algorithm](algorithm.md), but look up tables may still be kept in mind (they might even perhaps be somehow combined with the true generators).
@ -16,7 +18,7 @@ The number of bits that the generator takes from its internal number and gives y

 Now let's realize another important thing -- if the generator has some internal number, which is the only thing that determines the next number, and if its internal number has some limited size -- let's say for example 32 bits -- then the sequence HAS TO start repeating sometimes because there is a limited number of values the internal number can be in and once we get to the same number, it will have to evolve the same way it evolved before (because we have a deterministic generator and the number is the whole generator's state). Imagine this as a [graph](graph.md): numbers are nodes, the seed is the node we start in, there are finitely many nodes and each one leads to exactly one other node -- going through this graph you inevitably have to end up running in some loop. For this reason we talk about the **period** of a pseudorandom generator -- this period says after how many values the sequence will start to repeat. In fact it is possible that only last *N* values of the initial sequence will start to repeat -- again, if you imagine the graph, it is possible that an initial path leads us to some smaller loop in which we then keep cycling. This may depend on the seed, so the whole situation can get a bit messy, but we can resolve this, just hold on.

-It's not hard to see that the period of the generator can be at most 2 to the power of the number of bits of the generator's internal value (i.e. the number of possible distinct values the number can be, or the nodes in the graph). **We want to achieve this maximum period** -- indeed, it is ideal if we can make it as long as possible, but achieving the maximum period will also mean the period won't depend on seed. If you imagine the graph, having a big loop over all the values means that there are no other loops, there's just one long repeating sequence in which each internal value appears exactly once, so no matter where we start (which seed we set), we'll always end up being in the same big loop.
+It's not hard to see that the period of the generator can be at most 2 to the power of the number of bits of the generator's internal value (i.e. the number of possible distinct values the number can be, or the nodes in the graph). **We want to achieve this maximum period** -- indeed, it is ideal if we can make it as long as possible, but achieving the maximum period will also mean the period won't depend on the initial seed! If you imagine the graph, having a big loop over all the values means that there are no other loops, there's just one long repeating sequence in which each internal value appears exactly once, so no matter where we start (which seed we set), we'll always end up being in the same big loop. In addition to this we ALSO get another awesome thing: the histogram (the count) of all values over the whole period will be absolutely uniform, i.e. every value generated during one period will appear exactly the same number of times (which is what we expect from a completely random, uniform generator) -- this can be seen from the fact that we are returning *N* bits of some bigger internal number of *N + M* bits, which will come through each possible value exactly once, so each possible value of *N* will have to appear and each of these values will have to appear with all possible values of the remaining *M* bits, which will be the same for all values.

 Now let's take a look at specific generators, i.e. types of algorithms to generate the numbers. There are many different kinds, for example [Mersenne Twister](mersenne_twister.md), middle square etc., however probably the **most common type** is the **[linear congruential generator](lcg.md)** (LCG) -- though for many decades now it's been known this type of generator has some issues (for example less significant bits have shorter and shorter periods, for which we usually want to use a very big internal value and return its highest bits as the result), it will likely suffice for most of your needs, but you have to be careful about choosing the generator's parameters correctly. It works like this: given a current (internal) number *x[n]* (which is initially set to the seed number), the next number in the sequence is computed as

@ -56,8 +58,55 @@ Here `T` is the data type of the internal number (implying the *M* constant) --

 { I pulled the above numbers from various sources I found, mentioned in the note, tried to select the ones that were allegedly good, I also quickly tested them myself, checked the period was at maximum at least for the 32 bit generators and lower. ~drummyfish }

+Let's also quickly mention **another kind of generator** as an alternative -- the *middle square plus Weyl sequence* generator. Middle square generator was one of the first and is very simple, it simply starts with a number (seed), squares it, takes its middle digits as the next number, squares it, takes its middle digits and so on. The issue with this was mainly getting a number 0, at which we get stuck. A 2022 paper by *Wydinski* seems to have solved this issue by employing so called *Weyl* sequence -- basically just adding some odd number in each step, though the theory is a bit more complex, the paper goes on to prove a high period of this generator. An issue seems to be with seeding the sequence -- the generator has three internal numbers and they can't just be blindly set to "anything" (the paper gives some hints on how to do this). Here is a 32 bit variant of such generator (the paper gives a 64 bit one):
+
+{ I tried to make a 32 bit version of the generator, tried to choose the `_rand3` constant well -- after quickly testing this the values of the generator looked alright, though I just eyeballed the numbers, each bit separately, checked the mean of some 4000 values and the histogram of 1 million values. I'm not claiming this version to be statistically good, but it may be a start for implementing something nice, use at own risk. ~drummyfish }
+
+```
+#include <stdint.h>
+
+uint32_t _rand1, _rand2, _rand3 = 0x5e19dbae;
+
+uint16_t random()
+{
+  _rand2 += _rand3;
+  _rand1 = _rand1 * _rand1 + _rand2;
+  _rand1 = (_rand1 >> 16) | (_rand1 << 16);
+
+  return _rand1;
+}
+```
+
+NOTE on the code: the `(_rand1 >> 16) | (_rand1 << 16)` operation effectively makes the function return lower 16 bits of the squared number's middle digits, as multiplying `_rand1` (32 bit) by itself results in the lower half of a 64 bit result.
+
+Yet another idea might be to use some good [hash](hash.md) just on numbers 1, 2, 3, 4 ... The difference here is we are not computing the pseudorandom number from the previous one, but we're computing *N*th pseudorandom number directly from *N*. This will probably be slower. For example: { Again, no big guarantees. ~drummyfish }
+
+```
+#include <stdint.h>
+
+uint32_t _rand = 0;
+
+uint32_t random()
+{
+  uint32_t x = _rand;
+  _rand++;
+  
+  x = 303484085 * (x ^ (x >> 15));
+  x = 985455785 * (x ^ (x >> 15));
+  return x ^ (x >> 15);
+}
+
+void randomSeed(uint32_t seed)
+{
+  _rand = seed;    // this alone just offsets the sequence
+  seed = random(); // this is an attempt at fix
+}
+```
+
 **How to generate a number in certain desired range?** As said your generator will be giving you numbers of certain fixed number of bits, usually something like 16 or 32, which means you'll be getting numbers in range 0 to 2^bits - 1. But what if you want to get numbers in some specific range *A* to *B* (including both)? To do this you just need to generate a number in range 0 to *B - A* and then add *A* to it (e.g. to generate number from 20 to 30 you generate a number from 0 to 10 and add 20). So let's just suppose we want a number in range 0 to *N* (where *N* can be *B - A*). Let's now suppose *N* is lower than the upper range of our generator, i.e. that we want to get the number into a small range (if this is not the case, we can arbitrarily increase the range of our generator simply by generating more random bits with it, i.e we can join two 16 bit numbers to get a 32 bit number etc.). Now the most common way to get the number in the desired range is by using *modulo (N + 1)* operation, i.e. in [C](c.md) we simply do something like `int random0to100 = random() % 101;`. This easily forces the number we get into the range we want. HOWEVER beware, there is one statistical trap about this called the **modulo bias** that makes some numbers slightly more likely to be generated than others, i.e. it biases our distribution a little bit. For example imagine our generator gives us numbers from 0 to 15 and we want to turn it into range 0 to 10 using the modulo operator, i.e. we'll be doing *mod 11* operation -- there are two ways to get 0 (*0 mod 11* and *11 mod 11*) but only one way to get 9 (*9 mod 11*), so number 0 is twice as likely here. In practice this effect isn't so strong and in many situations we don't really mind it, but we have to be aware of this effects for the sake of cases where it may matter. If necessary, the effect can be reduced -- we may for example realize that modulo bias will be eliminated if the upper range of our generator is a multiple of the range into which we'll be converting, so we can just repeatedly generate numbers until it falls under a limit that's a highest multiple of our desired range lower than the true range of the generator.

+**What if we want [floating point](float.md)/[fixed point](fixed_point.md) numbers?** Just convert the integer result to that format somehow, for example `((float) random()) / ((float) RANDOM_MAX_VALUE)` will produce a floating point number in range 0 to 1.
+
 **How to generate other probability distributions?** Up until now we supposed a uniform probability distribution, i.e. the most random kind of generator that has an equal probability of generating any number. But sometimes we want a different distribution, i.e. we may want some numbers to be more likely to be generated than others. For this we normally start with the uniform generator and then convert the number into the new distribution. For that we may make use of the following:

 - **Averaging many uniform distributions converges to normal distribution** -- this is called *central limit theorem* and in fact works even more generally (the averaged distribution doesn't have to be uniform), but to us it's enough to know that if we want normally distributed random numbers, we can just average many uniformly distributed variables. Intuitively this makes sense -- averaging many numbers will likely be close to the mean value, it's very unlikely to be close to either end as to get an extreme average we would have to roll only numbers close to that one extreme.
@ -78,22 +127,21 @@ However the core of a pseudorandom generator is the quality of the sequence itse
 - **Check the sequence period**: You want the longest possible period of your generator, i.e. if your generator has an *N* bit internal number and you start with number *A*, you get back to *A* after *2^N* steps and no sooner -- this will ensure not only the maximum period length, but also that the period length will be the same for every starting seed! That's because in this ideal case you simply have a single cycle over all the possible internal number values.
 - **Try to [compress](compression.md) the sequence**: Truly random data should be basically impossible to compress -- you can exploit this fact and try to compress your sequence with some compression programs. It is ideal if the compression programs end up enlarging the file.
 - **Statistical tests**: Here you use objective mathematical tests -- there exist many very advanced tests, we'll only mention the very simple ones.
-  - **[Histogram](histogram.md)**: Generate many numbers (but not more than the generator's period) and make a histogram (i.e. for every number count how many times it appeared) -- all numbers should appear roughly with the same frequency. Also count 1 and 0 bits in the whole sequence -- there should be almost the same number of them. But keep in mind this only checks if you have correct frequencies of numbers, it says nothing about their distribution. Even a sequence 1, 2, 3, 4, 5, .... will pass this.
+  - **[Histogram](histogram.md)**: Generate many numbers (but not more than the generator's period) and make a histogram (i.e. for every number count how many times it appeared) -- all numbers should appear roughly with the same frequency. If you make a nice generator, you should even see exactly the same count for every value generated -- this is explained above. Also count 1 and 0 bits in the whole sequence -- again there should be about the same number of them (exactly the same if you do it correctly). But keep in mind this only checks if you have correct frequencies of numbers, it says nothing about their distribution. Even a sequence 1, 2, 3, 4, 5, .... will pass this.
  - **Averaging any non-short interval should be close to middle value**: In a random sequence it should hold that if you take any interval that's not too short -- let's say at leas 100 numbers in a row -- the average value should very likely be close to the middle value (the longer the interval, the closer it should be). You can test your sequence like this. This already takes into account even the distribution of the numbers.
  - **[Fourier transform](fourier_transform.md)** (and similar methods that give you the spectrum) -- the spectrum of the data should have equal amount of all frequencies, just like white noise.
-  - **[Correlation](correlation.md) coefficients**: You can try to compute some correlation coefficients, for example try to compute how much correlation there is between consecutive numbers (it's similar to plotting the data as coordinates and seeing if they form a line or not) -- you should ideally find no significant correlations.
+  - **[Correlation](correlation.md) coefficients**: This is kind of the real proof of randomness, ideally no values should be correlated in your data, so you can try to compute some correlation coefficients, for example try to compute how much correlation there is between consecutive numbers (it's similar to plotting the data as coordinates and seeing if they form a line or not) -- you should ideally find no significant correlations.
  - **Chi square test**: Very common test for this kind of thing, see also *poker test*.
-  - **[Monte Carlo](monte_carlo.md) tests**: Monte Carlo algorithms use random sampling to compute a certain desired value -- for example the value of [pi](pi.md). These suppose that we can sample certain space randomly, so we can exploit this -- if we know what result we want to get (for example we already know the value of pi) we can test the algorithm with our generator and see if we get the desired result -- if we come close to the desired result, we can be a bit more confident that our sampling was random (however we cannot be certain of it -- like with any testing we can only ever be certain about the presence of an error, not about the lack of it).
+  - **[Monte Carlo](monte_carlo.md) tests**: Monte Carlo algorithms use random sampling to compute a certain desired value -- for example the value of [pi](pi.md). These suppose that we can sample certain space randomly, so we can exploit this -- if we know what result we want to get (for example we already know the value of pi) we can test the algorithm with our generator and see if we get the desired result -- if we come close to the desired result, we can be a bit more confident that our sampling was random, however we cannot be certain of it -- like with any testing we can only ever be certain about the presence of an error, not about the lack of it. Even a very dense, regular grid of points would probably pass this.
  - **The cool uber randomness test** described in article on [randomness](randomness.md) ;) Basically every number (and by extension any sequence of numbers) should be equally likely to be followed by any other number.
  - For the linear congruential generators there's a so called spectral test, it seems to be the one true test for that kind of generators, make sure to do it if you're aiming for the top generator.
  - ...
 - **Test programs**: There exist programs that do the automatic tests for you, for example [ent](ent.md).
 - ...

-TODO: add some advanced generator code, e.g. the mersene twister or the middle square + Weyl
-
 ## See Also

 - [pseudo](pseudo.md)
 - [randomness](randomness.md)
- [noise](noise.md)
+- [noise](noise.md)
+- [bytebeat](bytebeat.md)