less_retarded_wiki/pseudorandomness.md

147 lines
28 KiB
Markdown
Raw Normal View History

2024-02-18 01:42:30 +01:00
# Pseudorandomness
2024-05-25 21:44:26 +02:00
Pseudorandom data is [data](data.md) that appears (for example by its statistical properties) to have been generated by a [random](random.md) process despite in fact having been generated by a [deterministic](determinism.md) (i.e. non-random) process. I.e. it's a kind of "fake" but mostly [good enough](good_enough.md) randomness that arises from [chaotic](chaos.md) systems -- systems that behave without randomness, by strict rules, but which scramble, shuffle, twist and transform the numbers in such a complicated way that they eliminate obvious patterns and leave the data looking very "random", though the numbers would be scrambled exactly the same way if the process was repeated with the same conditions, i.e. it is possible (if we know how the generator works) to exactly predict which numbers will fall out of a pseudorandom generator. This is in contrast to "true randomness" that (at least to what most physicists probably think) appears in some processes in nature (most notably in [quantum physics](quantum.md)) and which are absolutely unpredictable, even in theory. Pseudorandomness is typically used to emulate true randomness in [computers](computer.md) because for many things ([games](game.md), [graphics](graphics.md), audio, random sampling, ...) it is absolutely sufficient, it is easy to do AND the repeatability of a pseudorandom sequence is actually an advantage to engineers, e.g. in [debugging](debugging.md) in which we have to replicate bugs we find, or in programs that simply have to behave deterministic (e.g. many network games). True randomness is basically only ever needed for [cryptography](cryptography.md)/[security](security.md) (or possibly for rare applications where we absolutely NEED to ensure lack of any patterns in the data), it is a bit harder to achieve because we need some unbiased source of real-world random data. Pseudorandom generators are so common that in most contexts in programming the word "random" silently actually means just "pseudorandom".
2024-02-18 01:42:30 +01:00
2024-05-27 22:48:10 +02:00
A saying about psedorandom numbers states that "randomness is a task too important to be left to chance".
2024-05-25 21:44:26 +02:00
## How It Works
Firstly let's mention that we can use [look up tables](lut.md), i.e. embed some high quality random data right into our program and then use that as our random numbers, taking one after another and getting back to start once we run out of them. This is for example how [Doom](doom.md)'s pseudorandom generator worked. This is easy to do and extremely fast, but will take up some memory and will offer only a quite limited sized sequence (your generator will have a short period), so ponder on the pros and cons for your specific needs. From now on we'll leave this behind and will focus on really GENERATING the pseudorandom values with some [algorithm](algorithm.md), but look up tables may still be kept in mind (they might even perhaps be somehow combined with the true generators).
There are possibly many ways to approach generating pseudorandom numbers (for example you can just run some chaotic [cellular automaton](cellular_automaton.md) and then convert it to numbers somehow), however we'll be describing the most typical way, used in implementation of programming languages etc.
Pseudorandom generators generate an infinite (but in practice eventually repeating) **sequence** of numbers from some initial starting number which we call a **[seed](seed.md)** (i.e. it's important to rather think of sequences than of individual "random numbers" -- a number itself can hardly be random). For the same seed number the same sequence will always be generated (of course supposing we use the same generator). If you ask the generator for the next pseudorandom number, it will just generate the next number in the sequence and pass it to you. (In practice programs often use system time as the generator's seed number because that will make the program generate different numbers each time it is run.)
The next number in the sequence is typically only generated from the previous number by performing some chaotic operations with it that transform it into a new number. However this is not entirely true because then the generator would have the following weakness: if it generated number *A* and then *B*, then every number *A* would ALWAYS be followed by number *B* -- that doesn't look exactly random. For this reason the generator internally keeps a big number with which it operates and it only returns some *N* [bits](bit.md) (for example the highest 16 bits) of that number to you. This way it is possible (from your point of view) to have the same number be followed by different numbers (though this doesn't hold for the generator's internal value of course, but that doesn't bother us, we are only looking at the output sequence).
The number of bits that the generator takes from its internal number and gives you determines the **range** of random values that you can get. For example if the generator gives you 16 bit numbers, you know the numbers will be in range 0 to 65535. (This can be used to also generate numbers in any range but we'll look at that later.)
Now let's realize another important thing -- if the generator has some internal number, which is the only thing that determines the next number, and if its internal number has some limited size -- let's say for example 32 bits -- then the sequence HAS TO start repeating sometimes because there is a limited number of values the internal number can be in and once we get to the same number, it will have to evolve the same way it evolved before (because we have a deterministic generator and the number is the whole generator's state). Imagine this as a [graph](graph.md): numbers are nodes, the seed is the node we start in, there are finitely many nodes and each one leads to exactly one other node -- going through this graph you inevitably have to end up running in some loop. For this reason we talk about the **period** of a pseudorandom generator -- this period says after how many values the sequence will start to repeat. In fact it is possible that only last *N* values of the initial sequence will start to repeat -- again, if you imagine the graph, it is possible that an initial path leads us to some smaller loop in which we then keep cycling. This may depend on the seed, so the whole situation can get a bit messy, but we can resolve this, just hold on.
2024-05-27 22:48:10 +02:00
It's not hard to see that the period of the generator can be at most 2 to the power of the number of bits of the generator's internal value (i.e. the number of possible distinct values the number can be, or the nodes in the graph). **We want to achieve this maximum period** -- indeed, it is ideal if we can make it as long as possible, but achieving the maximum period will also mean the period won't depend on the initial seed! If you imagine the graph, having a big loop over all the values means that there are no other loops, there's just one long repeating sequence in which each internal value appears exactly once, so no matter where we start (which seed we set), we'll always end up being in the same big loop. In addition to this we ALSO get another awesome thing: the histogram (the count) of all values over the whole period will be absolutely uniform, i.e. every value generated during one period will appear exactly the same number of times (which is what we expect from a completely random, uniform generator) -- this can be seen from the fact that we are returning *N* bits of some bigger internal number of *N + M* bits, which will come through each possible value exactly once, so each possible value of *N* will have to appear and each of these values will have to appear with all possible values of the remaining *M* bits, which will be the same for all values.
2024-05-25 21:44:26 +02:00
Now let's take a look at specific generators, i.e. types of algorithms to generate the numbers. There are many different kinds, for example [Mersenne Twister](mersenne_twister.md), middle square etc., however probably the **most common type** is the **[linear congruential generator](lcg.md)** (LCG) -- though for many decades now it's been known this type of generator has some issues (for example less significant bits have shorter and shorter periods, for which we usually want to use a very big internal value and return its highest bits as the result), it will likely suffice for most of your needs, but you have to be careful about choosing the generator's parameters correctly. It works like this: given a current (internal) number *x[n]* (which is initially set to the seed number), the next number in the sequence is computed as
*x[n + 1] = (A * x[n] + C) mod M*
where *A*, *C* and *M* are constants -- these are the parameters of the generator. I.e. the next number in the sequence is computed by taking the current number, multiplying it by some constant, adding some constant and then taking modulo (remainder after division) of this by some constant. The catch is in choosing the right constants *A*, *C* and *M* -- it seems there are only a few combinations of these constants that will yield quality sequences. Furthermore we try to make the equation fast to compute, so we often aim to choose *M* to be some power of two -- if we are for example using 32 bit numbers, then choosing *M* to be 2^32 is extremely convenient, we simply let the number overflow (the overflow does the same thing as modulo by 2^32) or, at worse, perform a simple logical [and](and.md) to mask out the lowest bits. It would be cool to also have *A* a power of two because then also the multiplication would be very fast and simple (just a bit shift), BUT this can't really be done because then the multiplication wouldn't actually achieve much randomness, it would just shift the number left, meaning the higher bits of *x[n + 1]* would really be the same (or at least corelated with) the lower bits of *x[n]*.
So how to choose the *A*, *C* and *M* constants? Let's say we take *M* to be a power of two, for reasons stated above. Now it is proven (Hull-Dobell theorem) that we'll get the maximum possible period (the thing we absolutely WANT) if exactly all of the following conditions hold: *M* and *C* must have the highest common divisor 1 (i.e. they must be coprime) AND *A - 1* is divisible by all prime factors of *M* AND *A - 1* is divisible by 4 if *M* is divisible by 4. So basically we want all of these conditions to always hold -- however even if they hold, we don't necessarily have a statistically good generator yet (to ensure this see the tests below). The value of *C* isn't very significant, it's enough if the above conditions hold for it, so many just use e.g. 1 or some prime number, but choosing *A* correctly turns out to be quite hard.
Here is a template for a [C](c.md) linear congruential generator:
```
T _currentRand; // the internal number
void randomSeed(unsigned int seed)
{
_currentRand = seed;
}
unsigned int random()
{
_currentRand = A * _currentRand + C;
return _currentRand >> S;
}
```
Here `T` is the data type of the internal number (implying the *M* constant) -- use some fixed width type from `stdint.h` here. `C` is the multiplicative constant, `A` is the additive constant and `S` is a shift that implies how many highest bits of the internal number will be taken (and therefore what range of numbers we'll be getting). The following table gives you a few hopefully good values you can just plug into this snippet to get an alright generator:
| `T` | `A` | `C` | `S` | note |
| ------------- | ------------------- | --------- | -------- | -------------------- |
| `uint32_t` | 32310901 | 37 | 16 or 24 | *A* from *L'Ecuyer* |
| `uint32_t` | 2891336453 | 123 | 16 or 24 | *A* from *L'Ecuyer* |
| `uint32_t` | 2147001325 | 715136305 | 16 | from *Entacher 1999* |
| `uint32_t` | 22695477 | 1 | 16 | from *Entacher 1999* |
| `uint64_t` | 2862933555777941757 | 12345 | 32 or 48 | *A* from *L'Ecuyer* |
| `uint64_t` | 3935559000370003845 | 1719 | 32 or 48 | *A* from *L'Ecuyer* |
{ I pulled the above numbers from various sources I found, mentioned in the note, tried to select the ones that were allegedly good, I also quickly tested them myself, checked the period was at maximum at least for the 32 bit generators and lower. ~drummyfish }
2024-05-27 22:48:10 +02:00
Let's also quickly mention **another kind of generator** as an alternative -- the *middle square plus Weyl sequence* generator. Middle square generator was one of the first and is very simple, it simply starts with a number (seed), squares it, takes its middle digits as the next number, squares it, takes its middle digits and so on. The issue with this was mainly getting a number 0, at which we get stuck. A 2022 paper by *Wydinski* seems to have solved this issue by employing so called *Weyl* sequence -- basically just adding some odd number in each step, though the theory is a bit more complex, the paper goes on to prove a high period of this generator. An issue seems to be with seeding the sequence -- the generator has three internal numbers and they can't just be blindly set to "anything" (the paper gives some hints on how to do this). Here is a 32 bit variant of such generator (the paper gives a 64 bit one):
{ I tried to make a 32 bit version of the generator, tried to choose the `_rand3` constant well -- after quickly testing this the values of the generator looked alright, though I just eyeballed the numbers, each bit separately, checked the mean of some 4000 values and the histogram of 1 million values. I'm not claiming this version to be statistically good, but it may be a start for implementing something nice, use at own risk. ~drummyfish }
```
#include <stdint.h>
uint32_t _rand1, _rand2, _rand3 = 0x5e19dbae;
uint16_t random()
{
_rand2 += _rand3;
_rand1 = _rand1 * _rand1 + _rand2;
_rand1 = (_rand1 >> 16) | (_rand1 << 16);
return _rand1;
}
```
NOTE on the code: the `(_rand1 >> 16) | (_rand1 << 16)` operation effectively makes the function return lower 16 bits of the squared number's middle digits, as multiplying `_rand1` (32 bit) by itself results in the lower half of a 64 bit result.
Yet another idea might be to use some good [hash](hash.md) just on numbers 1, 2, 3, 4 ... The difference here is we are not computing the pseudorandom number from the previous one, but we're computing *N*th pseudorandom number directly from *N*. This will probably be slower. For example: { Again, no big guarantees. ~drummyfish }
```
#include <stdint.h>
uint32_t _rand = 0;
uint32_t random()
{
uint32_t x = _rand;
_rand++;
x = 303484085 * (x ^ (x >> 15));
x = 985455785 * (x ^ (x >> 15));
return x ^ (x >> 15);
}
void randomSeed(uint32_t seed)
{
_rand = seed; // this alone just offsets the sequence
seed = random(); // this is an attempt at fix
}
```
2024-05-25 21:44:26 +02:00
**How to generate a number in certain desired range?** As said your generator will be giving you numbers of certain fixed number of bits, usually something like 16 or 32, which means you'll be getting numbers in range 0 to 2^bits - 1. But what if you want to get numbers in some specific range *A* to *B* (including both)? To do this you just need to generate a number in range 0 to *B - A* and then add *A* to it (e.g. to generate number from 20 to 30 you generate a number from 0 to 10 and add 20). So let's just suppose we want a number in range 0 to *N* (where *N* can be *B - A*). Let's now suppose *N* is lower than the upper range of our generator, i.e. that we want to get the number into a small range (if this is not the case, we can arbitrarily increase the range of our generator simply by generating more random bits with it, i.e we can join two 16 bit numbers to get a 32 bit number etc.). Now the most common way to get the number in the desired range is by using *modulo (N + 1)* operation, i.e. in [C](c.md) we simply do something like `int random0to100 = random() % 101;`. This easily forces the number we get into the range we want. HOWEVER beware, there is one statistical trap about this called the **modulo bias** that makes some numbers slightly more likely to be generated than others, i.e. it biases our distribution a little bit. For example imagine our generator gives us numbers from 0 to 15 and we want to turn it into range 0 to 10 using the modulo operator, i.e. we'll be doing *mod 11* operation -- there are two ways to get 0 (*0 mod 11* and *11 mod 11*) but only one way to get 9 (*9 mod 11*), so number 0 is twice as likely here. In practice this effect isn't so strong and in many situations we don't really mind it, but we have to be aware of this effects for the sake of cases where it may matter. If necessary, the effect can be reduced -- we may for example realize that modulo bias will be eliminated if the upper range of our generator is a multiple of the range into which we'll be converting, so we can just repeatedly generate numbers until it falls under a limit that's a highest multiple of our desired range lower than the true range of the generator.
2024-05-27 22:48:10 +02:00
**What if we want [floating point](float.md)/[fixed point](fixed_point.md) numbers?** Just convert the integer result to that format somehow, for example `((float) random()) / ((float) RANDOM_MAX_VALUE)` will produce a floating point number in range 0 to 1.
2024-05-25 21:44:26 +02:00
**How to generate other probability distributions?** Up until now we supposed a uniform probability distribution, i.e. the most random kind of generator that has an equal probability of generating any number. But sometimes we want a different distribution, i.e. we may want some numbers to be more likely to be generated than others. For this we normally start with the uniform generator and then convert the number into the new distribution. For that we may make use of the following:
- **Averaging many uniform distributions converges to normal distribution** -- this is called *central limit theorem* and in fact works even more generally (the averaged distribution doesn't have to be uniform), but to us it's enough to know that if we want normally distributed random numbers, we can just average many uniformly distributed variables. Intuitively this makes sense -- averaging many numbers will likely be close to the mean value, it's very unlikely to be close to either end as to get an extreme average we would have to roll only numbers close to that one extreme.
- **Elimination method**: This is a very straightforward method, which however requires iteration (i.e. it may not be the fastest). If you know the probability density function of your desired distribution, you can randomly generate 2D points *[x,y]* and once one of them falls under the graph of the density function, you take *x* as the generated value. Again, this makes intuitive sense -- values that have low probability according to the distribution have also lower probability to be generated this way as they have a smaller range of passing values.
- **Inverse transform method**: In this method we take the inverse function of our desired distribution's cumulative distribution function (function that for any *x* says the probability the outcome will be lower than or equal to *x*) and we pass to it a number in range 0 to 1 generated with the vanilla uniform distribution -- this gives us the number in the desired distribution. However for this we have to know the cumulative distribution function's inverse function.
- ...
## Quality/Testing Of Pseudorandom Sequences/Generators
This topic alone can be extremely complex, you could probably do a [PhD](phd.md) on it, let's just do some overview for mere noobs like us.
Firstly you want your generator to be simply a good program in general, i.e. you want it to be as fast, small and as [simple](simple.md) as possible while generating nice sequences. Consider that someone will want to use it to for example generate a white noise for a whole HD screen 60 times per second -- that may be something like 100 million numbers per second! We often judge generators by type and number of operations they need to generate the next number, as that will imply the speed -- as always, you generally want to avoid general division, branching, minimize multiplications and so on. Typically you'll probably want something like one multiplication and a few fast operations like an addition and modulo by power of two. That's basically what the congruential generators do. See also [optimization](optimization.md).
However the core of a pseudorandom generator is the quality of the sequence itself, i.e. ensuring it really looks very random, that it has no patterns. To a noob this sounds easy, he thinks that if you just make some random shit with the numbers it's going to look random, he also thinks that the more things you do with the numbers the more random they'll look -- this is FALSE (typically if you just throw in more operations without thinking you'll make the sequence worse). It's not easy at all to make a random looking sequence of numbers! There is a danger in that with some effort you can make a sequence that will look random to you, but once you use the sequence to generate something -- for example some kind of noise -- you'll see there is actually some pattern, for example you'll see some straight lines in the noise etc. For some applications it may be sufficient to have a lower quality generator, but the danger lies in the cases in which you need a good generator and you think you have one while in fact you don't -- typically scientific simulation have to be extremely careful about this. For example if you're doing [genetic programming](genetic_programming.md) and you need to somehow randomly pick organisms so as to make the evolution fair -- if you have a bad generator, your program won't work because it may secretly be not too random in its choices, it will be unfair -- in this case it will be VERY hard for you to find the cause of why your program doesn't work (in the worse case you'll think it works and will trust its results when in fact the results are bad). To avoid this trap you need to actually test the quality of your sequence (TBH if you want a really good generator you should definitely use something better than the mentioned congruential generator). Let's see some general stuff you can try (considering the base case of generating uniformly distributed numbers):
- **Intuitive check with your senses**: Obviously you'll take a look at the numbers and see if they look random, additionally you can also for example convert the numbers to another base, interpret the sequence as a sequence of different length words (8bit, 10bit, 16bit, ...), convert them to a string, reverse their bits and so on -- it should always look random. Next convert the data to something visual and maybe even auditory, your senses are trained to spot patterns in data. Firstly make a normal plot of the values, it should look like random noise, going up and down completely randomly. Then make a 2D image of the data -- for example plot the values as colors, plot each bit separately etc. -- keep changing the picture width and see if patterns appear. You can also try to make something like [Ulam spiral](ulam_spiral.md). Another way to make a picture (or a 3D point cloud) is to interpret the sequence as a sequence of 2D (or 3D) coordinates and plotting them as points. You must see no patterns in these images, such as clear lines, rectangles, ripples or chunks, it has to look like static [noise](noise.md). You can possibly also try to play the data as a sound, you should hear no tones, or rhythm or anything, just static.
- **Each bit separately has to look random**: Check each bit of the data separately, i.e. firstly take a look only at sequence of the lowest bits of each number -- it has to look random. If you see for example 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, ..., you see it's not really random (if this is the lowest bit the sequence here implies you're getting an odd number, then even, then odd, then even etc.). Do all the tests for this bit alone, then check another bit etc.
- **Check the sequence period**: You want the longest possible period of your generator, i.e. if your generator has an *N* bit internal number and you start with number *A*, you get back to *A* after *2^N* steps and no sooner -- this will ensure not only the maximum period length, but also that the period length will be the same for every starting seed! That's because in this ideal case you simply have a single cycle over all the possible internal number values.
- **Try to [compress](compression.md) the sequence**: Truly random data should be basically impossible to compress -- you can exploit this fact and try to compress your sequence with some compression programs. It is ideal if the compression programs end up enlarging the file.
- **Statistical tests**: Here you use objective mathematical tests -- there exist many very advanced tests, we'll only mention the very simple ones.
2024-05-27 22:48:10 +02:00
- **[Histogram](histogram.md)**: Generate many numbers (but not more than the generator's period) and make a histogram (i.e. for every number count how many times it appeared) -- all numbers should appear roughly with the same frequency. If you make a nice generator, you should even see exactly the same count for every value generated -- this is explained above. Also count 1 and 0 bits in the whole sequence -- again there should be about the same number of them (exactly the same if you do it correctly). But keep in mind this only checks if you have correct frequencies of numbers, it says nothing about their distribution. Even a sequence 1, 2, 3, 4, 5, .... will pass this.
2024-05-25 21:44:26 +02:00
- **Averaging any non-short interval should be close to middle value**: In a random sequence it should hold that if you take any interval that's not too short -- let's say at leas 100 numbers in a row -- the average value should very likely be close to the middle value (the longer the interval, the closer it should be). You can test your sequence like this. This already takes into account even the distribution of the numbers.
- **[Fourier transform](fourier_transform.md)** (and similar methods that give you the spectrum) -- the spectrum of the data should have equal amount of all frequencies, just like white noise.
2024-05-27 22:48:10 +02:00
- **[Correlation](correlation.md) coefficients**: This is kind of the real proof of randomness, ideally no values should be correlated in your data, so you can try to compute some correlation coefficients, for example try to compute how much correlation there is between consecutive numbers (it's similar to plotting the data as coordinates and seeing if they form a line or not) -- you should ideally find no significant correlations.
2024-05-25 21:44:26 +02:00
- **Chi square test**: Very common test for this kind of thing, see also *poker test*.
2024-05-27 22:48:10 +02:00
- **[Monte Carlo](monte_carlo.md) tests**: Monte Carlo algorithms use random sampling to compute a certain desired value -- for example the value of [pi](pi.md). These suppose that we can sample certain space randomly, so we can exploit this -- if we know what result we want to get (for example we already know the value of pi) we can test the algorithm with our generator and see if we get the desired result -- if we come close to the desired result, we can be a bit more confident that our sampling was random, however we cannot be certain of it -- like with any testing we can only ever be certain about the presence of an error, not about the lack of it. Even a very dense, regular grid of points would probably pass this.
2024-05-25 21:44:26 +02:00
- **The cool uber randomness test** described in article on [randomness](randomness.md) ;) Basically every number (and by extension any sequence of numbers) should be equally likely to be followed by any other number.
- For the linear congruential generators there's a so called spectral test, it seems to be the one true test for that kind of generators, make sure to do it if you're aiming for the top generator.
- ...
- **Test programs**: There exist programs that do the automatic tests for you, for example [ent](ent.md).
- ...
## See Also
- [pseudo](pseudo.md)
- [randomness](randomness.md)
2024-05-27 22:48:10 +02:00
- [noise](noise.md)
- [bytebeat](bytebeat.md)