The following is an example of how well different types of compression work for an image (screenshot of main page of Wikimedia Commons, 1280x800):
{ Though the website screenshot also contained real life photos, it still contained a lot of constant color areas which can be compressed very well, hence the quite good compression ratios here. A general photo won't be compressed as much. }
| compression             | ~size (KB) | ratio |
| ----------------------- | ---------- | ----- |
| none                    | 3000       | 1     |
| general lossless (lz4)  | 396        | 0.132 |
**Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. Human language on the other hand contains many redundancies. Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer" as there is no other similar English word (we also see that a compression algorithm is always specific to the type of data we expect on the input -- we have to know what kind of data to expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression on the other hand tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; an all caps text is still readable, though less comfortably.
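To make the "HELLOHELLO..." example concrete, here is a toy C snippet (purely illustrative and made up for this text -- it only knows how to remove this one specific kind of redundancy, nothing more):

```
#include <stdio.h>
#include <string.h>

/* toy redundancy removal: turns N leading repetitions of a known word
   into the form "NxWORD", e.g. "HELLOHELLOHELLOHELLOHELLO" -> "5xHELLO" */
void compressRepeats(const char *s, const char *word)
{
  unsigned int count = 0;
  size_t wordLen = strlen(word);

  while (strncmp(s,word,wordLen) == 0) /* count the repetitions */
  {
    count++;
    s += wordLen;
  }

  printf("%ux%s%s\n",count,word,s); /* count, the word once, then the rest */
}

int main(void)
{
  compressRepeats("HELLOHELLOHELLOHELLOHELLO","HELLO"); /* prints 5xHELLO */
  return 0;
}
```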
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compression program will compress better (but will only work with images). As an extreme example, consider that **in theory we can make e.g. an algorithm that compresses one specific 100GB video to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general -- i.e. we made an extremely specialized compression and got an extremely good compression ratio, however due to such extreme specialization we can almost never use it. As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get). All in all, some compression [benchmarks](benchmark.md) can be found e.g. at https://web.archive.org/web/20110203152015/http://www.maximumcompression.com/index.html -- the best compression ratios measured there for different types of data were about 0.12 for English text, 0.76 for lossy image compression and 0.24 for executables.
## Methods
In **audio** we usually straight remove frequencies that humans can't hear (usually said to be above 20 kHz), for this we again convert the audio from the time domain to the frequency domain (using e.g. [Fourier transform](fourier_transform.md)). Furthermore it is very inefficient to store sample values directly -- we rather use so called *differential PCM*, a lossless compression that e.g. stores each sample as a difference against the previous sample (which is usually a small number and so doesn't use up many bits). This can be improved by a predictor, which tries to predict the next value from previous values; then we only save the difference against this prediction. *Joint stereo coding* exploits the fact that human hearing is not so sensitive to the direction of the sound and so instead of recording both the left and right stereo channels in full quality it e.g. records the sum of both and the ratio between them (which can get away with fewer bits). *Psychoacoustics* studies how humans perceive sound, for example so called *masking*: certain frequencies may mask nearby (both in frequency and time) frequencies (make them unhearable for humans) so we can drop them. See also [vocoders](vocoder.md).
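As a small illustration of the differential PCM idea, here is a minimal sketch in C (just plain difference coding with no predictor; the function names and the use of plain ints for samples are made up for this example):

```
#include <stdio.h>

/* differential PCM sketch: store each sample as a difference against the
   previous one -- the differences are typically small numbers which could
   then be stored with fewer bits */
void toDifferentialPCM(const int *samples, int *diffs, int count)
{
  int previous = 0;

  for (int i = 0; i < count; ++i)
  {
    diffs[i] = samples[i] - previous;
    previous = samples[i];
  }
}

/* exact inverse: reconstruct the samples by summing up the differences */
void fromDifferentialPCM(const int *diffs, int *samples, int count)
{
  int previous = 0;

  for (int i = 0; i < count; ++i)
  {
    samples[i] = previous + diffs[i];
    previous = samples[i];
  }
}

int main(void)
{
  int samples[8] = {100,102,105,107,106,104,101,100}, diffs[8], back[8];

  toDifferentialPCM(samples,diffs,8);
  fromDifferentialPCM(diffs,back,8);

  for (int i = 0; i < 8; ++i)
    printf("%d -> %d -> %d\n",samples[i],diffs[i],back[i]);

  return 0;
}
```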
TODO: LZW, DEFLATE etc.
## Compression Programs/Utils/Standards
Here is a list of some common compression programs/utilities/standards/formats/etc:
Let's write a simple lossless compression utility in [C](c.md).
The compression will work like this (a rough sketch in C of this scheme is shown right after the list):
- We will choose some random, hopefully not very frequent byte value, as our special "marker value". Let's say this will be the value 0xF3.
- We will read the input file and whenever we encounter a sequence of 4 or more same bytes in a row, we will output these 3 bytes:
- the marker value
- a byte whose value is the length of the sequence minus 4
- the byte to repeat
- If the marker value is encountered in the input, we output 2 bytes:
- the marker value
- value 0xFF (which we won't be able to use for the length of the sequence)
- Otherwise we just output the byte we read from the input.
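
For illustration, the compression side of the scheme described by the list above could be sketched roughly like this (a minimal sketch reading standard input and writing standard output; it is not necessarily identical to the full utility, just one straightforward way to implement the rules):

```
#include <stdio.h>

#define MARKER 0xf3 /* the special marker value chosen above */

/* outputs one compressed run: marker, length - 4, the repeated byte */
void writeRun(int byte, int count)
{
  putchar(MARKER);
  putchar(count - 4);
  putchar(byte);
}

/* outputs a literal byte, escaping the marker value */
void writeLiteral(int byte)
{
  putchar(byte);

  if (byte == MARKER)
    putchar(0xff); /* 0xff says "this was a literal marker byte" */
}

int main(void)
{
  int previous = EOF, count = 0, b;

  do
  {
    b = getchar();

    if (b == previous && count < 4 + 0xfe) /* 0xff is reserved for escaping */
      count++;
    else
    {
      /* the byte changed: flush the run of the previous byte */
      if (count >= 4)
        writeRun(previous,count);
      else
        for (int i = 0; i < count; ++i)
          writeLiteral(previous);

      previous = b;
      count = 1;
    }
  } while (b != EOF);

  return 0;
}
```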
Decompression is then quite simple -- we simply output what we read, unless we read the marker value; in that case we look at the following value: if it is 0xFF, we output the marker value itself, else this value plus 4 tells us how many times to repeat the byte that comes next. A rough sketch of such a decompression loop follows.
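(Again just an illustrative sketch, assuming the 0xF3 marker from above and well-formed compressed input.)

```
#include <stdio.h>

#define MARKER 0xf3 /* the special marker value chosen above */

int main(void)
{
  int b;

  while ((b = getchar()) != EOF)
  {
    if (b != MARKER)
      putchar(b);              /* ordinary byte: just copy it */
    else
    {
      int second = getchar();

      if (second == 0xff)      /* escaped literal marker byte */
        putchar(MARKER);
      else                     /* a run: length - 4, then the byte to repeat */
      {
        int repeated = getchar();

        for (int i = 0; i < second + 4; ++i)
          putchar(repeated);
      }
    }
  }

  return 0;
}
```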
How well does this perform? If we try to let the utility compress its own source code, we get to 1242 bytes from the original 1344, which is not so great -- the compression ratio is only about 92% here. We can see why: the only repeating bytes in the source code are the space characters used for indentation -- this is the only thing our primitive algorithm manages to compress. However if we let the program compress its own binary version, we get much better results (at least on the computer this was tested on): the original binary has 16768 bytes while the compressed one has 5084 bytes, which is an EXCELLENT compression ratio of 30%! Yay :-)
## See Also
- [procedural generation](procgen.md)
- [minification](minification.md)