# Compression
Compression means encoding [data](data.md) (such as images or texts) in a different way so that the data takes less space (memory) while keeping all the important [information](information.md), or, in plain terms, it usually means "making files smaller". Compression is pretty important so that we can utilize memory or [bandwidth](bandwidth.md) well -- without it our hard drives would be able to store just a handful of videos, the internet would be slow as hell due to the gigantic amount of transferred data and our [RAM](ram.md) wouldn't suffice for things we normally do. There are many [algorithms](algorithm.md) for compressing various kinds of data, differing by their complexity, performance, efficiency of compression etc. The reverse process to compression (getting the original data back from the compressed data) is called **decompression**. The ratio of the original data size to the compressed data size is called **compression ratio** (the higher, the better). The science of data compression is truly huge and complicated AF; here we'll just mention some very basics. Also watch out: compression algorithms are often a [patent](patent.md) minefield.
{ CORRECTION NOTE: Allow me to interject -- I used to have compression ratio defined here as compressed to original, then noticed it's usually defined as a reciprocal of that, corrected it now. There seems to be some general confusion though, some actually define it as "space saved", i.e. *1 - compressed / original*. Doesn't matter that much anyway, but it's probably better to stick to an established convention. ~drummyfish }
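
To make the difference between the two conventions concrete, here is a tiny throwaway C example (the numbers are made up just to roughly match the table below):

```
/* compression ratio vs "space saved", illustrated on made-up numbers */
#include <stdio.h>

int main(void)
{
  double originalKB = 3000, compressedKB = 300; /* e.g. an image compressed to PNG */

  printf("compression ratio: %f\n",originalKB / compressedKB);            /* 10  */
  printf("space saved: %.0f%%\n",100 * (1 - compressedKB / originalKB));  /* 90% */

  return 0;
}
```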
{ I've now written a tiny LRS compression library/utility called [shitpress](shitpress.md), check it out at https://codeberg.org/drummyfish/shitpress. It's fewer than 200 LOC, so plain it can nicely serve educational purposes. The principle is simple, kind of a dictionary method, where the dictionary is simply the latest 64 characters of output; if we find a long word that occurred recently, we simply reference it with mere 2 bytes. It works relatively well for most data! ~drummyfish }
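
To illustrate the general idea (this is NOT shitpress's actual format, just a rough sketch of the "reference recently seen output" principle): literal bytes are copied to the output as they are, and a string that already appeared among the last 64 bytes is replaced by a 2 byte back reference. For simplicity the sketch assumes the input contains no byte with value 0xFF (e.g. plain ASCII text).

```
/* toy sketch of a tiny sliding-window ("reference recent output") compressor,
   NOT shitpress itself: a repeated string found within the last 64 bytes is
   replaced by two bytes: the marker 0xFF followed by a byte packing the
   backwards distance (6 bits) and the match length minus 3 (2 bits);
   assumes the input itself contains no 0xFF byte (e.g. ASCII text) */

#include <stdio.h>
#include <string.h>

int toyCompress(const unsigned char *in, int inLen, unsigned char *out)
{
  int outLen = 0, i = 0;

  while (i < inLen)
  {
    int bestLen = 0, bestDist = 0;

    for (int dist = 1; dist <= 64 && dist <= i; ++dist) /* search last 64 bytes */
    {
      int len = 0;

      while (len < 6 && len < dist && i + len < inLen &&
        in[i + len] == in[i + len - dist])
        len++;

      if (len > bestLen)
      {
        bestLen = len;
        bestDist = dist;
      }
    }

    if (bestLen >= 3) /* long enough to pay off for the 2 byte reference */
    {
      out[outLen++] = 0xFF;
      out[outLen++] = (unsigned char) (((bestDist - 1) << 2) | (bestLen - 3));
      i += bestLen;
    }
    else
      out[outLen++] = in[i++]; /* literal byte, copied as is */
  }

  return outLen;
}

int main(void)
{
  const char *text = "banana banana banana, text that repeats repeats";
  unsigned char out[128];

  int n = toyCompress((const unsigned char *) text,(int) strlen(text),out);

  printf("%d bytes -> %d bytes\n",(int) strlen(text),n);
  return 0;
}
```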
{ There is a cool compressing competition known as Hutter Prize that offers 500000 pounds (no idea how much of normal money that is lol) to anyone who can break the current record for compressing [Wikipedia](wikipedia.md). Currently the record is at compressing 1GB down to 115MB. See http://prize.hutter1.net for more. ~drummyfish }
{ [LMAO](lmao.md) retard [patents](patent.md) are being granted on impossible compression algorithms, see e.g. http://gailly.net/05533051.html. See also [Sloot Digital Coding System](sloot.md), a miraculous compression algorithm that "could store a whole movie in 8 KB" lol. ~drummyfish }
We should also mention compression is not applied just to files on hard drives, it may just as well be used let's say in [RAM](ram.md) to utilize it more efficiently. [OpenGL](opengl.md) for instance offers the option to compress textures uploaded to the [GPU](gpu.md) to save space.
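
In OpenGL this can be as simple as asking for a compressed internal format when uploading the texture -- a minimal sketch, assuming a valid OpenGL (1.3 or newer) context already exists and `pixels` points to raw RGB data; with the generic `GL_COMPRESSED_RGB` format the driver picks (or may even skip) the actual compression scheme:

```
/* sketch: asking OpenGL to keep a texture compressed in video memory;
   this is only the upload call, not a full program */

#include <GL/gl.h>

void uploadCompressedTexture(int w, int h, const unsigned char *pixels)
{
  GLuint tex;

  glGenTextures(1,&tex);
  glBindTexture(GL_TEXTURE_2D,tex);

  /* GL_COMPRESSED_RGB as the internal format lets the driver store the
     texture compressed instead of as raw RGB: */
  glTexImage2D(GL_TEXTURE_2D,0,GL_COMPRESSED_RGB,w,h,0,GL_RGB,
    GL_UNSIGNED_BYTE,pixels);

  glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_LINEAR); /* no mipmaps here */
}
```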
Why don't we compress everything? Firstly because compressed data is slow to work with, it requires significant [CPU](cpu.md) time to compress and decompress data, it's a kind of a space-time tradeoff (we gain more storage space for the cost of CPU time). Compression also [obscures](obscurity.md) data, for example a compressed text file will typically no longer be human readable, and any code wanting to work with such data will have to include the nontrivial decompression code. Compressed data is also more prone to [corruption](corruption.md) because redundant information (which can help restore corrupted data) is removed from it -- in fact we sometimes purposefully do the opposite of compression and make our data bigger to protect it from corruption (see e.g. [error correcting](error_correction.md) codes, [RAID](raid.md) etc.). And last but not least, a lot of data can hardly be compressed or is so small it's not even worth it.
The basic division of compression methods is into:

- **lossless** compression, in which the original data can be reconstructed exactly, bit by bit, and
- **lossy** compression, in which some less important information is dropped in exchange for much better compression, so the original can no longer be restored exactly.

The following is an example of how well different types of compression work for an image:

| compression                                          | ~size (KB) | ratio |
| ----------------------------------------------------- | ---------- | ----- |
| none                                                  | 3000       | 1     |
| general lossless (lz4)                                | 396        | 7.57  |
| image lossless (PNG)                                  | 300        | 10    |
| image lossy (JPG), nearly indistinguishable quality   | 164        | 18.29 |
| image lossy (JPG), ugly but readable                  | 56         | 53.57 |

Mathematically there cannot exist a lossless compression algorithm that would always reduce the size of any input data -- if it existed, we could just repeatedly apply it and compress ANY data to zero bytes. And not only that -- **every lossless compression will inevitably enlarge some input files**. This is also mathematically given -- we can see compression as simply mapping input binary sequences to output (compressed) binary sequences, while such a mapping has to be one-to-one ([bijective](bijection.md)); it can be simply shown that if we make any such mapping that reduces the size of some input (maps a longer sequence to a shorter one, i.e. compresses it), we will also have to map some short code to a longer one. However we can make it so that our compression algorithm enlarges a file by at most 1 bit: we can say that the first bit in the compressed data says whether the following data is compressed or not; if our algorithm fails to reduce the size of the input, it simply sets the bit to say so and leaves the original file uncompressed (in practice many algorithms don't do this though as they try to work as streaming filters, without random access to data, which would be needed here).
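
A sketch of that last trick (just an illustration, not any standard format -- for simplicity it spends a whole flag byte rather than a single bit, and `tryCompress` is only a stub standing in for an arbitrary lossless compressor):

```
/* sketch of the "grow by at most one unit" trick: a flag byte says whether
   the data that follows is compressed or stored as is */

#include <stdio.h>
#include <string.h>

/* stand-in for any lossless compressor: writes compressed data to out and
   returns its length, or -1 if it couldn't make the data smaller;
   here it's only a stub that always "fails" */
int tryCompress(const unsigned char *in, int inLen, unsigned char *out)
{
  (void) in; (void) inLen; (void) out;
  return -1;
}

/* never produces more than inLen + 1 bytes */
int safeCompress(const unsigned char *in, int inLen, unsigned char *out)
{
  int len = tryCompress(in,inLen,out + 1);

  if (len >= 0 && len < inLen)
  {
    out[0] = 1;     /* flag: the data that follows is compressed */
    return len + 1;
  }

  out[0] = 0;       /* flag: stored uncompressed */
  memcpy(out + 1,in,inLen);
  return inLen + 1;
}

int main(void)
{
  unsigned char out[64];
  int n = safeCompress((const unsigned char *) "hello",5,out);

  printf("5 bytes became %d bytes (flag = %d)\n",n,out[0]); /* 6 bytes, flag 0 */
  return 0;
}
```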
{ A quick intuitive example: [encyclopedias](encyclopedia.md) almost always have at the beginning a list of abbreviations they will use in the definition of terms (e.g. "m.a. -> middle ages", ...), this is so that the book gets shorter and they save money on printing. They compress the text. ~drummyfish }
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compression program will compress better (but will only work with images). As an extreme example, consider that **in theory we can make e.g. an algorithm that compresses one specific 100GB video to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general -- i.e. we made an extremely specialized compression and got an extremely good compression ratio, however due to such extreme specialization we can almost never use it. As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get). All in all, some compression [benchmarks](benchmark.md) can be found e.g. at https://web.archive.org/web/20110203152015/http://www.maximumcompression.com/index.html -- the following are some approximate typical compression ratios: English text 8.33, image (lossy) 10, executable 4.
## Methods

Furthermore also take a look at [procedural generation](procgen.md), a related technique.

### Lossy
In lossy compression we generally try to discard information that is not very important and/or to which we aren't very sensitive, typically by dropping precision by [quantization](quantization.md), i.e. basically lowering the number of bits we use to store the "not so important" information -- in some cases we may just drop some information altogether (decrease precision to zero). Finally we also apply lossless compression on top to make the result even smaller.
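
As a crude illustration of quantization (just the bare principle, not any specific codec): throwing away the lowest bits of 8 bit samples keeps the coarse values we care about while making the data more repetitive, so a following lossless stage can squeeze it better.

```
/* crude illustration of quantization as used in lossy compression: 8 bit
   samples are reduced to 16 levels (4 bits of precision); the result is less
   precise but more repetitive, so a lossless stage compresses it better */

#include <stdio.h>

unsigned char quantize(unsigned char value)
{
  return (value / 16) * 16; /* keep only the upper 4 bits worth of precision */
}

int main(void)
{
  unsigned char samples[8] = {12, 13, 15, 14, 200, 205, 198, 202};

  for (int i = 0; i < 8; ++i)
    printf("%d -> %d\n",samples[i],quantize(samples[i]));
  /* prints e.g. 12 -> 0, 13 -> 0, ..., 200 -> 192, 205 -> 192, ... */

  return 0;
}
```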
For **images** we usually exploit the fact that human sight is less sensitive to certain visual information, such as specific frequencies, colors, brightness etc. Common methods used here are:

*(The full article continues with a list of common methods and an example C program; only the program's closing lines appeared in this excerpt.)*

How well does this perform? If we try to let the utility compress its own source code, we get to 1242 bytes from the original 1344, which is not so great -- the compression ratio is only about 1.08 here. We can see why: the only repeating bytes in the source code are the space characters used for indentation -- this is the only thing our primitive algorithm manages to compress. However if we let the program compress its own binary version, we get much better results (at least on the computer this was tested on): the original binary has 16768 bytes while the compressed one has 5084 bytes, which is an EXCELLENT compression ratio of about 3.3! Yay :-)
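
For comparison, here is a minimal run-length encoding (RLE) style sketch in the spirit of such a primitive compressor (NOT the exact program above, just an illustration): every run of a repeated byte is stored as a count plus the byte itself, which pays off for runs like indentation spaces or zero padding, but actually doubles data with no repetition at all -- a nice reminder that every lossless method enlarges some inputs.

```
/* minimal run-length encoding (RLE) sketch: each run of a repeated byte
   becomes a [count, byte] pair; great for long runs, but DOUBLES data
   without any repetition */

#include <stdio.h>

int rleCompress(const unsigned char *in, int inLen, unsigned char *out)
{
  int outLen = 0, i = 0;

  while (i < inLen)
  {
    int run = 1;

    while (run < 255 && i + run < inLen && in[i + run] == in[i])
      run++;

    out[outLen++] = (unsigned char) run; /* run length, 1..255 */
    out[outLen++] = in[i];               /* the repeated byte  */
    i += run;
  }

  return outLen;
}

int main(void)
{
  unsigned char data[64] = {0}; /* mostly zeros, like padding in a binary */
  unsigned char out[130];       /* worst case is twice the input size */

  data[10] = 7;
  data[50] = 9;

  int n = rleCompress(data,64,out);

  printf("64 bytes -> %d bytes\n",n); /* prints 64 bytes -> 10 bytes */
  return 0;
}
```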
## See Also