Miloslav Ciz 2025-07-01 23:56:20 +02:00
parent 5e2ef4f6db
commit 244dfff2a8
11 changed files with 2025 additions and 2007 deletions

@ -31,9 +31,11 @@ What follows is an example of how well different types of compression work for a
| --------------------------------------------------- | --------- | ------ |
| none | 3000 | 1 |
| general lossless (lz4) | 396 | 7.57 |
| image lossless (PNG) | 300 | 10 |
| image lossy (JPG), nearly indistinguishable quality | 164 | 18.29 |
| image lossy (JPG), ugly but readable | 56 | 53.57 |
| general lossless (gzip) | 308 | 9.74 |
| image lossless (PNG) | 288 | 10.41 |
| image lossless (WEBP) | 176 | 17.04 |
| image lossy (JPG), good quality (75%) | 148 | 20.27 |
| image lossy (JPG), ugly but readable (15%) | 60 | 50 |
Mathematically there cannot exist a lossless compression algorithm that would always reduce the size of any input data -- if it existed, we could just repeatedly apply it and compress ANY data down to zero bytes. And not only that -- **every lossless compression will inevitably enlarge some input files**. This is also mathematically given -- we can see compression as simply mapping input binary sequences to output (compressed) binary sequences, and such a mapping has to be one-to-one ([bijective](bijection.md)); it can easily be shown that if we make any such mapping that reduces the size of some input (maps a longer sequence to a shorter one, i.e. compresses it), we will also have to map some short sequence to a longer one. However we can make it so that our compression algorithm enlarges a file by at most 1 bit: we can say that the first bit in the compressed data says whether the following data is compressed or not; if our algorithm fails to reduce the size of the input, it simply sets the bit to say so and leaves the original file uncompressed (in practice many algorithms don't do this though as they try to work as streaming filters, without random access to data, which would be needed here).
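The following is a minimal sketch of this flag trick, assuming zlib's compress() as the underlying compressor (any lossless compressor would work the same way); for simplicity a whole flag byte is used here instead of a single bit.

```
/* Sketch of the "flag" trick described above: a whole byte is used here
   instead of one bit for simplicity, and zlib's compress() is assumed as
   the underlying compressor (any lossless compressor would do). */
#include <string.h>
#include <zlib.h>

/* Packs inLen bytes from in into out (which must hold at least
   compressBound(inLen) + 1 bytes), returns the number of bytes written.
   The first output byte says whether the rest is compressed (1) or raw (0). */
unsigned long packWithFlag(const unsigned char *in, unsigned long inLen,
  unsigned char *out)
{
  uLongf outLen = compressBound(inLen);

  if (compress(out + 1,&outLen,in,inLen) == Z_OK && outLen < inLen)
  {
    out[0] = 1;               /* compressed data follows */
    return outLen + 1;
  }

  out[0] = 0;                 /* compression didn't help, store data as is */
  memcpy(out + 1,in,inLen);
  return inLen + 1;
}
```

Unpacking just checks the first byte and either runs the decompressor or copies the rest verbatim, so the worst case overhead is a single byte.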
@ -41,7 +43,7 @@ Mathematically there cannot exist a lossless compression algorithm that would al
{ A quick intuitive example: [encyclopedias](encyclopedia.md) almost always have at the beginning a list of abbreviations they will use when defining terms (e.g. "m.a. -> middle ages", ...), this is so that the book gets shorter and they save money on printing. They compress the text. ~drummyfish }
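To make the idea a bit more concrete, here is a toy sketch of the same principle: both sides are assumed to share a small made-up dictionary of frequent words (like the encyclopedia's abbreviation list), so the compressed text can replace each occurrence with a single byte code and the dictionary itself never has to be transmitted.

```
/* Toy dictionary compression: both compressor and decompressor know the
   dictionary in advance, so it doesn't have to be stored with the data. */
#include <stdio.h>
#include <string.h>

const char *dictionary[] = { "compression", "algorithm", "the ", "and " };
#define DICT_SIZE 4
#define CODE_START 128 /* codes 128..131 stand for dictionary entries,
                          assuming the text itself is plain 7 bit ASCII */

unsigned long compressText(const char *in, unsigned char *out)
{
  unsigned long outPos = 0;

  while (*in)
  {
    int i;

    for (i = 0; i < DICT_SIZE; ++i)
    {
      unsigned long len = strlen(dictionary[i]);

      if (strncmp(in,dictionary[i],len) == 0)
      {
        out[outPos++] = CODE_START + i; /* replace the word with its code */
        in += len;
        break;
      }
    }

    if (i == DICT_SIZE)
      out[outPos++] = *in++;            /* not in dictionary, copy verbatim */
  }

  return outPos;
}

int main(void)
{
  const char *text = "the compression algorithm and the compression ratio";
  unsigned char out[256];
  unsigned long n = compressText(text,out);

  printf("%lu bytes -> %lu bytes\n",(unsigned long) strlen(text),n);
  return 0;
}
```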
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compression program will compress better (but will only work with images). As an extreme example, consider that **in theory we can make e.g. an algorithm that compresses one specific 100GB video to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general -- i.e. we made an extremely specialized compression and got an extremely good compression ratio, however due to such extreme specialization we can almost never use it. As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get). All in all, some compression [benchmarks](benchmark.md) can be found e.g. at https://web.archive.org/web/20110203152015/http://www.maximumcompression.com/index.html -- the following are some approximate typical compression ratios: English text 8.33, image (lossy) 10, executable 4.
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we implicitly know about the nature of the compressed data, the more we can compress it, and this makes very good sense -- what we already know we don't have to encode, so the more we know, the less data there has to be (the smaller the compressed file), but also the more limited we become in what we can compress. So a general compression program will compress only a little, while an image-specialized compression program will compress better (but will only work with images). If we specifically focus only on compressing English text for instance, we can assume it will only consist of words in the English language and so the compressed text doesn't have to come with an English dictionary, but as a result we also won't be able to compress [Chinese](chinese.md) text. For an extreme example consider that **in theory we can make an algorithm that compresses one specific 100GB video down to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general -- i.e. we made an extremely specialized compression and got an extremely good compression ratio, however due to such extreme specialization we can almost never use it. As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. Similarly video compression algorithms used in practice work only for videos that appear in the real world and exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get). All in all, some compression [benchmarks](benchmark.md) can be found e.g. at https://web.archive.org/web/20110203152015/http://www.maximumcompression.com/index.html -- the following are some approximate typical compression ratios: English text 8.33, image (lossy) 10, executable 4.
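The difference between redundant and random data can be demonstrated with a small experiment (again just a sketch, assuming zlib as the compressor): highly redundant data (one repeated byte) shrinks to almost nothing, while pseudorandom data of the same size barely compresses at all.

```
/* Quick experiment: compress redundant vs pseudorandom data with zlib. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define N 100000

int main(void)
{
  static unsigned char redundant[N], noise[N], out[2 * N];
  uLongf outLen;
  int i;

  for (i = 0; i < N; ++i)
  {
    redundant[i] = 'a';        /* maximum redundancy */
    noise[i] = rand() % 256;   /* pseudorandom bytes, hardly any patterns */
  }

  outLen = sizeof(out);
  compress(out,&outLen,redundant,N);
  printf("redundant: %d -> %lu bytes\n",N,(unsigned long) outLen);

  outLen = sizeof(out);
  compress(out,&outLen,noise,N);
  printf("noise:     %d -> %lu bytes\n",N,(unsigned long) outLen);

  return 0;
}
```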
## Methods