Update

2024-01-07 02:43:35 +01:00 · 2024-01-07 02:43:35 +01:00 · 0ea738f524
commit 0ea738f524
parent 9635bbfa85
12 changed files with 26 additions and 17 deletions
--- a/compression.md
+++ b/compression.md
@ -35,6 +35,8 @@ Mathematically there cannot exist a lossless compression algorithm that would al

 **Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why a completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. However human language for example contains many redundancies. Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer" as there is no other similar English word (we also see that compression algorithm is always specific to the type of data we expect on the input -- we have to know what nature of the input data we can expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression on the other hand tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; an all caps text is still readable, though less comfortably.

+{ A quick intuitive example: [encyclopedias](encyclopedia.md) almost always have at the beginning a list of abbreviations they will use in the definition of terms (e.g. "m.a. -> middle ages", ...), this is so that the book gets shorter and they save money on printing. They compress the text. ~drummyfish }
+
 **OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compression program will compress better (but will only work with images). As an extreme example, consider that **in theory we can make e.g. an algorithm that compresses one specific 100GB video to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general -- i.e. we made an extremely specialized compression and got an extremely good compression ratio, however due to such extreme specialization we can almost never use it. As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get). All in all, some compression [benchmarks](benchmark.md) can be found e.g. at https://web.archive.org/web/20110203152015/http://www.maximumcompression.com/index.html -- the following are mentioned types of data and their best measured compression ratios: English text 0.12, image (lossy) 0.76, executable 0.24.

 ## Methods