From 11787adf26befb0bffa2a19299ba322d39af48ed Mon Sep 17 00:00:00 2001
From: Miloslav Ciz
Date: Sun, 16 Jul 2023 12:50:44 +0200
Subject: [PATCH] Update

---
 compression.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/compression.md b/compression.md
index cafc9bb..200c6ae 100644
--- a/compression.md
+++ b/compression.md
@@ -25,7 +25,7 @@ The following is an example of how well different types of compression work for
 | image lossy (JPG), nearly indistinguishable quality | 164 | 0.054 |
 | image lossy (JPG), ugly but readable                | 56  | 0.018 |
 
-**Every lossless compression will inevitably enlarge some input files**, i.e. it is mathematically impossible to make a lossless compressor which would make every input smaller than the original (if this was possible, we could just apply this compression over and over and reduce literally anything to 0 bytes). Why is this so? Imagine we are trying to compress data that may be up to 3 bits long -- then we are really looking for a way to map values to shorter values, e.g. *001 compresses to 01*, so that it is also possible to get the original value back from the latter value, i.e. *01 decompresses to 001*. This means each input value must uniquely map to one output value and vice versa (the mapping must be [bijective](bijection.md)), otherwise (if two or more input values mapped to the same output value) we couldn't know what value to later decompress. However this can't be done because there will always be fewer possible output values than input values, as we are trying to map longer sequences to shorter ones (of which there are always fewer). In our case of 3 bits we have 14 possible input values (2 1-bit values, 4 2-bit values plus 8 3-bit values) but only 6 output values (2 1-bit values plus 4 2-bit values), simply because the output values cannot be longer than 2 bits. Hence we are left with no other option than to map some input values to longer output values.
+Mathematically there cannot exist a lossless compression algorithm that would always reduce the size of any input data -- if it existed, we could just repeatedly apply it and compress ANY data to zero bytes. And not only that -- **every lossless compression will inevitably enlarge some input files**. This is also mathematically given -- we can see compression as simply mapping input binary sequences to output (compressed) binary sequences, while such a mapping has to be one-to-one ([bijective](bijection.md)); it can be easily shown that if we make any such mapping that reduces the size of some input (maps a longer sequence to a shorter one, i.e. compresses it), we will also have to map some shorter sequence to a longer one. However we can make it so that our compression algorithm enlarges a file by at most 1 bit: we can say that the first bit in the compressed data says whether the following data is compressed or not; if our algorithm fails to reduce the size of the input, it simply sets the bit to say so and leaves the original file uncompressed.
 
 **Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. However human language for example contains many redundancies.
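To make the flag trick from the added paragraph above a bit more concrete, here is a minimal sketch in C -- the function names are made up and a whole byte is used as the flag instead of a single bit, just to keep it short; any compressor could be wrapped this way so that its output is never more than one byte (one bit, if we packed the flag) bigger than the input:

```
#include <stdio.h>
#include <string.h>

/* Stand-in for a real compressor: tries to compress len bytes from in into
   out and returns the compressed length, or 0 if it couldn't shrink the
   data. Here it simply always "fails" -- plug in a real algorithm. */
unsigned int tryCompress(const unsigned char *in, unsigned int len,
  unsigned char *out)
{
  (void) in; (void) len; (void) out;
  return 0;
}

/* Wraps the compressor: the first output byte is a flag (1 = compressed,
   0 = stored verbatim), so the output is at most one byte larger than the
   input. Returns the total number of bytes written to out. */
unsigned int pack(const unsigned char *in, unsigned int len,
  unsigned char *out)
{
  unsigned int compressedLen = tryCompress(in,len,out + 1);

  if (compressedLen != 0 && compressedLen < len)
  {
    out[0] = 1;
    return compressedLen + 1;
  }

  out[0] = 0;
  memcpy(out + 1,in,len); /* couldn't compress, store as is */
  return len + 1;
}

int main(void)
{
  unsigned char out[16];
  unsigned int outLen = pack((const unsigned char *) "HELLO",5,out);

  printf("packed to %u bytes, flag = %u\n",outLen,out[0]);
  return 0;
}
```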
 Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer", as there is no other similar English word (we also see that a compression algorithm is always specific to the type of data we expect on the input -- we have to know what kind of input data to expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression, on the other hand, tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; all caps text is still readable, though less comfortably.
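A character-level version of the same redundancy-removing idea as the "5xHELLO" example is easy to sketch; the following is a tiny made-up C example (not any particular real compressor) that prints runs of repeated characters as the character followed by a count:

```
#include <stdio.h>

/* Prints a naive run-length encoded form of the string: each run of the
   same character becomes the character followed by the run length, e.g.
   "HHHHHEEEELLLLLOOOO" becomes "H5E4L5O4". Decompression just expands
   each pair back into a run. */
void rlePrint(const char *s)
{
  while (*s)
  {
    char c = *s;
    unsigned int count = 0;

    while (*s == c) /* measure the run of identical characters */
    {
      count++;
      s++;
    }

    printf("%c%u",c,count);
  }
}

int main(void)
{
  rlePrint("HHHHHEEEELLLLLOOOO"); /* prints H5E4L5O4 */
  putchar('\n');
  return 0;
}
```

Of course on text with few repeated characters this naive scheme enlarges the data, which is exactly the point made above about every lossless compressor having to enlarge some inputs.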