This commit is contained in:
Miloslav Ciz 2024-03-30 22:07:31 +01:00
parent 24de580085
commit 6c86f68253
16 changed files with 1787 additions and 1743 deletions

View file

@ -33,7 +33,7 @@ The following is an example of how well different types of compression work for
Mathematically there cannot exist a lossless compression algorithm that would always reduce the size of any input data -- if it existed, we could just repeatedly apply it and compress ANY data to zero bytes. And not only that -- **every lossless compression will inevitably enlarge some input files**. This is also mathematically given -- we can see compression as simply mapping input binary sequences to output (compressed) binary sequences, while such mapping has to be one-to-one ([bijective](bijection.md)); it can be simply shown that if we make any such mapping that reduces the size of some input (maps a longer sequence to a shorter one, i.e. compresses it), we will also have to map some short code to a longer one. However we can make it so that our compression algorithm enlarges a file at most by 1 bit: we can say that the first bit in the compressed data says whether the following data is compressed or not; if our algorithm fails to reduce the size of the input, it simply sets the bit to says so and leaves the original file uncompressed (in practice many algorithms don't do this though as they try to work as streaming filters, without random access to data, which would be needed here).
**Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why a completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. However human language for example contains many redundancies. Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer" as there is no other similar English word (we also see that compression algorithm is always specific to the type of data we expect on the input -- we have to know what nature of the input data we can expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression on the other hand tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; an all caps text is still readable, though less comfortably.
**Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why a completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. However human language for example contains many redundancies. Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer" as there is no other similar English word (we also see that compression algorithm is always specific to the type of data we expect on the input -- we have to know what nature of the input data we can expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression on the other hand tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; an all caps text is still readable, though less comfortably. A deeper view of compression oftentimes leads to a realization that compression is really a problem of [artificial intelligence](ai.md), for compression is really about prediction and prediction is about understanding -- this is where state-of-the-art view stands.
{ A quick intuitive example: [encyclopedias](encyclopedia.md) almost always have at the beginning a list of abbreviations they will use in the definition of terms (e.g. "m.a. -> middle ages", ...), this is so that the book gets shorter and they save money on printing. They compress the text. ~drummyfish }