This commit is contained in:
Miloslav Ciz 2023-07-16 00:02:11 +02:00
parent 03f1dcaca6
commit f63d3b7557

View file

@ -1,12 +1,12 @@
# Compression
Compression means encoding [data](data.md) (such as images or texts) in a different way so that it takes less storage memory while keeping all the important [information](information.md), or, in plain terms, it usually means "making files smaller". Compression is pretty important so that we can utilize memory well -- without it our hard drives would be able to store just a handful of videos, internet would be slow as hell due to the gigantic amount of transferred data and our [RAM](ram.md) wouldn't suffice for things we normally do. There are many [algorithms](algorithm.md) for compressing various kinds of data, differing by their complexity, performance, efficiency of compression etc. The reverse process to compression (getting the original data back from the compressed data) is called **decompression**. The ratio of the compressed data size to the original data size is called **compression ration** (the lower, the better). The science of data compression is truly huge and complicated AF, here we'll just mention some very basics.
Compression means encoding [data](data.md) (such as images or texts) in a different way so that it takes less storage memory while keeping all the important [information](information.md), or, in plain terms, it usually means "making files smaller". Compression is pretty important so that we can utilize memory well -- without it our hard drives would be able to store just a handful of videos, internet would be slow as hell due to the gigantic amount of transferred data and our [RAM](ram.md) wouldn't suffice for things we normally do. There are many [algorithms](algorithm.md) for compressing various kinds of data, differing by their complexity, performance, efficiency of compression etc. The reverse process to compression (getting the original data back from the compressed data) is called **decompression**. The ratio of the compressed data size to the original data size is called **compression ratio** (the lower, the better). The science of data compression is truly huge and complicated AF, here we'll just mention some very basics.
{ There is a cool compressing competition known as Hutter Prize that offers 500000 pounds to anyone who can break the current record for compressing [Wikipedia](wikipedia.md). Currently the record is at compressing 1GB down to 115MB. See http://prize.hutter1.net for more. ~drummyfish }
Let's keep in mind compression is not applied just to files on hard drives, it can also be used e.g. in RAM to utilize it more efficiently.
Why don't we compress everything? Firstly because compressed data is slow to work with, it requires significant CPU time to compress and decompress data, it's a kind of a space-time tradeoff (we gain more storage space for the cost of CPU time). Secondly compressed data is more prone to [corruption](corruption.md) because redundant information (which can help restoring corrupted data) is removed from it -- in fact we sometimes purposefully do the opposite of compression and make our data bigger on purpose to protect it from corruption (see e.g. [error correcting](error_correction.md) codes, [RAID](raid.md) etc.). And last but not least, many data can hardly be compressed or are so small it's not even worth it.
Why don't we compress everything? Firstly because compressed data is slow to work with, it requires significant CPU time to compress and decompress data, it's a kind of a space-time tradeoff (we gain more storage space for the cost of CPU time). Secondly compressed data is more prone to [corruption](corruption.md) because redundant information (which can help restoring corrupted data) is removed from it -- in fact we sometimes purposefully do the opposite of compression and make our data bigger to protect it from corruption (see e.g. [error correcting](error_correction.md) codes, [RAID](raid.md) etc.). And last but not least, many data can hardly be compressed or are so small it's not even worth it.
The basic division of compression methods is to:
@ -29,7 +29,7 @@ The following is an example of how well different types of compression work for
**Dude, how does compression really work tho?** The basic principle of lossless compression is **removing [redundancy](redundancy.md)** ([correlations](correlation.md) in the data), i.e. that which is explicitly stored in the original data but doesn't really have to be there because it can be reasoned out from the remaining data. This is why a completely random [noise](noise.md) can't be compressed -- there is no correlated data in it, nothing to reason out from other parts of the data. However human language for example contains many redundancies. Imagine we are trying to compress English text and have a word such as "computer" on the input -- we can really just shorten it to "computr" and it's still pretty clear the word is meant to be "computer" as there is no other similar English word (we also see that compression algorithm is always specific to the type of data we expect on the input -- we have to know what nature of the input data we can expect). Another way to remove redundancy is to e.g. convert a string such as "HELLOHELLOHELLOHELLOHELLO" to "5xHELLO". Lossy compression on the other hand tries to decide what information is of low importance and can be dropped -- for example a lossy compression of text might discard information about case (upper vs lower case) to be able to store each character with fewer bits; an all caps text is still readable, though less comfortably.
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compress program will compress better (but will only work with images). As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. **In theory we can make an algorithm that compresses one specific 1000GB video to 1 bit** (we just define that a bit "1" uncompresses to this specific video), but it will only work for that one single video, not for video in general. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get).
**OK, but how much can we really compress?** Well, as stated above, there can never be anything such as a universal uber compression algorithm that just makes any input file super small -- everything really depends on the nature of the data we are trying to compress. The more we know about the nature of the input data, the more we can compress, so a general compression program will compress only a little, while an image-specialized compress program will compress better (but will only work with images). As said, we just cannot compress completely random data at all (as we don't know anything about the nature of such data). On the other hand data with a lot of redundancy, such as video, can be compressed A LOT. **In theory we can make an algorithm that compresses one specific 100GB video to 1 bit** (we just define that a bit "1" decompresses to this specific video), but it will only work for that one single video, not for video in general. Similarly video compression algorithms used in practice work only for videos that appear in the real world which exhibit certain patterns, such as two consecutive frames being very similar -- if we try to compress e.g. static (white noise), video codecs just shit themselves trying to compress it (look up e.g. videos of confetti and see how blocky they get).
## Methods