29th February 2016, 22:30   #7
LoRd_MuldeR
Software Developer
 
Join Date: Jun 2005
Location: Last House on Slunk Street
Posts: 13,248
Quote:
Originally Posted by feisty2
but I guess π could help here, since any random base-10 data will eventually appear somewhere within the digits of π, so an extra compression pass could be done with the following steps:
1. convert the binary into many base-10 data blocks
2. search for the data block in the digits of π; if a match is found, record the matching digit position as a1 and the length of the data block as a2
3. if the size of a1+a2 is >= the size of the data block itself, skip the remaining steps and start matching the next data block; if not, replace the whole data block with a1 and a2
4. repeat steps 1-3 until all data blocks have been processed

what do you think?
What you describe here is essentially "dictionary compression" (cf. LZ77), and pretty much any general-purpose lossless compressor uses this in one way or another.

I think the most important difference between your idea and how dictionary compression is implemented in practice is that you are using a fixed dictionary (i.e. the digits of π), while practical implementations use a dynamic dictionary. The latter means the dictionary is built/updated while the data is being compressed/decompressed. Most often, a "sliding window" buffer containing the last N symbols that have been compressed/decompressed is used here.
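
To illustrate the "sliding window" idea with a concrete (if naive) sketch - this is my own illustration in Python, not the code of any real codec; the window size, minimum match length and token layout are arbitrary choices:

Code:
# Minimal LZ77-style sketch: the "dictionary" is just a sliding window
# over the most recently processed bytes. All constants and the token
# layout are arbitrary illustration choices, not a real file format.

WINDOW_SIZE = 4096   # the last N bytes serve as the dictionary
MIN_MATCH   = 3      # shorter matches are not worth encoding

def lz77_tokens(data: bytes):
    """Yield ('match', distance, length) or ('literal', byte) tokens."""
    pos = 0
    while pos < len(data):
        start = max(0, pos - WINDOW_SIZE)
        best_len, best_dist = 0, 0
        # naive search: try every start position inside the sliding window
        for cand in range(start, pos):
            length = 0
            # keep the match source inside already-processed data,
            # to keep this illustration simple (real LZ77 allows overlap)
            while (cand + length < pos
                   and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            yield ('match', best_dist, best_len)
            pos += best_len
        else:
            yield ('literal', data[pos])
            pos += 1

def lz77_decode(tokens) -> bytes:
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'literal':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])   # copy from already-decoded output
    return bytes(out)

# round-trip sanity check
sample = b"abracadabra abracadabra"
assert lz77_decode(lz77_tokens(sample)) == sample

Note how the second occurrence of "abracadabra" in the sample gets replaced by a single (distance, length) reference into the window - precisely because the dictionary was built from the data itself.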

Another difference is that your dictionary has an infinite size (i.e. the digits of π), while practical applications use a dictionary of bounded size (e.g. the last N symbols that have been processed). Overall, I think your idea would perform worse than the algorithms used in practice, because (a) your dictionary does not adapt to the input data and (b), because of that, you will need a VERY large "a1" to find a match - which means "a1+a2" will almost always be bigger than the data block itself.
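
To put a rough number on (b) - a back-of-the-envelope illustration of my own, assuming the digits of π behave statistically like random digits: an arbitrary block of k decimal digits is only expected to show up within roughly the first 10^k digits, so the offset a1 alone already needs about k·log2(10) ≈ 3.32·k bits, i.e. about as many bits as the raw block itself, before the length field a2 is even added:

Code:
import math

# Expected cost of the pi-dictionary scheme (rough illustration only):
# a "random" block of k decimal digits first appears around position ~10^k,
# so storing the offset a1 costs about log2(10^k) bits - roughly the same
# as storing the k raw digits directly, even before adding the length a2.
for k in (4, 8, 16, 32):
    raw_bits    = k * math.log2(10)    # bits needed for the raw k-digit block
    offset_bits = math.log2(10 ** k)   # bits needed for the expected offset a1
    print(f"k={k:2d} digits: raw block ~{raw_bits:6.1f} bits, "
          f"offset a1 alone ~{offset_bits:6.1f} bits (+ length field a2)")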

Quote:
Originally Posted by ggf31416
You can't losslessly compress random data (on average). A simplified argument goes like this:
Take all 2^n possible different files of size n. Compress them with your random-data compressor so that each becomes 1 bit smaller. Now you have 2^n files of size n-1, but there can be only 2^(n-1) different files of size n-1, so you cannot correctly decompress half of your files.
This is called the "Schubfachprinzip", or "pigeonhole principle": if you have exactly N places where you can put something, but more than N items to store, then at least one place unavoidably ends up containing more than one item! This explains why a "perfect" compressor that makes every possible input (file) smaller cannot exist in reality. Or, in other words, if the compressor makes some inputs (files) smaller, there also have to be some files that become bigger.
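
A tiny counting sketch (my own illustration, not from the quoted post) makes the pigeonhole argument concrete: for every n there are 2^n bit strings of length n, but only 2^n - 1 bit strings that are strictly shorter (including the empty string), so no mapping can make all of them shrink without two inputs colliding on the same output:

Code:
# Pigeonhole check (illustration): compare the number of length-n inputs
# with the number of strictly shorter bit strings available as outputs.
for n in range(1, 9):
    inputs  = 2 ** n                         # all bit strings of length n
    outputs = sum(2 ** k for k in range(n))  # all strings of length < n  (= 2^n - 1)
    print(f"n={n}: {inputs:3d} inputs, only {outputs:3d} shorter outputs "
          f"-> at least {inputs - outputs} input(s) cannot shrink")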

If a "perfect" compressor (which makes every input smaller by at least one byte) did exist, you could apply it recursively again and again, until eventually every input gets compress to a single byte (or even zero byte?). Clearly makes no sense
__________________
Go to https://standforukraine.com/ to find legitimate Ukrainian Charities 🇺🇦✊

Last edited by LoRd_MuldeR; 29th February 2016 at 22:51.