Recompressing poorly compressed files

Post by aguydud » Sat, 06 Dec 2008 11:59:59


Yes, you can compress already compressed files, provided the initial
compression algorithm was not very good and the second one is. You
normally won't get a smaller file this way than you would by just
compressing with a good algorithm in the first place, of course. If
I claimed I could somehow compress things over and over again, I'd
be pelted with the pigeonhole principle. But starting from a poorly
compressed file and somehow producing a smaller file (e.g. by
decompressing it and recompressing it) probably is not controversial,
particularly if you know which algorithm was used to do the initial
compression.

The idea below seems kind of obvious, but I've never seen it, so feel
free to point out where it's been done.

Let us suppose I come up with 65535 algorithms, plus the null (do
nothing) algorithm, for what I'm about to discuss. Now all my files
are going to be no more than an extra 2 bytes long to implement the
idea (ones for which the idea fails also get an extra two bytes).
Any time I encounter an algorithm that sometimes manages to make a
compressed file twice as big as the uncompressed file, I wonder why
they didn't just make all their compressed files 1 byte bigger (or
one bit; they can often use the other 7 bits) so they could add a
flag indicating whether the compression worked. Yes, I know failed
compression needs to account for things like filenames, attributes,
etc., so you'll probably increase size by more than 1 byte. But that
is no excuse for doubling your file size when your algorithm fails.
:::Glares at RLE:::
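
Something like this toy sketch (Python, purely illustrative; the
three stdlib codecs stand in for the 65535 hypothetical algorithms,
and ID 0 is the null algorithm, so the worst case is exactly two
bytes of overhead):

import bz2, lzma, zlib

# Hypothetical algorithm table: ID 0 is reserved for "stored, no compression".
ALGORITHMS = {
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}

def pack(data: bytes) -> bytes:
    """Prefix the best result with a 2-byte algorithm ID; never grow by more than 2 bytes."""
    best_id, best_payload = 0, data              # fallback: store uncompressed
    for alg_id, (compress, _) in ALGORITHMS.items():
        candidate = compress(data)
        if len(candidate) < len(best_payload):
            best_id, best_payload = alg_id, candidate
    return best_id.to_bytes(2, "big") + best_payload

def unpack(blob: bytes) -> bytes:
    alg_id = int.from_bytes(blob[:2], "big")
    payload = blob[2:]
    if alg_id == 0:                              # stored: just strip the header
        return payload
    _, decompress = ALGORITHMS[alg_id]
    return decompress(payload)

assert unpack(pack(b"some test data" * 100)) == b"some test data" * 100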

Anyhow, let's go crazy and imagine that we really don't mind bloated
algorithms for compression and decompression, or even if they're a
bit slow. Though I can imagine a quick and dirty, incomplete but
practical implementation of this solution which increases the time
requirements of this algorithm by A and the memory requirements by B,
A being something under 3 * (decompress time for the crappy, slow
algorithm + compress time for the crappy, slow algorithm) and B being
something under the total size of GIMP and the 7-zip program + 5MB or
so.

Anyhow, the new algorithm will work as follows. Any time it detects
data that it recognizes as being compressed, it decompresses it before
running its favorite algorithm (or trying a few if you want it to run
slower). It applies this decompression step recursively if
necessary. In the case that parts of this decompression are not
reversible (e.g. 'decompressing' a GIF into a bitmap means losing some
headers), delta code is added after the header (yes, fine, we need 1
more bit to indicate whether the delta is there, but only when the
compression works).
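
A rough, incomplete sketch of that recipe (Python, my names; it only
recognizes gzip as the 'already compressed' layer and punts on the
delta step, which a real implementation would need whenever
re-wrapping isn't bit-exact):

import gzip, lzma

GZIP_MAGIC = b"\x1f\x8b"

def unwrap(data: bytes):
    """Recursively strip recognized gzip layers; return (raw data, layers removed)."""
    layers = 0
    while data[:2] == GZIP_MAGIC:
        try:
            data = gzip.decompress(data)
            layers += 1
        except (OSError, EOFError):     # looked like gzip but wasn't; stop peeling
            break
    return data, layers

def repack(data: bytes) -> bytes:
    raw, layers = unwrap(data)
    # 1 byte for the layer count, then the raw data compressed with a stronger codec.
    # NOTE: re-gzipping on restore is generally NOT bit-identical (header timestamp,
    # compression level...), so a real tool would also store a delta or the original
    # compression parameters here.
    return bytes([layers]) + lzma.compress(raw)

def restore(blob: bytes) -> bytes:
    layers, data = blob[0], lzma.decompress(blob[1:])
    for _ in range(layers):             # re-wrap in the same number of gzip layers
        data = gzip.compress(data)
    return data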

A practical example is this. If I have a CHM file full of GIFs and
try to zip it, nothing happens. If I decompile it, replace all the
GIFs with either bitmaps or PNGs containing exactly the same colors at
each pixel and then zip that, I end up with a much smaller file, even
though the data itself is sufficient to regenerate the original CHM
file...or something similar enough that a delta won't be nearly as big
as this size difference. I tried this the other day with a 600KB CHM
and got a working 500KB CHM (it wasn't the same CHM, it just looked
the same). That makes me sad. Zipping my CHM file into a smaller
file should work. Zipping 20 versions of my CHM file, each with
slight variations in their co
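
(For what it's worth, the image half of that experiment is a few lines
of Python with Pillow; the directory name below is hypothetical. PNG
is lossless, so the pixel data survives even though the GIF-specific
headers do not:)

from pathlib import Path
from PIL import Image                    # assumes Pillow is installed

def gif_to_png(gif_path: Path) -> Path:
    """Convert a (single-frame) GIF to a PNG with identical pixels."""
    png_path = gif_path.with_suffix(".png")
    with Image.open(gif_path) as im:
        im.save(png_path)                # lossless: same color at every pixel
    return png_path

for gif in Path("decompiled_chm").glob("*.gif"):   # hypothetical extraction dir
    gif_to_png(gif)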
 
 
 

Recompressing poorly compressed files

Post by schnaade » Sat, 06 Dec 2008 17:09:56

This idea is used in Precomp. Have a look at http://www.yqcomputer.com/
The main algorithm handled for decompression/recompression there is
zLib/deflate, which is used in many common files (PDF/ZIP/JAR/SWF/PNG/
ODT/GZ/SVGZ/SIS...), but GIF recompression using LZW and JPG
recompression using packJPG are implemented, too.
The difficult part is not getting the data decompressed, but getting
it recompressed bit-to-bit identical. A lossy approach would also be
possible, but it could lead to errors if meta-data exists, e.g. a
container file that stores information about the zLib streams.
That's why CHM isn't implemented yet: it's hard to get the exact same
file back from recompression.
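
To make the "bit-to-bit identical" problem concrete, here is a toy
sketch (Python, nothing to do with Precomp's actual code): it scans a
file for zlib headers, inflates each candidate stream, and then checks
whether deflating it again at any standard level reproduces the
original bytes exactly, which is where naive recompression usually
fails:

import zlib

ZLIB_HEADERS = (b"\x78\x01", b"\x78\x9c", b"\x78\xda")   # common zlib first bytes

def find_zlib_streams(data: bytes):
    """Yield (offset, raw, compressed_length) for each complete zlib stream found."""
    for off in range(len(data) - 1):
        if data[off:off + 2] not in ZLIB_HEADERS:
            continue
        d = zlib.decompressobj()
        try:
            raw = d.decompress(data[off:])
        except zlib.error:
            continue
        if not d.eof:                    # incomplete stream, ignore
            continue
        used = len(data) - off - len(d.unused_data)
        yield off, raw, used

def recompresses_bit_identical(original_stream: bytes, raw: bytes) -> bool:
    """Can stock zlib settings regenerate the exact original compressed bytes?"""
    return any(zlib.compress(raw, level) == original_stream for level in range(10))

In practice the deflate output also depends on window size, memory
level, strategy and the exact zlib build, which is exactly what makes
the bit-exact requirement the hard part.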

Greetings,
schnaader

 
 
 

Recompressing poorly compressed files

Post by Thomas Ric » Sun, 07 Dec 2008 04:25:18

XXXX@XXXXX.COM wrote:

If you have a better compressor, yes.


Sure enough.


No, it's surely possible. However, the combined compressor will still
not be able to compress all files. It will also expand some files,
though likely different files than the individual compressors would.


Which is fine.


A couple of compressors actually work that way. ZIP can store files
uncompressed, for example. Of course, the reason this does not
contradict the pigeonhole principle is that you need to store the
information that the file wasn't compressed, so the file does get (a
bit - one bit) larger.
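
For illustration, Python's zipfile module exposes this per-entry
choice directly (toy example, hypothetical file names); the archive
records for each member whether it was stored or deflated, which is
precisely that extra bit of bookkeeping:

import os, zipfile, zlib

def add_smallest(zf: zipfile.ZipFile, name: str, data: bytes) -> None:
    """Store the entry uncompressed if deflate would not actually shrink it."""
    deflated = zlib.compress(data, 9)        # rough estimate of the deflated size
    method = zipfile.ZIP_DEFLATED if len(deflated) < len(data) else zipfile.ZIP_STORED
    zf.writestr(name, data, compress_type=method)

with zipfile.ZipFile("example.zip", "w") as zf:      # hypothetical archive name
    add_smallest(zf, "random.bin", os.urandom(4096)) # incompressible -> stored
    add_smallest(zf, "text.txt", b"hello " * 1000)   # compressible -> deflated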




It is also practical for compiled (object) code. There is surely
redundancy in those files: some opcodes are rather frequent (mov) and
others are relatively rare, so you can definitely compress them (on
average, with a good model).
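
A quick sanity check of that claim (toy Python, hypothetical path;
the ratio will vary with platform and toolchain):

import lzma
from pathlib import Path

obj = Path("/usr/bin/ls")                 # hypothetical: any compiled binary
data = obj.read_bytes()
packed = lzma.compress(data, preset=9)
print(f"{obj}: {len(data)} -> {len(packed)} bytes "
      f"({100 * len(packed) / len(data):.1f}% of original)")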


You are moving in the direction of Kolmogorov complexity. It is
defined as the size of the shortest program that generates a given
output file. This definition only makes sense asymptotically, i.e.
for file sizes going to infinity, but then the language used to encode
the program no longer matters. Unfortunately, one can also show that
there is no way to compute this complexity.
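
In the usual notation: for a fixed universal machine U,

    K_U(x) = min{ |p| : U(p) = x },

and the invariance theorem gives, for any other universal machine V,
a constant c_{U,V} with

    |K_U(x) - K_V(x)| <= c_{U,V}   for all x,

which is the precise sense in which the choice of language only
matters up to an additive constant.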


Well, why? The proposed mechanism would work; nothing speaks against
it. It is probably not very practical, but that's a problem a
mathematician cannot argue about. Except that you haven't thought this
through to the very end: instead of encoding the compressor(s) you
used, come up with the shortest possible program that generates the
output in the first place, i.e. the compressor for Pi is a program
that generates all its digits, and thus this infinite digit sequence
can be compressed to a finite string. Similarly with pseudo-random
generator output: it can be compressed to the program size of the
random generator, so it is not random (thus, "pseudo").

So long,
Thomas