java.util.zip and Perl's Compress::Zlib::compress Header mismatch

Post by George K » Thu, 08 Apr 2010 06:15:48


Hello all,

I am not sure whether this is the right place for this question but I
am willing to give it a try.

The problem is that when I compress a string in memory, the Java and
Perl versions of the compressed string differ in the second byte.
Please note that I used compression level 0 (i.e. no compression), so
I expect the zlib header followed by the string itself. Here are the
resulting compressed hex values as seen in emacs's hexl-mode. The
difference is in the second byte; the rest is the same.

Java:
78da 0105 00fa ff41 4141 4141 03d4 0146

Perl:
7801 0105 00fa ff41 4141 4141 03d4 0146

My system is:
Linux athina 2.6.30.3 #3 SMP Tue Jul 28 11:56:46 PDT 2009 i686
Intel(R) Core(TM)2 Duo CPU T7250 @ 2.00GHz GenuineIntel GNU/Linux

This is perl, v5.6.2 built for i686-linux

java version "1.6.0_14"
Java(TM) SE Runtime Environment (build 1.6.0_14-b08)
Java HotSpot(TM) Server VM (build 14.0-b16, mixed mode)

Here are the sample programs if you would like to try them yourself:

=============================================
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;

class TestCompress {

    private String s = "AAAAA";

    public static void main(String[] args) {

        // Encode a String into bytes
        TestCompress tc = new TestCompress();
        String inputString = tc.s;
        byte[] input = inputString.getBytes();

        System.out.println(tc.s);
        System.out.println("SIZE: " + input.length);

        // Compress the bytes (level 0 = no compression)
        System.out.println("STEP 1: Compressing...");
        byte[] output = new byte[500];
        Deflater compresser = new Deflater(0);
        compresser.setInput(input);
        compresser.finish();
        int compressedDataLength = compresser.deflate(output);
        compresser.end();
        System.out.println("...yields...");
        System.out.println(new String(output, 0, compressedDataLength));
        System.out.println("SIZE:" + compressedDataLength);

        // Write the raw bytes. A FileWriter would push them through a
        // character encoding, which can corrupt binary data.
        try {
            FileOutputStream out = new FileOutputStream("java_compress.zip");
            out.write(output, 0, compressedDataLength);
            // Close the output stream
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
===============================================

#!/usr/bin/env perl
use strict;
use warnings;
use English;

use Carp qw();
use Compress::Zlib qw();

eval {
    main(@ARGV);
};
if ($EVAL_ERROR) {
    print STDERR "Exception raised: " . $EVAL_ERROR . "\n";
    exit 1;
}

exit 0;


sub main {

    my ($s) = @ARG;

    if (! $s) {
        $s = "AAAAA";
    }
    print $s . "\n";
    print "SIZE: " . length($s) . "\n";
    print "Step 1: Compressing...\n";
    # Level 0 = no compression
    my $sCompressed = Compress::Zlib::compress($s, 0);
    print "...yields...\n";
    print $sCompressed . "\n";
    print "SIZE: " . length($sCompressed) . "\n";

    # Write the raw bytes; binmode prevents newline or encoding translation
    open(my $out, ">", "perl_compressed.zip")
        or die "Cannot open perl_compressed.zip: $!";
    binmode($out);
    print $out $sCompressed;
    close($out);
}
==============================================

Thank you in advance,

George
 
 
 

Post by Thomas Por » Thu, 08 Apr 2010 21:33:52

According to George K. < XXXX@XXXXX.COM >:

That's just the header. See RFC 1950 for the Zlib header, and
RFC 1951 for the Deflate format. With details:

78da 0105 00fa ff41 4141 4141 03d4 0146

splits into:

78da
Zlib header:
78 compression method (8) and compression info (7). Method 8
is "Deflate" and 7 means "... with a 32K window".
da flags

0105 00fa ff41 4141 4141
the compressed data (a single Deflate block):
01 block header: last block, type "stored" (not compressed)
0500 the block length (5 bytes, little-endian)
faff one's complement of the block length
... then the block data, i.e. your five "A"

03d4 0146
Adler-32 checksum of the uncompressed data (big-endian)
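
To make the breakdown above easy to check, here is a minimal Java sketch that walks the same 16 bytes. The class and method names (ZlibStoredDump, fromHex, windowSize, storedLength, adlerMatches) are made up for illustration, and the code assumes the stream holds exactly one stored block right after the two header bytes, as in your example. It is not a general zlib parser.

```java
import java.util.zip.Adler32;

class ZlibStoredDump {
    // Turn a hex dump such as the one in the post into bytes (spaces are ignored).
    static byte[] fromHex(String hex) {
        String h = hex.replace(" ", "");
        byte[] out = new byte[h.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(h.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Window size announced by the first header byte: low nibble is the
    // method (8 = Deflate), high nibble is log2(window) - 8.
    static int windowSize(byte[] z) {
        int cmf = z[0] & 0xff;
        if ((cmf & 0x0f) != 8) {
            throw new IllegalArgumentException("compression method is not Deflate");
        }
        return 1 << ((cmf >> 4) + 8);
    }

    // Length of the stored block. Assumption: one stored block at offset 2.
    static int storedLength(byte[] z) {
        int blockHeader = z[2] & 0xff;
        if (((blockHeader >> 1) & 3) != 0) {
            throw new IllegalArgumentException("first block is not a stored block");
        }
        int len = (z[3] & 0xff) | ((z[4] & 0xff) << 8);   // little-endian
        int nlen = (z[5] & 0xff) | ((z[6] & 0xff) << 8);  // one's complement of len
        if ((len ^ 0xffff) != nlen) {
            throw new IllegalArgumentException("LEN and NLEN do not match");
        }
        return len;
    }

    // Compare the 4-byte big-endian trailer with Adler-32 of the raw payload.
    static boolean adlerMatches(byte[] z) {
        int len = storedLength(z);
        Adler32 adler = new Adler32();
        adler.update(z, 7, len);
        long stored = 0;
        for (int i = 0; i < 4; i++) {
            stored = (stored << 8) | (z[7 + len + i] & 0xff);
        }
        return stored == adler.getValue();
    }

    public static void main(String[] args) {
        byte[] z = fromHex("78da 0105 00fa ff41 4141 4141 03d4 0146");
        System.out.println("window size:   " + windowSize(z));
        System.out.println("stored length: " + storedLength(z));
        System.out.println("Adler-32 ok:   " + adlerMatches(z));
    }
}
```

The trailer is computed over the five "A" bytes themselves, which is why it is identical in the Java and Perl outputs.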

So your two compressed strings differ only in the flags byte.
The five low bits of that byte are a check value, chosen so that the two
header bytes read as a 16-bit big-endian number are a multiple of 31, so
what really changes is the set of three most significant bits of the flags
byte. You have "110" for the Java compression, and "000" for the Perl
compression. The first two bits are the "compression level"; the third bit
indicates whether there is a preset dictionary or not.
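
If you want to see those bits for yourself, the following small Java sketch pulls the flags byte apart for both of your outputs. The class name ZlibFlags and its helper methods are hypothetical names chosen for this example; the bit layout is the one given in RFC 1950.

```java
class ZlibFlags {
    // Top two bits of the flags byte: the advertised compression level (0..3).
    static int flevel(int flg) {
        return (flg >> 6) & 3;
    }

    // Bit 5: set when a preset dictionary is used.
    static boolean fdict(int flg) {
        return (flg & 0x20) != 0;
    }

    // Low five bits: the check value.
    static int fcheck(int flg) {
        return flg & 0x1f;
    }

    // The two header bytes, read as a big-endian 16-bit number,
    // must be a multiple of 31.
    static boolean headerValid(int cmf, int flg) {
        return ((cmf << 8) | flg) % 31 == 0;
    }

    public static void main(String[] args) {
        int cmf = 0x78;
        int[] flags = { 0xda, 0x01 };   // Java output, Perl output
        for (int flg : flags) {
            System.out.println("flags 0x" + Integer.toHexString(flg)
                    + ": level=" + flevel(flg)
                    + " dict=" + fdict(flg)
                    + " check=" + fcheck(flg)
                    + " valid=" + headerValid(cmf, flg));
        }
    }
}
```

Both headers pass the multiple-of-31 test; only the two level bits differ, and the check value changes with them to keep the header valid.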

So they differ only by the "compression level". "11" means "I tried real
hard to compress" and "00" means "I used the fastest algorithm I know
of". This has _no bearing whatsoever_ on the decompression process. The
two strings are equally valid Zlib strings, and they both decompress to
the same "AAAAA" data. Those two bits are merely a kind of propaganda:
Java claims "I am a very good compressor: I compress data into really
small bits", while Perl says "I am a very good compressor: I do my job
really fast".
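
A quick way to convince yourself is to feed both hex strings to java.util.zip.Inflater. The sketch below does that; InflateBoth, fromHex and inflate are names invented for the example, and the 64-byte output buffer is an assumption that only suits tiny inputs like yours.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

class InflateBoth {
    // Turn a hex dump into bytes (spaces are ignored).
    static byte[] fromHex(String hex) {
        String h = hex.replace(" ", "");
        byte[] out = new byte[h.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(h.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decompress a small zlib stream and return it as ASCII text.
    static String inflate(byte[] z) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(z);
            byte[] buf = new byte[64];   // assumption: the output fits in 64 bytes
            int n = inflater.inflate(buf);
            return new String(buf, 0, n, StandardCharsets.US_ASCII);
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        String fromJava = inflate(fromHex("78da 0105 00fa ff41 4141 4141 03d4 0146"));
        String fromPerl = inflate(fromHex("7801 0105 00fa ff41 4141 4141 03d4 0146"));
        System.out.println("Java stream -> " + fromJava);
        System.out.println("Perl stream -> " + fromPerl);
        System.out.println("same data:     " + fromJava.equals(fromPerl));
    }
}
```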



More generally speaking, decompression with Deflate is deterministic,
but not compression. The compressor has a fair amount of choice in how a
given input data is to be compressed. You may expect such differences
when not using the exact same version of the exact same software.
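
As a rough illustration of that last point, the sketch below compresses one input at several levels with java.util.zip.Deflater and checks that each result inflates back to the same bytes. LevelRoundTrip, deflate and inflate are names made up for this example, and the 1000-byte run of "A" is just an arbitrary repetitive input. The exact compressed bytes depend on the zlib build in use, so the code compares results instead of hard-coding them.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class LevelRoundTrip {
    // Compress with the given level (0 = stored, 9 = best compression).
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress a complete zlib stream.
    static byte[] inflate(byte[] z) {
        Inflater inflater = new Inflater();
        inflater.setInput(z);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    break;   // truncated input; stop instead of spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid zlib stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = new byte[1000];
        Arrays.fill(input, (byte) 'A');
        for (int level : new int[] { 0, 1, 9 }) {
            byte[] z = deflate(input, level);
            boolean same = Arrays.equals(inflate(z), input);
            System.out.println("level " + level + ": " + z.length
                    + " compressed bytes, round trip ok: " + same);
        }
    }
}
```

The compressed outputs differ from level to level, yet every one of them decompresses to the original input, which is all a decompressor cares about.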


--Thomas *** in