Creating identical ZIP archives with Java and zip tools

Post by m.niinimak » Wed, 25 Aug 2010 19:25:20


A simple problem: we have a JavaEE-based server that accepts ZIP files,
and a JavaEE-based client that creates and uploads them. We verify
that the ZIP file is identical at client and server by MD5.

I'd like to write a client in some other language; in fact, even a
shell/wget script would do. But here's the problem: if I create a ZIP
archive with Java, it is not identical to an archive
created by command-line tools. Here's a simple example:

cat hello.txt
zip hello.txt
java makezip
[creates, code below]
ls -l

-rw-r--r-- 1 x x 174 2010-08-24 12:23
-rw-r--r-- 1 x x 140 2010-08-24 12:24

Is there a way of forcing compatibility on either the zip tool (on
Linux) or Java?


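import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class makezip {
    public static void main(String[] args) {
        String file_to_add = "hello.txt";
        byte[] buf = new byte[1024];
        try {
            String outFilename = "";
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(outFilename));
            FileInputStream in = new FileInputStream(file_to_add);
            // Add ZIP entry to output
            out.putNextEntry(new ZipEntry(file_to_add));
            int len;
            // Copy the file's bytes into the entry
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
            out.closeEntry();
            in.close();
            // Closing the stream writes the central directory
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}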

Post by Lew » Wed, 25 Aug 2010 20:20:11

You have to set the compression to be the same for both. You can't even
guarantee that client and server get the same result if they use the same tool.

Regardless, you're doing it wrong. You don't repeat the zip process on the
server, you repeat the MD5 calculation. Then you don't care if they use your
custom Java, WinZIP, jar, arc or what-have-you. The server compares the MD5
hash (or whatever hash you choose) that it calculates to the one provided by
the client.

Think about how you verify the hash of downloads. You don't recompress
things, you compare the hash value they provide to one you calculate.

You don't need compatible zip implementations. You don't even need to use the
same compressor. You don't even need to compress the files at all. Whichever
route you take, you can successfully compare well-defined hash values like MD5.
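
A rough sketch of the check described above, not code from the thread: the receiver hashes the bytes it actually received and compares the result with the hash the sender supplied. The class and method names (`FileHash`, `hexDigest`, `matches`) are made up for illustration; only `java.security.MessageDigest` and `java.nio.file.Files` are real APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class FileHash {
    /** Returns the hex-encoded digest of a stream's bytes (algorithm e.g. "MD5"). */
    static String hexDigest(InputStream in, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buf = new byte[8192];
        int len;
        while ((len = in.read(buf)) != -1) {
            md.update(buf, 0, len);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True if the received file hashes to the value the sender supplied. */
    static boolean matches(Path received, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(received)) {
            return hexDigest(in, "MD5").equalsIgnoreCase(expectedHex);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path of the received archive; args[1]: hash sent by the client
        System.out.println(matches(Paths.get(args[0]), args[1]) ? "OK" : "MISMATCH");
    }
}
```

Nothing here depends on how the archive was produced, so a shell client could send the output of a tool such as `md5sum` alongside the upload.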



Post by Screamin L » Wed, 25 Aug 2010 23:44:30



Maybe the server can change the zip contents. In that case the zip files
themselves must be the same on both client and server if their contents are
the same, of course, or the hashing must be done on the contents themselves.

Post by BGB / cr88 » Thu, 26 Aug 2010 00:23:02

ZIP+Deflate is internally non-trivial, and apart from validating the
decompressed data, what is required can't actually be done in a general
sense (unless, of course, all parties involved use the same
implementation and version of both the deflater and the zip code, ...).

basically, it is analogous to running the same source code through several
different compilers, and expecting the exact same results in each case.

the partial reason is that deflate is not just some simple transform of the
input data to the output compressed data, but actually involves a fair
amount of internal pattern matching and heuristics, and typically between
implementations there will be many minor variations in pattern-matching and
heuristic behaviors.

examples of variations:
greedy vs non-greedy string matching (always match the longest run up-front, or
allow a shorter match if the compressor guesses this will lead to a longer
match later, ...);
how far to search backwards, and the exact string lengths to check for
along the way, ... (such as due to performance tradeoffs, where always doing
max depth at max length will tend to be a little slow, especially in
non-greedy implementations which may do a lot of extra matches in searching
for the "best" strings to match, ...).
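
One way to see this kind of variation without leaving Java is to run the same input through `java.util.zip.Deflater` at two different levels, which change how hard the compressor searches for matches. This is only an illustrative sketch: the class name `DeflateLevels` and the sample text are invented, and whether the two outputs differ byte for byte depends on the input and the underlying zlib, so the program prints the comparison rather than assuming it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflateLevels {
    /** Compresses data with the given Deflater level (0-9). */
    static byte[] deflate(byte[] data, int level) {
        Deflater d = new Deflater(level);
        d.setInput(data);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) {
            out.write(buf, 0, d.deflate(buf));
        }
        d.end();
        return out.toByteArray();
    }

    /** Decompresses a complete stream produced by deflate(). */
    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inf.finished()) {
            int n = inf.inflate(buf);
            if (n == 0 && (inf.needsInput() || inf.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported stream");
            }
            out.write(buf, 0, n);
        }
        inf.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input with plenty of repeated substrings
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("hello zip, hello deflate ").append(i % 17).append('\n');
        }
        byte[] text = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] fast = deflate(text, Deflater.BEST_SPEED);
        byte[] best = deflate(text, Deflater.BEST_COMPRESSION);
        System.out.println("level 1: " + fast.length + " bytes, level 9: " + best.length + " bytes");
        System.out.println("same compressed bytes: " + Arrays.equals(fast, best));
        System.out.println("same content after inflate: "
                + Arrays.equals(inflate(fast), inflate(best)));
    }
}
```

Both streams must inflate back to the original input regardless of level; only the compressed representation is free to vary.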

within ZIP, there are also the matters of exact field settings, ...

the result then is that there tends to be some amount of internal variation
between compressed files.
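
On the field-settings side, one field that can be pinned down from Java is the entry timestamp: `ZipOutputStream` records the current time for an entry unless one has been set. The sketch below (class name `FixedFieldZip` and the timestamp value are arbitrary choices for illustration) builds the same archive twice with a fixed time. It only aims at repeatable output from the same JVM and zlib; it makes no claim about matching what the `zip` command-line tool writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class FixedFieldZip {
    // Arbitrary fixed timestamp in ms since the epoch (an assumption for this sketch)
    static final long FIXED_TIME = 1_000_000_000_000L;

    /** Builds a one-entry archive in memory with the entry time pinned. */
    static byte[] build(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(FIXED_TIME); // otherwise the current time is recorded
            out.putNextEntry(entry);
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "hello\n".getBytes(StandardCharsets.UTF_8);
        byte[] first = build("hello.txt", content);
        Thread.sleep(3000); // long enough for a "current time" stamp to change
        byte[] second = build("hello.txt", content);
        System.out.println("identical archives: " + Arrays.equals(first, second));
    }
}
```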

Post by Screamin L » Thu, 26 Aug 2010 01:58:26

I suppose I wasn't clear enough.

Consider this case:

Client sends A.ZIP to server.
Client puts file X.TXT into A.ZIP

Server puts the file X.TXT (same data) into the received A.ZIP (so it must
recompress it).

Client sends new A.ZIP again (or its hash)

Server sees hashes are different (because compression differs from
client to server) and concludes the files must be different, which is
true for zip files, but the files inside those zips are in fact the same.

So, in this case the solution would be the hashing of the uncompressed
contents of a zip in a reproducible fashion, and sending that hash
instead of the zip file hash.
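
A minimal sketch of what hashing the uncompressed contents "in a reproducible fashion" could look like. Everything about the scheme is an assumption made for illustration, not something specified in the thread: entries are sorted by name, each name is length-prefixed, SHA-256 is used, and directory entries, timestamps and other metadata are ignored. `ZipContentHash`, `contentHash` and `zipOf` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipContentHash {
    /** Digest over entry names and uncompressed bytes, in sorted name order. */
    static String contentHash(byte[] zipBytes) throws IOException, NoSuchAlgorithmException {
        // Digest of each entry's uncompressed bytes; TreeMap keeps names sorted
        Map<String, byte[]> perEntry = new TreeMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            byte[] buf = new byte[8192];
            ZipEntry e;
            while ((e = zin.getNextEntry()) != null) {
                if (e.isDirectory()) {
                    continue;
                }
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                int len;
                while ((len = zin.read(buf)) != -1) {
                    md.update(buf, 0, len);
                }
                perEntry.put(e.getName(), md.digest());
            }
        }
        MessageDigest total = MessageDigest.getInstance("SHA-256");
        for (Map.Entry<String, byte[]> en : perEntry.entrySet()) {
            byte[] name = en.getKey().getBytes(StandardCharsets.UTF_8);
            // Two-byte length prefix keeps the name/digest boundary unambiguous
            total.update((byte) (name.length >>> 8));
            total.update((byte) name.length);
            total.update(name);
            total.update(en.getValue());
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : total.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Demo helper: a one-entry archive compressed at the given level (0-9). */
    static byte[] zipOf(String name, byte[] data, int level) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.setLevel(level);
            out.putNextEntry(new ZipEntry(name));
            out.write(data);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "same data, same data, same data\n".getBytes(StandardCharsets.UTF_8);
        String a = contentHash(zipOf("X.TXT", data, 0));
        String b = contentHash(zipOf("X.TXT", data, 9));
        System.out.println("content hashes equal: " + a.equals(b));
    }
}
```

The trade-off is that both sides have to agree on this exact scheme, which is a compatibility requirement of its own.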

Post by Lew » Thu, 26 Aug 2010 02:40:03

No, the solution is to send the hash of the new zip file (with X.TXT
included) along with the new zip file and have the other end confirm
that the hash of its received file calculates to the same value.

If the "Client sends new A.ZIP again" it needs to send the hash with
it. You don't duplicate the zip on both sides! You duplicate the
calculation of the hash!

This nonsense about creating the same changes on two sides is rococo
to the extreme.

Post by Screamin L » Thu, 26 Aug 2010 03:38:23

Which it won't be in the presented case. In any other case I would first
send only the hash (with some id of the file) and let the server decide if it
has that unchanged file. If it already has it, there is no need to send
it again.

The OP did say that he needs to check if the files are the same both on the
client and on the server. Why he would want to do that, I don't know. I
just provided one case in which the exact result of the compression
algorithm matters -- same contents, different hashes (which is what
bothered him in the first place).

Your suggestion (which is the usual and quite obvious way you would do
it if you didn't have compressed files that might change their contents on
both sides) doesn't work in that case, however much rococo nonsense that
case might be.

Post by Roedy Gree » Fri, 27 Aug 2010 12:43:55

On Tue, 24 Aug 2010 03:25:20 -0700 (PDT), "m.niinimaki"
< XXXX@XXXXX.COM > wrote, quoted or indirectly quoted someone who
said :

This is true of ANY two command line or library tools. The best you
can hope for is you get the same contents when you fluff them back up.

Each utility is using its proprietary tweaks to the compression.

Further, Java fails to fill in all the indexing fields.

If you need binary identicality, you will need to run your zipper via
the commandline/exec interface. See

bit fancier than the one that comes bundled.

Roedy Green Canadian Mind Products


Post by step » Mon, 30 Aug 2010 19:13:37

The zip format already contains a checksum (CRC-32) for each file in the archive.
These checksums are verified by unzip tools.
So if the server gets the hash (such as SHA-1) of the zip and verifies it,
and then, if that is OK, unzips the content, it can be sure the files are the
same as on the client.
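
For illustration, the per-entry check can also be done by hand in Java: recompute the CRC-32 of each entry's uncompressed bytes and compare it with the value stored in the archive. This is a sketch with a made-up class name (`ZipCrcCheck`); it uses only `java.util.zip.ZipFile`, `ZipEntry.getCrc()` and `CRC32`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipCrcCheck {
    /** Recomputes each entry's CRC-32 and compares it with the stored value. */
    static boolean allEntriesIntact(Path zipPath) throws IOException {
        try (ZipFile zf = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            byte[] buf = new byte[8192];
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                if (e.isDirectory()) {
                    continue;
                }
                CRC32 crc = new CRC32();
                try (InputStream in = zf.getInputStream(e)) {
                    int len;
                    while ((len = in.read(buf)) != -1) {
                        crc.update(buf, 0, len);
                    }
                }
                if (crc.getValue() != e.getCrc()) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path of the archive to check
        System.out.println(allEntriesIntact(Paths.get(args[0])) ? "OK" : "CORRUPT");
    }
}
```

A CRC-32 match guards against accidental corruption within the archive; the hash of the whole file is still what ties the upload to what the client sent.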