Help on discovering Compression algorithm

Post by hddsit » Tue, 07 Jun 2005 06:06:23


Hi,

I have a compressed file (ROM) and I'm trying to discover what
algorithm was used to compress it. This is the first part of the
compressed data

Position           : 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12
Uncompressed stream: 43 6F 70 79 72 69 67 68 74 20 31 39 38 35 2D 31 39 38 38
Uncompressed ASCII : C  o  p  y  r  i  g  h  t     1  9  8  5  -  1  9  8  8
Compressed stream  : 43 6F 70 79 72 69 67 68 74 20 31 39 38 35 2D 02 04 F0 0F 38
Compressed ASCII   : C  o  p  y  r  i  g  h  t     1  9  8  5  -  .  .  ?  .  8

The original text is "Copyright 1985-1988"; the compressed version is
"Copyright 1985-" followed by the bytes 0x02 0x04 0xF0 0x0F 0x38.

I discovered that 0x02 is the length of the compressed string minus 1
("198"); it is compressed because this string was already present in
"1985". 0x04 is the position of "198" in "1985" relative to the current
position, minus 1. In other words, 0x02 0x04 means that at this position
the decoder should write 3 (2+1) bytes that were already present in the
stream 5 (4+1) bytes back.
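The (length, distance) copy described above is the core of LZ77-style decoding. A minimal sketch in Python (function name is mine, not from the ROM):

```python
def copy_match(out: bytearray, length_byte: int, dist_byte: int) -> None:
    """Apply one back-reference as described above:
    length = length_byte + 1, distance = dist_byte + 1.
    Copies byte-by-byte so overlapping matches still work."""
    length = length_byte + 1    # 0x02 -> 3 bytes
    distance = dist_byte + 1    # 0x04 -> 5 bytes back
    for _ in range(length):
        out.append(out[-distance])

buf = bytearray(b"Copyright 1985-")
copy_match(buf, 0x02, 0x04)     # copies "198" from 5 bytes back
print(buf.decode())             # Copyright 1985-198
```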

What I don't know yet is what "0xF0 0x0F" means.

Other cases in the same file:

Position           : 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
Uncompressed stream: 4C 74 64 2E 20 20 20 43 6F 70 79 72 69 67 68 74 20 31 39 38 38 2D 32
Uncompressed ASCII : L  t  d  .           C  o  p  y  r  i  g  h  t     1  9  8  8  -  2
Compressed stream  : 4C 74 64 2E 20 20 20 0C 2F A0 38 2D 32
Compressed ASCII   : L  t  d  .           .  /  ?  8  -  2

In this case "Copyright 198" was replaced by 0x0C 0x2F 0xA0. 0x0C is the
length minus 1, 0x2F the relative position minus 1, but what is 0xA0?

Another one,

Position           : 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72
Uncompressed stream: 20 72 69 67 68 74 73 20 72 65 73 65 72 76 65 64
Uncompressed ASCII :    r  i  g  h  t  s     r  e  s  e  r  v  e  d
Compressed stream  : 04 5F B0 73 20 72 65 73 65 72 76 65 64
Compressed ASCII   : .  .  ?  s     r  e  s  e  r  v  e  d

0x04 = length - 1
0x5F = relative position - 1
0xB0 = ?


Any ideas will be really appreciated..

Thank you for your help!

hdd

Help on discovering Compression algorithm

Post by Jacek Wawr » Tue, 07 Jun 2005 09:03:25


I think the encoding could be as follows:

8 bits (= 0x02)   -> length of string - 1
12 bits (= 0x004) -> distance to string - 1
4 bits (= 0xF)    -> number of uncompressed bytes following the string
(0..14), or the value 0xF, which means there are more than 14
uncompressed bytes
8 bits (= 0x0F)   -> optional number of uncompressed bytes above 14 (a
value of 0 means 15, so in this case there are 30 uncompressed bytes),
or the value 0xFF, which means there are more than 269 uncompressed
bytes following the string

and so on...
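Jacek's proposed field layout can be sketched as a small parser. The byte positions of the 12-bit distance and the literal-count nibble below are my reading of his hypothesis, not a verified spec:

```python
def parse_token(data: bytes, i: int):
    """Parse one back-reference header under the guessed layout:
    byte 0: length - 1; byte 1 + low nibble of byte 2: distance - 1
    (12 bits); high nibble of byte 2: literal count, 0xF = escape."""
    length = data[i] + 1                                       # 0x02 -> 3
    distance = ((data[i + 2] & 0x0F) << 8 | data[i + 1]) + 1   # 0x004 -> 5
    literals = data[i + 2] >> 4                                # 0..14 or 0xF
    i += 3
    if literals == 0xF:              # escape: extended count in next byte
        literals = 15 + data[i]      # 0x00 means 15, so 0x0F means 30
        i += 1
    return length, distance, literals, i

# The bytes from the first example: 0x02 0x04 0xF0 0x0F
print(parse_token(bytes([0x02, 0x04, 0xF0, 0x0F]), 0))  # (3, 5, 30, 4)
```

This reproduces the numbers in the post: a 3-byte match, 5 bytes back, followed by 30 uncompressed bytes.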


By the way, I guess the original text sounds like:

"Copyright 1985-1988 Phoenix Technologies Ltd. Copyright
1988-2000 Dell Computer Corporation. All rights reserved."

isn't it? ;)


Jacek Wawrzaszek


Help on discovering Compression algorithm

Post by hddsit » Tue, 07 Jun 2005 16:17:27

Thank you.. I will check that..
Yep ;) that's the original string ;)..

Thank you very much!

hdd

Help on discovering Compression algorithm

Post by Matt Mahon » Wed, 08 Jun 2005 00:02:44


Looks like LZOP (byte-aligned LZ77, very fast decompression)
http://www.yqcomputer.com/

-- Matt Mahoney

Help on discovering Compression algorithm

Post by hddsit » Thu, 09 Jun 2005 22:41:09

HI Matt,

Do you have any link (besides the source code) that explains how the LZO
algorithm works? I compressed a small text file with lzop and it doesn't
seem to be the same algorithm..

Alberto

Help on discovering Compression algorithm

Post by Matt Mahon » Fri, 10 Jun 2005 05:39:30


The library has a lot of variants (unfortunately). I had to look at
the source code to figure out the details of what it is doing, but it
is basically byte-oriented LZ77. You might not have LZOP but something
they wrote themselves.

-- Matt Mahoney

Help on discovering Compression algorithm

Post by Jim Leonar » Fri, 10 Jun 2005 05:57:50


This is my main beef with LZO. Variants wouldn't be a problem if they
were documented, but all attempts at getting docs for any one
variant's format were met with "look at the code". The code is a mess
with craploads of #define blocks and macros for different architectures
and optimizations. Even "minilzo", which is supposed to be the
pluggable/portable version, is just as difficult to follow! The few
answers I got to satisfy my curiosity were obtained by looking at the
x86 assembler source for a decoder. :-P

I will say, though, that it was educational looking at said assembler
sources because it demonstrated that LZO is *not* the fastest
compression/decompression method possible on any x86 platform, as it is
nearly universally regarded to be. Discovering that gave me enough
drive to create my own 8088 byte-aligned compression library. While my
library achieves about 10% worse compression on average than LZO's
fastest (compression) variant, it is about 25% faster.

I do concede that LZO is most likely the fastest *portable* compression
library. As always, if you have to implement fast compression for a
particular piece of hardware, always start with the hardware's
strengths and weaknesses and build up from there. (In particular, I
designed my 8088 compression library around x86's powerful string
instructions, specifically REP STOSB and REP MOVSB. For x86 freaks: I
didn't use MOVSW and STOSW because they hurt compression and the extra
processing to deal with odd lengths slowed things down; besides, on the
8088 there is no advantage, although there IS one on the 8086 and
above.)
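The REP MOVSB semantics Jim mentions matter for LZ77 decoders because a match may overlap its own output: a forward byte-by-byte copy with distance 1 turns a single seed byte into a run (run-length encoding for free), whereas a block copy would read stale data. An illustration of the same semantics in Python (function name is mine):

```python
def overlapping_copy(out: bytearray, distance: int, length: int) -> None:
    # Forward byte-by-byte copy, i.e. REP MOVSB semantics: each byte
    # written is immediately available as a source for the next read.
    for _ in range(length):
        out.append(out[-distance])

run = bytearray(b"\xFF")       # one literal 0xFF seeds the run
overlapping_copy(run, 1, 7)    # distance 1 replicates it: cheap RLE
print(run.hex())               # ffffffffffffffff (8 bytes of 0xFF)
```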

Help on discovering Compression algorithm

Post by hddsit » Fri, 10 Jun 2005 18:40:05

I'm still trying to understand the algorithm from comparing compressed
and uncompressed data, this is what I know:

[ ] = Byte

[LL] [DS] [U1|UK] [U2]



LL: length of the original string - 1. Range: 0x01-0xFF (upper limit not
confirmed)
DS: distance to string - 1. Range: 0x02-0xFF (upper limit not confirmed)
U1: number of uncompressed bytes following the string - 1. Range:
0x0-0xD (1 to 14 bytes after the string). If there are more than 14
bytes after the string then U1 = 0xF and the optional U2 byte is used.
UK: unknown, probably the upper nibble of DS ==> 12-bit DS
U2: number of uncompressed bytes after the string. Range: 0x00-0xFF
(upper limit not confirmed). U2 = 0x00 => 15 uncompressed bytes. Only
exists if U1 = 0xF.
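The [LL] [DS] [U1|UK] [U2] summary above can be turned into a sketch of a decoder loop. The field semantics follow the guesses in this thread and are unverified; the nibble assignment (U1 high, UK low) is one possible reading:

```python
def decode(tokens: bytes, history: bytes) -> bytes:
    """Hypothetical decoder for the [LL][DS][U1|UK][U2?] layout."""
    out = bytearray(history)
    i = 0
    while i + 3 <= len(tokens):
        ll = tokens[i]                  # LL: match length - 1
        ds = tokens[i + 1]              # DS: low 8 bits of distance - 1
        u1 = tokens[i + 2] >> 4         # U1: literal-count nibble
        uk = tokens[i + 2] & 0x0F       # UK: high 4 bits of distance
        i += 3
        distance = (uk << 8 | ds) + 1
        for _ in range(ll + 1):         # replay the match byte-by-byte
            out.append(out[-distance])
        if u1 == 0xF:                   # escape: count carried in U2
            lits = 15 + tokens[i]       # U2 = 0x00 means 15
            i += 1
        else:
            lits = u1 + 1               # 0x0..0xD -> 1..14 literals
        out += tokens[i:i + lits]       # copy the literals verbatim
        i += lits
    return bytes(out)

# First example: 0x02 0x04 0xF0 0x0F, then 30 (made-up) literal bytes
print(decode(bytes([0x02, 0x04, 0xF0, 0x0F]) + b"A" * 30,
             b"Copyright 1985-")[:18].decode())  # Copyright 1985-198
```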

There's a case of 2 compressed strings in a row that I cannot figure
out how they were encoded:

Original String: "All rights"
Encoded String: "A" 0x02 0x1A 0x04 0x5F 0xB0 "s"

0x02 0x1A points to the string "ll " ==> 0x02 = 3 bytes long, 0x1A =
position: 27 bytes back
0x04 0x5F 0xB0 points to the string "right" ==> 0x04 = 5 bytes long,
0x5F = position: 96 bytes back

How would a decoder know that 0x04 is the first byte of a new encoded
string??

There is also a long sequence of 0xFF's in the uncompressed file that
seems to be encoded as a sequence of 0xEF 0xFF 0xEF 0xFF.... that I'm
still trying to understand..

any ideas?

hdd

Help on discovering Compression algorithm

Post by Jacek Wawr » Sat, 11 Jun 2005 17:27:34


Short strings might be encoded in 2 bytes. Some high bits of the second
byte (possibly 2 or 3) may indicate the number of literals following the
string (0 in this case).



AFAIK long sequences of 0xff bytes are very common in ROMs, so they
can have some special encoding.


Jacek Wawrzaszek