misc: arithmatic coder success

misc: arithmatic coder success

Post by cr8819 » Sun, 20 Mar 2005 01:02:40

I have succeeded in writing my first (functional) arithmatic coder (can
encode input, and have generated data be correctly decodable).

it was a long and difficult proccess, taking much of the day...

ammusingly, it ended up looking a lot like paq, eg:
I am encoding 1 bit at a time, calling the encode bit function with a weight
indicating the probability that the value is 0.

likewise, I have ended up shifting values in and out of the range window 8
bits at a time (I started with 1 bit, then 4, and finally 8, noting that
using a larger unit has little effect on the ratio...).
for various reasons the bytes are unaligned, but this shouldn't be too hard
to change.

(or at least, in general ways it looks sort of like paq...).

for overflow, I check if basically the midpoint collides with either the
lower or upper part of the window, if so basically shifting values into/out
of the window until the range is normalized (again, ending up as 8 bit
operations as this didn't seem to hurt anything).
(all this sounds simple enough, but it was ammusingly difficult to come up
with this).

this came up as a lot simpler-seeming than the "underflow bits" scheme being
used by another encoder I have looked at a lot more (which seems to limit
everything to 1 bit at a time, and I was unable to get to work...).

for testing, I was using it in combination with a huffman encoder (eg:
arithmatic assisted huffman).
ratios aren't so good, it seems the coding doesn't really "shine" here (both
sides of a huffman node are pretty close wrt weight...). it would probably
do better if the bit-depth were fixed, eg, sorting values by probability,
and effectively encoding the index as 8 bits, but having each bit weight be
based on the weight distribution at that bit.

I am thinking it could be possible, eg, to encode values just using a simple
"curve" giving the weight for each bit in a value.

speed is acceptable I guess.
in combination with huffman, I am getting about 5MB/s for encode.
probably by making some changes (eg: direct byte-based access to buffers,
and more avoidance of single-bit io) I could get it faster.

likely, having it encode more directly rather than working as a backend for
a huffman encoder could improve things as well.

the current model is effectively static. this is not part of the core
encoder functions (which don't care, they just get a bit and a weight).
I guess maybe the model is the slow part?...

dunno, still more messing with this will be needed before I can determine
whether or not my ratios/speed are acceptable vs other stuff.


misc: arithmatic coder success

Post by Thomas Ric » Tue, 22 Mar 2005 18:08:32


If you just need to encode bits, this is of course fine. However, in
most cases you have an alphabet that is a bit larger than just two
letters, and in that case a more generic coder would be fine. For
reference, you find a good arithmetic coder in a Dr.Dobbs article
by Marc Nelson about arithmetic coding. Just google for these keywords
and you find it immediately.

For fast binary arithmetic coding, you might want to google for
the QM and the MQ coder. Both are patented IIRC, but as long as you just
want to play with them that should be fine. They are remarkable in the
sense that they do not need to carry out any multiplications or divisions.

The MQ/QM coder scheme uses a "bit stuffing" strategy to cover the problem
of "carry-overs" instead of the classical bit-count. This is also something
you might find interesting.

Good models should be adaptive to the data. Try this as an exercise. (-;

You can compute the quality of the code actually: Compare the output
size (in bits) with the (binary) zero-level entropy of the data. If the
coder is good, it should create a file that is very near the zero-order
entropy of the input data.

So long,


misc: arithmatic coder success

Post by cr8819 » Tue, 22 Mar 2005 20:25:01

"Thomas Richter" < XXXX@XXXXX.COM > wrote in message
news:d1m2ug$l96$ XXXX@XXXXX.COM ...
afaik it is faster to encode one bit at a time than one character at a time,
given a reduction in total computations?...

I can encode a whole byte, eg, as 8 bits easily enough.

this would be impressive.

yeah, some amount of assignments and arithmatic is needed to encode each
a division is needed per bit by the model, but not by the core encoder.

I want to stay free of patents if possible.

dunno, the teqnique I use seems to work (basically, once the window
"collapses", I generally flush it and such and renormalize the window).

my encoder is presntly shifting in/out 8 bits at a time.

somehow fpaq0 seems to avoid some problems, eg, if the probability gets too
close to either extreme, and also the case of underflow.

I don't know, the underflow checking and weight clamping likely costs some

I need them, as I seem to have problems with underflow and getting too close
to the extremes...

current model now is adaptive, albeit I lifted the algo from fpaq0, albeit
with more bits for the context...

I am generally using a 24 bit context, as too much more uses too much

I use different models, eg, for references. actually, a model that seems to
work here is giving a weight for each bit. there seems to be little real
pattern in the bits.

dunno what this is...

I am mostly just comparing against other encoders.

order-2 arithmatic coding+lz77:

for tar archive given on site:

I am now getting it to about 16.4% original size, and taking about 200s
(about 378kB/s). slow, but ok compression...

more diffiuclt is actually competing against my bytewise encoders/decoders,
which don't compress quite as well, but decompress quite a bit faster.

or such...

I am too drunk and too tired to really write much.


misc: arithmatic coder success

Post by Thomas Ric » Tue, 22 Mar 2005 20:46:09


But you loose context in that case. Consider the case where you encode
ASCII text; wouldn't it help to detect the letter "E" as an 8-bit entity
rather than just seeing the 0's and 1's of the ASCII encoding?

Yes, but this is sub-optimal. In other words, the "bit model" is typically
not suitable tfor random sources to be considered.

In the simplest case: "A model" = a set of empirical probabilities for
the symbols you want to encode.

In your case:

Compute: p_0 = probability of zero bit, p_1 probability of 1-bit. Then
compare: # of input bits * (p_0 * log(p_0) + p_1 * log(p_1))
where log() is the logarithm to the base of 2 with the length of the
output (compressed) message in bits.

The bracketed expression (p_0...log()) is called the zero-order entropy
(of the bit-model of your random source).

So long,

misc: arithmatic coder success

Post by cr8819 » Wed, 23 Mar 2005 01:59:13

"Thomas Richter" < XXXX@XXXXX.COM > wrote in message
news:d1mc61$s5p$ XXXX@XXXXX.COM ...

I don't think it makes much difference.
processing things in terms of bit-patterns seems to work.

for a large glob of source:
order-0, 9 ctx bits, 75.1% original size;
order-0.5, 16 ctx bits, 47.7% original size;
order-1, 18 ctx bits, 41.6% original size;
order-1.5, 24 ctx bits, 32.8% original size.

order 1.5 is order-1, but with 6 low order bits from the oldest byte.
generally using the .5 convention for any un-even order's.
values presently being encoded are 9 bits (only the lower 8 are used for
most typical values).

big english text file:
order-0, 9 ctx bits, 61.6% original size;
order-0.5, 16 ctx bits, 39.1% original size;
order-1, 18 ctx bits, 35.3% original size;
order-1.5, 24 ctx bits, 30.4% original size.


my experiences show that my ratios are comming out better than I was getting
out of huffman. afaik, the unaligned bit-centric model may actually help,
since there is some "smearing" between the bytes and thus a possibly
stronger association.

well, maybe, but in this case I don't have the probabilities for the bytes
themselves, rather just a prediction of the next bit from the previous bits.

for references, which are more random, I am gathering a weight for each bit

using this approach for textual data (much less context, but not as good):
83.2% original size for the large glob of source;
75.4% original size for the big english text file.

it, however, works much better on references (lengths and offsets).


the number I get is a lot larger than my output bits.
your description wasn't so clear either.
p_0: 0.613998
p_1: 0.385986
p_0*log2(c_0): 13.5
p_1*log2(c_1): 8.5
e: 634097953

where p_0 is the probability of 0 in the input, and c_0 is the output 0

computing probabilities for output instead:
p_0: 0.495941
p_1: 0.504044
p_0*log2(c_0): 10.9
p_1*log2(c_1): 11.1
e: 634177413

final output bits: 8415768

e/outbits: 75.4

changing it to be as-written, e becomes negative, but it's absolule value is
still much bigger than the output bits...


misc: arithmatic coder success

Post by Matt Mahon » Wed, 23 Mar 2005 07:02:24


IIRC this is done by rounding the least probable symbol probability to
a power of 1/2, so you give up some compression. (This is done in the
arithmetic coding mode of jpeg).

The arithmetic coder in fpaq0 is a simplified version of the one in
paq. It uses multiplication, but that is not so expensive on modern
processors. The code is within about 0.0001 bpc of the Shannon limit.

The coder requires 8 coding operations to code one byte of data by
coding the bits one at a time. This is faster than the approach of
splitting up the range into 256 segments in proportion to each symbol
probability because you don't have to add up all 256 frequencies for
normalization. However it requires the model to maintain statistics
for partial byte contexts. For a zero order model, there are 255
contexts: one context for the first bit, 2 for the second, 4 for the
third, 8 for the fourth, etc. For each context, the statistics
consists of a 0 count and a 1 count.

fpaq0 codes 9 bits per symbol, one bit to code the EOF and 8 bits to
code the byte. It could be slightly faster by coding the length
explicitly and just coding 8 bits per byte (as paq does). However,
compressed size should not be affected.

AFAIK the paq/fpaq0 coder is not covered by patents.

Older arithmetic coders use 16 bit integers to maintain the lower and
upper bounds of the range, plus a carry counter so that when you have a
range like [.011111..., .100000...] the range can still be represented
with 16 bits precision. The coder outputs one bit at a time as the
bits become known.

However paq/fpaq0 uses 32 bit integers to maintain the range, and no
carry counter, and it outputs 8 bits at a time. When the leading 8
bits of the lower and upper bounds match, a byte is output. Thus, the
range normally has 24 to 32 bits precision. A probability is
represented with 12 bits. The new range is found by dropping the low
12 bits of the range and multiplying, which normally maintains 12
significant bits of precision in the result.

Occasionally the range can straddle a byte boundary, resulting in some
precision loss. In the extreme case, the probability is forced to 1/2,
but this is rare. For example, if the range is [0x37ffffff,
0x38000000], then the next coding operation forces the range to either
[0x37ffffff, 0x37ffffff] or [0x38000000, 0x3800000], in which case the
bit is coded inefficiently, but all four bytes can be output and the
range becomes [0x00000000, 0xffffffff]. You can think of the lower
bound as followed by an infinite stream of 0 bits, and the upper bound
as followed by an infinite stream of 1 bits.

A probability with 12 bits of precision is quite adequate for high end
compression. As a rule of thumb, an error E in a probability estimate
P will result in an increase in compressed size of (E/P)^2 over the
Shannon limit, so even large errors don't have much cost.

-- Matt Mahoney

misc: arithmatic coder success

Post by cr8819 » Wed, 23 Mar 2005 10:40:06

"Matt Mahoney" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...

in mine, I went back to 8 bits per symbol (overloading/escaping some of the
low-probability ones). in general this has increased compression, given a
slight increase in the range of the contexts (this may cost some for the low
probability bytes, but this may be made up for given the small set of values
that can come after an escape, and the fact that, in general, I can have a
few more context bits in memory...).

yes, this is cool.

I am using 16 bits presently, but am considering increasing it.

yes, I wondered about that...

I guess it has an advantage that it avoids needing to use a longer
representation (eg: unsigned long long) to do the multiplications, but I was
paranoid it might hurt the accuracy.

fpaq0 style: (r>>12)*w
current style: (r*w)>>16

could also just use ull for some things, and see if my performance is hurt
too bad (current processors include 64 bit int math anyways with some ops,
but I guess some depends on if gcc uses them...).

I guess it can't hurt too bad, as I wont be doing and ull division, and x86
generates the high bits anyways (which are afaik typically just discarded).

if I am lucky, for something like: (uint)(((ull)r*(ull)w)>>32) gcc could be
generating something efficient...

hmm, good point.

however, I had found (with mine at least) that I still needed to manually
renormalize sometimes otherwise I tend to have problems with some files.
I guess the need would be greatly reduced if I were using 32 bit ranges.


however, by this point I will have basically just ripped off paq entirely,
instead of using a kind of hybrid style from 2 different arithmatic

however, doing it more like paq may allow me to squeeze out a little more
compression and speed, which would be good...

[some time has passed]

yes, a lot has been done.
now I am still using 32 bit math (long long expensive), and 12 bit
probabilities, with the ((r>>12)*w) approach.
with some math fiddling, I was able to eliminate the need for underflow
testing, and also needing to keep probabilities from ramming into one of the

basic arithmatic coder compression ratios have also gone up, so I may be
able to re-integrate this with my lz77 coder and get better ratios (possibly
<15% for the gimp sources, as 15% was what I was getting previously, and
baseline ratios have improved by approx 5% with the encoder changes...).

I have now nearly doubled the speed as well (right now about 1.2MB/s).

and such...


misc: arithmatic coder success

Post by Thomas Ric » Wed, 23 Mar 2005 18:22:49


There's no c_0 in the logarithm, there's a p_0. And I forgot a minus sign, thus:

H_0 = - (p_0 * log2(p_0) + p_1 * log2(p_1))

No output bits. p_0/p_1 are the input probabilities. Entropy is a measure
of the source bitstream.

Sorry, I forgot the minus sign. The above quantity (H_0) is always positive.

However, your encoder seems to model a random source with memory (i.e. you
predict data) in which case the entropy H of the source will be of course always
smaller than H_0.

So long,

misc: arithmatic coder success

Post by cr8819 » Wed, 23 Mar 2005 22:37:21




I don't quite get what the relevance of all this is though...

current h_0:
output bits:

misc: arithmatic coder success

Post by Matt Mahon » Thu, 24 Mar 2005 00:30:20

could be

You are in luck.

#include <stdlib.h>
int main(int argc, char** argv) {
return int((unsigned long long)atoi(argv[1])
*(unsigned long long)atoi(argv[2]) >> 32);

when compiled with g++ -S (MINGW 3.4.1, Windows Me)

.file "x.cpp"
.def ___main; .scl 2; .type 32; .endef
.align 2
.globl _main
.def _main; .scl 2; .type 32; .endef
pushl %ebp
movl %esp, %ebp
pushl %ebx
subl $20, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl 12(%ebp), %eax
addl $4, %eax
movl (%eax), %eax
movl %eax, (%esp)
call _atoi
movl %eax, %ebx
movl 12(%ebp), %eax
addl $8, %eax
movl (%eax), %eax
movl %eax, (%esp)
call _atoi
imull %ebx
movl %edx, %eax
movl $0, %edx
movl -4(%ebp), %ebx
.def _atoi; .scl 3; .type 32; .endef

So the 64 bit multiply is done with one instruction with 32 bit
operands (input in eax and ebx, result low bits in eax and high bits in
edx) and the 32 bit shift is just a register move. This is without
optimization too.

imull %ebx
movl %edx, %eax

Unfortunately, Borland doesn't know what "long long" is.

-- Matt Mahoney

misc: arithmatic coder success

Post by cr8819 » Thu, 24 Mar 2005 09:33:40

<snip fragments>

yes, cool.

yeah, that is lame, also given long long is in the c99 spec...

however, on my computer at least, trying to use long long's was comming out
slightly slower than just using unsigned ints, and I ended up using a
paq-style multiply.

then by looking at stuff, I noticed the round off from said paq-style
multiply was signifigant (caused drift and some inflation, verifiable, eg,
when doing an encode with all neutral weights), so I added another line to
adjust the results (basically computing a difference using the low order
bits and adding this into the result).

I am just a little fussy probably, I like it so that when I am using all
neutral weights, I essentially get the input back out the other end
(possibly with a little extra padding).

I may need to check again which is faster:


or I could actually utilize the feature, eg:

I just checked, yes, cool, it is a bit faster...

I need to shift it like this since I am using 16 bit weights (I don't really
know of a good/fast way to compute 32 bit ones vs. 16 bit ones and otherwise
deal with the issue that it would take more space for storing the counts to
actually utilize higher accuracies...).

ull division is afaik a lot more expensive than an extra shift anyways...

or such...

misc: gimp tarball now down to about 14.7%, albeit slow taking approx 186s
to compress (most of which going into my function for looking up strings,
for which I am considering trying a larger and shallower hash...).

oh well, so on goes the trudge...

misc: arithmatic coder success

Post by Thomas Ric » Thu, 24 Mar 2005 18:13:20


As said, it allows you to determine the quality of the encoder without
having to do "comparisons" with others since the (zero-order) entropy
gives a lower bound to the compression (with a zero-order model, that is).

Which means that with the higher-order model you're using you're
significantly better than H_0, which is a good sign. The next step to
determine the quality of the code would be to check against higher
order entropies, namely those that approximate your model better.

So long,

misc: arithmatic coder success

Post by cr8819 » Thu, 24 Mar 2005 21:48:16

"Thomas Richter" < XXXX@XXXXX.COM > wrote in message
news:d1rbvg$3gc$ XXXX@XXXXX.COM ...

ok, however, I don't know how to determine them.

by comparrison with other encoders, and endless fiddling, I can gradually
drive up the ratio, and I have managed to match ratios, eg, with winrar and
7zip (I am also within the same time domain as 7zip, but not winrar).

I have found my arithmatic coder to be the main time limiter though. I can
drop the effort the lz encoder goes through (I have changed the algo I am
using for matching, mostly as my old approach was using far too much
memory). the new one doesn't really need more memory to have better matches,
but isn't exactly that "light" either.
my new algo can go faster and still do better though, eg, by limiting the
"search depth".

however, little can be done for the arithmatic coder.
the biggie is a single spot, namely a single division done per bit to
calculate the weights, which seems to be using up nearly all the time...


somehow, this division is quite expensive...

just, I haven't found a cheaper way to get around it...
the cost: the plain entopy coder has an upper limit of about 3.3MB/s with
the "-O3" compiler option...

this greatly limits my decode speed, eg, my lz decoder is only going about
8.4MB/s and taking about 9.4s (some speed is made up by the inflation).

some minor speed gains are made by limiting my context bits (to an order-1
vs. order-2 context), eg: 2.8MB encode (28s for 79MB) and 10.7MB decode
(7.39s), but this is not signifigant (code complexity is uneffected, my
guess is this can be attributed to needing to access less memory).

dropping the order has a minor effect, eg, the ratio goes from 14.53% to
14.74% original size.

I guess this is a cost, and I guess others have probably ran into this
various hacks do little, as most of them cost more than the problem (trying
things from precomputed indices, to trying to lock the model per-entry).

no good fix seems to exist, I guess this is a limit...

my simpler bytewise encoder, which gets kind-of worse compression due to
it's lack of entropy coding (eg: presently about 17.8% original size) has
the advantage of being much faster (and I could speed it up more by
propagating, eg, some newer hashing algos and similar back into it...).

it is ammusing some, in some cases entropy coding doesn't offer much

and as an added bonus: my bytewise decoder is very much simpler...

it would just seem quite odd though, eg, if entropy encoding is worth as
little as my experience seems to be showing. it works so well by itself, and
seems so good on paper, yet, somehow, can only squeeze a few extra percent
out of lz-encoded data?...

maybe it is because in my vs. trials for reference encoding, my bytewise
scheme didn't do too much worse than the arithmatic coded scheme (eg: within
"a few bits").

why is it that I can damn near beat out bzip2 without any entopy coding
anyways? is it the allmighty lz style string matching?...

a few percent worse, 20-100x faster for the decoder...

why then do nearly all major algos use entropy coding if it makes so little
difference? maybe it is the "in-general" case, eg, it makes a big difference
(not all files are likely to compress well with lz, but may do much better,
say, with order-3 ar

misc: arithmatic coder success

Post by Matt Mahon » Fri, 25 Mar 2005 08:30:19

Floating point division is faster than int on x86 (and float is faster
than double).


It might be that BW is best suited for uniform data, and the gimp
tarball has a mix of data types. LZ is more adaptive.

files, so

gzip only uses a 32K buffer. The gimp tarball has some redundancies
that span the length of the file. Any compressor that uses a lot of
memory will do better.

misc: arithmatic coder success

Post by cr8819 » Fri, 25 Mar 2005 10:04:14

"Matt Mahoney" < XXXX@XXXXX.COM > wrote in message
news: XXXX@XXXXX.COM ...

however, simply changing the types causes a slowdown, so if it is faster I
guess the change would be more "involved" than some simple typecastes...

yes, possibly.

yes, ok.

my guess is gzip is probably a rather efficient coding, just one that
doesn't scale so well?...

my bytewise encoder isn't so efficient I guess, but wins due to it's rather
large dictionary, and has the advantage of being rather fast (at least for
in-memory decompression).

for encoding the gimp tarball, my bytewise encoder uses approx 500MB ram,
and my arithmatic coded variant uses about the same.

decode is much less, eg, around the size of the encoded data, and about 2x
the decoded size (plus any memory needed for arithmatic decoding, eg, 256kB
for an order-1 context, and about 32kB for the length/distance contexts).

I am thinking of trying a file-only decompressor for the bytewise variant
(basically, it will not use any explicit buffers), and seeing if the speed
is still good...

at present a good upper limit is a few-hundred MB at once, as otherwise the
memory requirements get too high (part of why I had changed the hashing algo
was that my older one was taking a lot more ram, and I was getting a >1GB
process image...).

a file-only decoder could at least decode much larger files.
another thought is it may be necessary (for the encoder) to divide large
files into "chunks", effectively limiting the dictionary to that chunk.

with a 64MB chunk size, memory requirements would at least be constrained.

I guess my arithmatic coded variant is better for smaller files (or bigger
ones, if one can tolerate single digit 2 and 4MB/s encode/decode speeds).
it beats out gzip at least for not-quite-so-large files (hundreds of kB to a
few MB). for small files (eg: <100k), however, gzip wins quite easily.

a thing I am noticing now is that maybe my arithmatic coder adjusts too
slowly? changing the predictor such that it keeps the counts smaller (eg:
<256) causes the ratio to improve some, and improves encode/decode speed
some (more decode, which rose to about 5.5MB/s vs about 4.1).

I guess it is probably ok, decoding the gimp tarball only takes about 15s
anyways, still this is a lot slower than the others...