## Generating Huffman codes with maximum length constraint

### Generating Huffman codes with maximum length constraint

The standard algorithm for generating Huffman codes assumes that
the codes can be arbitrarily long. Is there an efficient algorithm
to generate an optimal set of codes subject to the constraint that
no code can be longer than a specified length?

The only algorithm I have found that addresses this problem is the
one in gzip, which does not produce an optimal result. It works in
three passes:

1. Generate an optimal Huffman tree. If no code exceeds the maximum
length, we are done.

2. Modify the tree so that its depth does not exceed max_depth, where
max_depth is the maximum code length. In this step, we don't actually
perform the modifications described below. Instead, we just keep track
of the number of leaves at each depth that would exist if we performed
the modifications.

A. Remove all nodes whose depth is greater than max_depth from the
tree. Store the removed leaf nodes in set S, and discard the
non-leaf nodes.

B. Remove all non-leaf nodes whose depth is max_depth from the
tree, replacing each with a leaf node taken from set S. Remove
those leaf nodes from set S.

C. For each node N1 in set S, choose a leaf node N2 from the tree
such that
1. The depth of N2 is less than max_depth.
2. The depth of N2 is at least as great as that of any other leaf
whose depth is less than max_depth.
Remove N2 from the tree, and replace it with a non-leaf node
whose children are N1 and N2.

3. Count the number of leaves at each depth in the tree produced in
step 2. Sort the leaves by symbol frequency, and assign depths
to the leaves such that the leaves representing the most frequently
occurring symbols have the smallest depth values.

The following example demonstrates that the algorithm used in gzip doesn't
produce optimal results. The first two columns below specify the frequency
counts for a set of symbols. The third column shows the code lengths for
an optimal Huffman tree, assuming that there is no limit on the length of
a code. The fourth column shows the code lengths produced by the gzip
algorithm when the maximum code length is set to three. The last column
shows a set of code lengths which are better than those produced by the
gzip algorithm.

                  -------- lengths --------
symbol  frequency  initial   gzip  optimal
  A         8         1        1      2
  B         7         2        3      2
  C         4         3        3      2
  D         1         4        3      3
  E         1         4        3      3

compressed size:     42       47     44
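To make the comparison concrete, here is a minimal Python sketch of the three passes (a hedged illustration: the function names are mine, and pass 2 follows the histogram-repair loop used in zlib's gen_bitlen(); a real implementation works on tree arrays directly). Run on the frequencies above, it reproduces the "gzip" column:

```python
import heapq
from collections import Counter

def huffman_depths(freqs):
    """Pass 1: ordinary (unrestricted) Huffman code lengths."""
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tag = len(heap)                      # unique tiebreaker for the heap
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tag,
                              {s: d + 1 for s, d in {**d1, **d2}.items()}))
        tag += 1
    return heap[0][2]

def gzip_limited_lengths(freqs, max_len):
    """Passes 2 and 3: clamp the length histogram the way zlib's
    gen_bitlen() does, then hand lengths back out by frequency."""
    depths = huffman_depths(freqs)
    count, overflow = Counter(), 0
    for d in depths.values():
        if d > max_len:
            overflow += 1                # leaf pushed up to max_len
            d = max_len
        count[d] += 1
    while overflow > 0:                  # zlib's histogram repair loop
        bits = max_len - 1
        while count[bits] == 0:
            bits -= 1
        count[bits] -= 1                 # move one leaf one level down...
        count[bits + 1] += 2             # ...making room for an overflow leaf
        count[max_len] -= 1
        overflow -= 2
    # Pass 3: the most frequent symbols get the shortest lengths.
    symbols = iter(sorted(freqs, key=lambda s: -freqs[s]))
    lengths = {}
    for d in sorted(count):
        for _ in range(count[d]):
            lengths[next(symbols)] = d
    return lengths

freqs = {"A": 8, "B": 7, "C": 4, "D": 1, "E": 1}
print(gzip_limited_lengths(freqs, 3))
# reproduces the "gzip" column: A -> 1, B..E -> 3 (47 bits total)
```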

### Generating Huffman codes with maximum length constraint

[snip]

Could you detail what causes such a requirement?
Are there any practical applications for that?
Why doesn't canonical Huffman code suit your application?

Regards,
Alex Vinokur

### Generating Huffman codes with maximum length constraint

Yes, for example for performance reasons. Limiting the code length to 32
bits can improve the performance of encoding/decoding. A Google search
for "huffman length limited" gives some interesting results.

/Jesper Nordenberg

### Generating Huffman codes with maximum length constraint

XXXX@XXXXX.COM (Kenneth Almquist) wrote in

One way, though not the fastest, is to increase the weights of the less
common symbols until you first obtain a shorter tree that matches the
length you are striving for. In this example, start at 1 and go up in
small amounts. Say you reach 2.1; then the frequency counts are
A 8
B 7
C 4
D 2.1
E 2.1

You bring D and E together for a weight of 4.2, then combine that with
the next lowest one, which is C with a weight of 4; the tree now weighs
8.2. The next two lowest nodes are A and B, so they are combined, and
then the last two are merged, giving the following:
symbol length huffman code
A 2 10
B 2 01
C 2 11
D 3 001
E 3 000

If you don't shorten, you get the following, where D and E have a frequency of 1:
symbol length huffman code
A 1 1
B 2 01
C 3 001
D 4 0001
E 4 0000
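A minimal Python sketch of this weight-inflation heuristic (my own rendering; the step size of 0.1 and the tie handling are arbitrary illustrative choices, and the result is not guaranteed optimal in general):

```python
import heapq

def huffman_lengths(weights):
    """Code lengths from an ordinary (unrestricted) Huffman build."""
    heap = [(w, i, {i: 0}) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    tag = len(heap)                      # unique tiebreaker for the heap
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tag,
                              {s: d + 1 for s, d in {**d1, **d2}.items()}))
        tag += 1
    return heap[0][2]

def shorten_by_inflation(weights, max_len, step=0.1):
    """Heuristic from the post above: raise every weight to at least a
    floor t, growing t in small steps, until the ordinary Huffman build
    first yields a tree no deeper than max_len."""
    assert len(weights) <= 2 ** max_len, "no such tree exists"
    t = min(weights)
    while True:
        depths = huffman_lengths([max(w, t) for w in weights])
        if max(depths.values()) <= max_len:
            return [depths[i] for i in range(len(weights))]
        t += step

print(shorten_by_inflation([8, 7, 4, 1, 1], 3))  # -> [2, 2, 2, 3, 3]
```

On the thread's example the floor stops near 2.1, matching the weights shown above.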

David A. Scott
--
My Crypto code
http://www.yqcomputer.com/
http://www.yqcomputer.com/
My Compression code http://www.yqcomputer.com/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said, "any cryptographic
system is only as strong as its weakest link"

### Generating Huffman codes with maximum length constraint

"Alex Vinokur" < XXXX@XXXXX.COM > asked:

Two reasons. First, with a canonical Huffman code, the tree is encoded
by encoding the length of each code. So the amount of space required to
store the tree depends on how much space is required to store the lengths.
Less space is required if you have a known maximum code length.

Second, if the maximum length of a code is reasonably small, you can
perform decoding using a single table lookup. Otherwise, you need a
more complicated scheme using multiple tables.
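To illustrate the single-lookup scheme, here is a hypothetical sketch (my own; the canonical-code convention used is one common choice, not taken from any particular codec). The table has 2^max_len entries, indexed by the next max_len input bits:

```python
def canonical_codes(lengths):
    """Canonical codes from lengths: shorter codes first, ties by symbol.
    Returns {symbol: (code_value, code_length)}."""
    codes, code, prev = {}, 0, 0
    for s in sorted(lengths, key=lambda s: (lengths[s], s)):
        code <<= lengths[s] - prev       # extend to the next code length
        codes[s] = (code, lengths[s])
        code, prev = code + 1, lengths[s]
    return codes

def build_decode_table(lengths, max_len):
    """table[next max_len input bits] = (symbol, bits to consume)."""
    table = [None] * (1 << max_len)
    for s, (code, l) in canonical_codes(lengths).items():
        base = code << (max_len - l)     # all entries sharing this prefix
        for pad in range(1 << (max_len - l)):
            table[base + pad] = (s, l)
    return table

def decode(bits, n_syms, lengths, max_len):
    """Decode n_syms symbols from a string of '0'/'1' characters."""
    table = build_decode_table(lengths, max_len)
    bits += "0" * max_len                # padding so the last peek is in range
    out, pos = [], 0
    for _ in range(n_syms):
        sym, used = table[int(bits[pos:pos + max_len], 2)]
        out.append(sym)
        pos += used
    return "".join(out)

lengths = {"A": 2, "B": 2, "C": 2, "D": 3, "E": 3}
# canonical codes here: A=00, B=01, C=10, D=110, E=111
print(decode("00" + "110" + "01", 3, lengths, 3))  # -> ADB
```

The inner loop of decode is one table index and one addition, which is the kind of body a compiler can inline.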

I am trying to read and write data structures. To read in a data
structure, I write a parser which calls get_next_symbol(encoding)
when it's ready to process the next symbol. The "encoding" parameter
specifies the Huffman codes used to represent the symbol; different
symbol encodings are used at different points in the parsing process.
The point is that get_next_symbol is called in a lot of places. I
would like it to be simple enough that the compiler will inline it.
Kenneth Almquist

### Generating Huffman codes with maximum length constraint

Thanks. The "package-merge" algorithm by Lawrence Larmore and Daniel
Hirschberg runs in O(n*L) time, where n is the number of symbols
and L is the maximum code length.[1] There's also the "warm-up"
algorithm, which is not always optimal but produces optimal results
in many cases and has a provable worst-case bound.[2, 3]

In the case of gzip, replacing the existing algorithm with an optimal
algorithm such as package-merge produces a minimal benefit. The best
result reported in [4] is that using the package-merge algorithm
reduced the size of a compressed file from 1,817,536 bytes to 1,817,533
bytes, a savings of 0.00017%. On the other hand, in the example I gave at
the start of the thread, using an optimal algorithm reduced the size
of the compressed data by 6.4%, so there are values of L and n for
which using an optimal algorithm will produce a significant benefit.
Kenneth Almquist

[1] A Fast Algorithm for Optimal Length-Limited Huffman Codes, by
Lawrence L. Larmore and Daniel S. Hirschberg
<http://www.yqcomputer.com/~dan/pubs/LenLimHuff.ps.gz>

[2] The WARM-UP Algorithm: A Lagrangean Construction of Length
Restricted Huffman Codes, by Ruy Luiz Milidiu, Eduardo Sany Laber
<ftp://ftp.inf.puc-rio.br/pub/docs/techreports/96_15_milidiu.ps.gz>

[3] Efficient Implementation of the WARM-UP Algorithm for the
Construction of Length-Restricted Prefix Codes, by
Ruy Luiz Milidiu, Artur Alves Pessoa, Eduardo Sany Laber

[4] Improved GNU gzip Data Compression, by Roger Sayle
<http://www.yqcomputer.com/>
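For reference, the basic (unoptimized) package-merge algorithm of [1] is compact enough to sketch; this is my own Python rendering, not code from the paper. On the example at the start of the thread it produces the optimal lengths:

```python
def package_merge(weights, L):
    """Optimal length-limited code lengths (Larmore-Hirschberg).
    Minimizes sum(w_i * l_i) subject to l_i <= L and Kraft's inequality."""
    n = len(weights)
    assert n <= 2 ** L, "no prefix code with n codewords of length <= L"
    if n == 1:
        return [1]
    # An item is (weight, tuple of symbol indices it covers).
    leaves = sorted((w, (i,)) for i, w in enumerate(weights))
    items = leaves
    for _ in range(L - 1):
        # Package adjacent pairs (dropping an odd item), then merge with leaves.
        packages = [(items[j][0] + items[j + 1][0],
                     items[j][1] + items[j + 1][1])
                    for j in range(0, len(items) - 1, 2)]
        items = sorted(leaves + packages)
    # Take the 2n-2 cheapest items; every appearance of symbol i in one
    # of them deepens its codeword by one level.
    lengths = [0] * n
    for _, syms in items[:2 * n - 2]:
        for i in syms:
            lengths[i] += 1
    return lengths

print(package_merge([8, 7, 4, 1, 1], 3))  # -> [2, 2, 2, 3, 3]
```

With L = 4 the length limit is slack for this input and the same routine returns the unrestricted Huffman lengths 1, 2, 3, 4, 4.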

### Generating Huffman codes with maximum length constraint

See

[1] Lawrence L. Larmore, Daniel S. Hirschberg "A Fast Algorithm for
Optimal Length-Limited Huffman Codes"

[2] Andrew Turpin, Alistair Moffat "Practical Length-Limited Codes
for Large Alphabets"

In [1] an optimal O(nL)-time algorithm is presented, where n is the
symbol count and L the maximum code word length. This algorithm is
further optimized in [2] in terms of memory demand and speed.

The problem "OLBPC" (optimal length-bounded prefix code) can be
reduced to the CCP (Coin Collector's Problem), which is in turn similar
to the knapsack problem but can be solved much faster.

The optimized algorithm presented in [2] is nearly as fast as an
optimized Huffman algorithm (still O(nL) time complexity, but much
faster than the first version by Larmore and Hirschberg).

This optimized algorithm looks surprisingly simple, though you won't
understand why and how it works by looking at the pseudo code alone;
you have to study these papers and their proofs.

HTH,
Sebastian

### Generating Huffman codes with maximum length constraint

Oops... sorry, I didn't see this message.
Yeah, the package-merge algorithm is the way to go. :)

You might be interested in the optimization of Turpin and Moffat as
well. It's a very fast version of the package-merge algo.

Ghis!
Sebastian

### Generating Huffman codes with maximum length constraint

"Kenneth Almquist" < XXXX@XXXXX.COM > wrote in message news:WKr6d.3195\$OX.2510@trndny07...
[snip]

In the context of this question, several kinds of criteria for building Huffman trees can be considered.
Here are some of them:
1) Minimum cost (weighted external path length of the tree) - the criterion of the canonical Huffman algorithm;
2) Limit on the height of the tree;
3) Limit on the sum of codeword sizes.

Minimum cost is a Data Transmission criterion;
limits on height and sum are Data Storage criteria.

These criteria can be combined into a triple-criterion.

Here is a table of possible triple-criteria for Huffman codes.

Table 1. Huffman algorithm triple-criteria
----------------------------------------------------------------------------------
| Criterion : Triple-criterion : Note |
| : priorities P[i] : |
| :--------------------------------------: |
| : P[1] : P[2] : P[3] : |
|--------------------------------------------------------------------------------|
| TC-0 : Cost : No : No : Canonical Huffman algorithm |
| TC-1 : Cost : Height : Code-Sizes : Canonical Huffman algorithm |
| TC-2 : Cost : Code-Sizes : Height : Canonical Huffman algorithm |
| TC-3 : Height : Cost : Code-Sizes : Limited Length |
| TC-4 : Height : Code-Sizes : Cost : Limited Length & Code-Sizes |
| TC-5 : Code-Sizes : Cost : Height : Limited Code-Sizes Code |
| TC-6 : Code-Sizes : Height : Cost : Limited Code-Sizes & Length |
----------------------------------------------------------------------------------
Note. 'Cost' means 'Huffman tree cost'
'Code-Sizes' means 'Sum of codeword sizes'

For instance, TC-3 means: while building a Huffman tree,
first of all one should limit the height of the tree,
after that minimize the cost of the tree,
after that minimize the sum of codeword sizes.

Of course, the Huffman trees built for different triple-criteria are usually quite different.

How does one build Huffman trees for the different triple-criteria?
The n-ary Huffman Template Algorithm was used to do that.
That algorithm can be seen at
* http://alexvn.freeservers.com/s1/huffman_template_algorithm.html
* http://sourceforge.net/projects/huffman-ta/

The Huffman Template Algorithm is not aimed at _efficiently_ generating Huffman trees.
It is aimed
* at using the _canonical_ Huffman algorithm with non-numerical weights as well,
and
* at producing _comprehensive_ output.

To build Huffman trees for the different triple-criteria, an extended weight was used.
An extended weight is a pair <actual weight, delta weight>.
The actual weight is the ordinary weight (of a symbol); the delta weight is generated during the build loop.

Here is a test set of symbols and their (actual) weights.

A 1
B 1
C 1
D 2
E 3
F 7
G 10
H 13
I 15
J 17
K 26
L 35

Table 2. Several Huffman codes for the set above.
-----------------------------------------------------------------
| Method : Cri- : Tree : Tree cost : Code- : Note |
| : te- : Height :----------------: Sizes : |
| : rion : : Actual : Delta : : |
|---------------

### Generating Huffman codes with maximum length constraint

XXXX@XXXXX.COM (Kenneth Almquist) wrote in

Actually, this is not true. There are several ways to store the tree
if one must. Storing the lengths is not the best way, not even close. As
a simple example, suppose you wish to store the tree for a file that has
the basic 256 symbols, where roughly half the symbols are used and only
three code lengths occur, say 10 bits, 8 bits, and 4 bits. You need only
store 00 for "not in use", 01 for the shortest length, 10 for the next
length, and 11 for the longest length. This carries exactly the same
information but requires less space than the actual lengths; there would
be only one way to construct the tree, so storing the actual lengths
would be a waste.

You could also use a fixed-length permutation table, or a variable-length
permutation table if one wanted, which would save even more space. You
could even do it completely bijectively to the file space, though that
would be harder to code; if needed, I suspect an adaptive approach would
be better most of the time.

As computers get faster, and since memory only seems to get cheaper,
this should seldom be a concern.

David A. Scott

### Generating Huffman codes with maximum length constraint

Thanks. The search results are very interesting.
Are there any available program implementations of building length-limited Huffman codes?

--
Alex Vinokur
email: alex DOT vinokur AT gmail DOT com

### Generating Huffman codes with maximum length constraint

[snip]

Do those algorithms (Larmore and Hirschberg's; Turpin and Moffat's) produce optimal results?

--
Alex Vinokur

### Generating Huffman codes with maximum length constraint

On Mon, 8 Nov 2004 15:55:07 +0200, "Alex Vinokur"

Although we know that standard Huffman coding is optimal, the
strict optimality of achieving the Shannon entropy holds, in general,
if and only if the length of the source sequence is infinite.

Thus, if you describe in more detail what you mean by "optimal
results", it will probably be easier for him to explain his statement again.

- James ( XXXX@XXXXX.COM )

### Generating Huffman codes with maximum length constraint

Given: a set of symbols and their weights; the size of the set is 'n'.
The size of a tree is the number of its leaf nodes.

Definition 1. A binary tree T of size 'n' is optimal if T has minimum weighted path length over all possible trees of size 'n' with
the same set of leaf nodes.
The canonical Huffman algorithm builds an optimal binary tree.

Definition 2. A length-limited binary tree T of size 'n' and maximum height 'm' is optimal if T has minimum weighted path length
over all possible trees of size 'n' and maximum height 'm' with the same set of leaf nodes.

The question is: do Larmore and Hirschberg's algorithm and Turpin and Moffat's algorithm produce optimal binary trees (optimality
meant according to Definition 2)?

--
Alex Vinokur

### Generating Huffman codes with maximum length constraint

Yes. These algorithms' solutions of code word lengths l=(l_1, l_2,
..., l_n) minimize the weighted code word length

\sum_{i=1}^{n} w_i l_i // w_i being positive weights

while satisfying

1) \sum_i 2^{-l_i} <= 1 // so you can build a prefix-free code
2) \forall i: l_i <= L // code word lengths are bounded

Sometimes there is more than one optimal solution (all with the same
weighted code word length), but there is never a better one. So
it's an optimal solution.
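For small alphabets these conditions are easy to check by brute force. A throwaway Python sketch (mine, purely illustrative) that enumerates all bounded length vectors and confirms the 44-bit optimum for the example at the start of the thread:

```python
from itertools import product

def best_limited_lengths(weights, L):
    """Brute-force the optimality conditions above: try every length
    vector with 1 <= l_i <= L, keep those satisfying Kraft's inequality
    sum 2^-l_i <= 1, and return (cost, lengths) for the cheapest."""
    best = None
    for ls in product(range(1, L + 1), repeat=len(weights)):
        if sum(2.0 ** -l for l in ls) <= 1.0:          # prefix code exists
            cost = sum(w * l for w, l in zip(weights, ls))
            if best is None or cost < best[0]:
                best = (cost, ls)
    return best

# weights from the example at the start of the thread, L = 3
print(best_limited_lengths([8, 7, 4, 1, 1], 3))  # -> (44, (2, 2, 2, 3, 3))
```

This is exponential in the alphabet size and only serves to verify small cases; package-merge gives the same answer in O(nL).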

Ghis!
Sebastian