The standard algorithm for generating Huffman codes assumes that

the codes can be arbitrarily long. Is there an efficient algorithm

to generate an optimal set of codes subject to the constraint that

no code can be longer than a specified length?

The only algorithm I have found that addresses this problem is the

one in gzip, which does not produce an optimal result. It works in

three passes:

1. Generate an optimal Huffman tree. If no code exceeds the maximum

length, we are done.

2. Modify the tree so that its depth does not exceed max_depth, where

max_depth is the maximum code length. In this step, we don't actually

perform the modifications described below. Instead, we just keep track

of the number of leaves at each depth that would exist if we performed

the modifications.

A. Remove all nodes whose depth is greater than max_depth from the

tree. Store the removed leaf nodes in set S, and discard the

non-leaf nodes.

B. Remove all non-leaf nodes whose depth is max_depth from the

tree, replacing them with leaf nodes in set S. Remove these

leaf nodes from set S.

C. For each node N1 in set S, choose a leaf node N2 from the tree

such that

1. The depth of N2 is less than max_depth.

2. The depth of N2 is at least as great as that of any other leaf

whose depth is less than max_depth.

Remove N2 from the tree, and replace it with a non-leaf node

whose children are N1 and N2.

3. Count the number of leaves at each depth in the tree produced in

step 2. Sort the leaves by symbol frequency, and assign depths

to the leaves such that the leaves representing the most frequently

occurring symbols have the smallest depth values.
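Steps 1-2 can be sketched as leaf-count bookkeeping rather than tree surgery, just as described above (a minimal sketch; `huffman_lengths` and `limit_lengths` are hypothetical helper names):

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Unconstrained Huffman code lengths via the standard heap algorithm."""
    heap = [(f, [0]) for f in freqs]   # (subtree weight, leaf depths below it)
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, d1 = heapq.heappop(heap)
        w2, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, [d + 1 for d in d1 + d2]))
    return sorted(heap[0][1])

def limit_lengths(lengths, max_depth):
    """Steps A-C as per-depth counts: clamp overflowing leaves to max_depth,
    then repair the Kraft sum by pairing each clamped leaf N1 with the
    deepest available leaf N2 shallower than max_depth."""
    count = Counter(min(l, max_depth) for l in lengths)   # steps A and B
    while sum(n * 2.0 ** -d for d, n in count.items()) > 1.0:
        d = max(dep for dep in count if dep < max_depth and count[dep] > 0)
        count[d] -= 1             # N2 leaves depth d ...
        count[d + 1] += 2         # ... and becomes two leaves (N1, N2) at d+1
        count[max_depth] -= 1     # N1 no longer sits provisionally at max_depth
    return {d: n for d, n in count.items() if n > 0}
```

On the worked example later in the thread (frequencies 8, 7, 4, 1, 1 with max_depth 3) this reproduces the gzip result: one code of length 1 and four of length 3.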

The following example demonstrates that the algorithm used in gzip doesn't

produce optimal results. The first two columns below specify the frequency

counts for a set of symbols. The third column shows the code lengths for

an optimal Huffman tree, assuming that there is no limit on the length of

a code. The fourth column shows the code lengths produced by the gzip

algorithm when the maximum code length is set to three. The last column

shows a set of code lengths which are better than those produced by the

gzip algorithm.

                      -------- lengths ---------
symbol   frequency    initial    gzip    optimal
A            8           1         1        2
B            7           2         3        2
C            4           3         3        2
D            1           4         3        3
E            1           4         3        3

compressed size:        42        47       44
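The columns can be checked mechanically: the compressed size is the frequency-weighted sum of code lengths, and each column must satisfy the Kraft inequality to be realizable as a prefix code (a quick sanity check, not part of any algorithm):

```python
freqs = {"A": 8, "B": 7, "C": 4, "D": 1, "E": 1}
columns = {
    "initial": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 4},
    "gzip":    {"A": 1, "B": 3, "C": 3, "D": 3, "E": 3},
    "optimal": {"A": 2, "B": 2, "C": 2, "D": 3, "E": 3},
}
for name, lens in columns.items():
    size = sum(freqs[s] * lens[s] for s in freqs)   # compressed size in bits
    kraft = sum(2.0 ** -l for l in lens.values())   # must be <= 1
    print(f"{name}: {size} bits, Kraft sum {kraft}")
```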

[snip]

Could you detail what causes such a requirement?

Are there any practical applications for that?

Why doesn't canonical Huffman code suit your application?

Regards,

Alex Vinokur

http://www.yqcomputer.com/

http://www.yqcomputer.com/

Yes, for example for performance reasons. Limiting the code length to 32

bits can improve the performance of encoding/decoding. A Google search

for "huffman length limited" gives some interesting results.

/Jesper Nordenberg

XXXX@XXXXX.COM (Kenneth Almquist) wrote in

One way, though not the fastest, is to increase the weight of the less

common symbols until a shorter tree is first obtained that matches the

length you are striving for. In this example, start with 1 and go up in

small amounts. Say you reach 2.1; then the frequency count is

A 8

B 7

C 4

D 2.1

E 2.1

You bring D and E together for a weight of 4.2, then combine that with

the next lowest one, which is C with a weight of 4; the tree now weighs

8.2. The next two lowest nodes are A and B, so they are combined, and

then the last two are merged, giving the following:

symbol length huffman code

A 2 10

B 2 01

C 2 11

D 3 001

E 3 000

If you don't shorten, you get the following, where D and E have a frequency of 1:

symbol length huffman code

A 1 1

B 2 01

C 3 001

D 4 0001

E 4 0000
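The heuristic described above can be sketched as a loop that raises a floor on the small weights in small steps until the resulting tree is short enough (hypothetical helper names; heap tie-breaking is arbitrary, so the exact stopping weight can vary slightly):

```python
import heapq

def code_lengths(weights):
    """Code lengths (as a multiset) from the standard Huffman heap build."""
    heap = [(w, [0]) for w in weights]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, d1 = heapq.heappop(heap)
        w2, d2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, [d + 1 for d in d1 + d2]))
    return heap[0][1]

def inflate_until_fits(weights, max_len, step=0.1):
    """Raise the rare symbols' weights to a common floor, a little at a
    time, until no code exceeds max_len."""
    floor = min(weights)
    while True:
        lens = code_lengths([max(w, floor) for w in weights])
        if max(lens) <= max_len:
            return lens
        floor += step
```

For the weights 8, 7, 4, 1, 1 with max_len 3, the loop stops once the floor reaches about 2, producing the lengths 2, 2, 2, 3, 3 shown above.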

David A. Scott

--

My Crypto code

http://www.yqcomputer.com/

http://www.yqcomputer.com/

My Compression code http://www.yqcomputer.com/

**TO EMAIL ME drop the roman "five" **

Disclaimer: I am in no way responsible for any of the statements

made in the above text. For all I know I might be drugged.

As a famous person once said "any cryptographic

system is only as strong as its weakest link"

"Alex Vinokur" < XXXX@XXXXX.COM > asked:

Two reasons. First, with a canonical Huffman code, the tree is encoded

by encoding the length of each code. So the amount of space required to

store the tree depends on how much space is required to store the lengths.

Less space is required if you have a known maximum code length.

Second, if the maximum length of a code is reasonably small, you can

perform decoding using a single table lookup. Otherwise, you need a

more complicated scheme using multiple tables.
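The single-lookup decode mentioned here can be sketched as follows: build canonical codes from the lengths, then fill a table of 2^L entries so that the next L input bits index directly to a (symbol, code length) pair. The helper names are hypothetical:

```python
def canonical_codes(lengths):
    """Assign canonical codes: shorter codes first, ties in symbol order."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (ln - prev_len)    # extend the code as lengths grow
        codes[sym] = (code, ln)
        code += 1
        prev_len = ln
    return codes

def build_decode_table(lengths, max_len):
    """2**max_len entries: table[next max_len bits] = (symbol, code length)."""
    table = [None] * (1 << max_len)
    for sym, (code, ln) in canonical_codes(lengths).items():
        start = code << (max_len - ln)   # pad the code out to max_len bits
        for i in range(start, start + (1 << (max_len - ln))):
            table[i] = (sym, ln)
    return table
```

The decoder peeks max_len bits, looks up the symbol, and consumes only the returned code length, which is a single table access per symbol.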

I am trying to read and write data structures. To read in a data

structure, I write a parser which calls get_next_symbol(encoding)

when it's ready to process the next symbol. The "encoding" parameter

specifies the Huffman codes used to represent the symbol; different

symbol encodings are used at different points in the parsing process.

The point is that get_next_symbol is called in a lot of places. I

would like it to be simple enough that the compiler will inline it.

Kenneth Almquist


Thanks. The "package-merge" algorithm by Lawrence Larmore and Daniel

Hirschberg runs in O(nL) time, where n is the number of symbols

and L is the maximum code length.[1] There's also the "warm-up"

algorithm, which is not always optimal but produces optimal results in

many cases and has a provable worst-case bound.[2, 3]

In the case of gzip, replacing the existing algorithm with an optimal

algorithm such as package-merge produces a minimal benefit. The best

result reported in [4] is that using the package-merge algorithm

reduced the size of a compressed file from 1,817,536 bytes to 1,817,533

bytes, a savings of 0.00017%. On the other hand, in the example I gave at

the start of the thread, using an optimal algorithm reduced the size

of the compressed data by 6.4%, so there are values of L and n for

which using an optimal algorithm will produce a significant benefit.

Kenneth Almquist

[1] A Fast Algorithm for Optimal Length-Limited Huffman Codes, by

Lawrence L. Larmore and Daniel S. Hirschberg

< http://www.yqcomputer.com/ ~dan/pubs/LenLimHuff.ps.gz>

[2] The WARM-UP Algorithm: A Lagrangean Construction of Length

Restricted Huffman Codes, by Ruy Luiz Milidiu, Eduardo Sany Laber

<ftp://ftp.inf.puc-rio.br/pub/docs/techreports/96_15_milidiu.ps.gz>

[3] Efficient Implementation of the WARM-UP Algorithm for the

Construction of Length-Restricted Prefix Codes, by

Ruy Luiz Milidiu, Artur Alves Pessoa, Eduardo Sany Laber

< http://www.yqcomputer.com/ %2C349071%2C1%2C0.25%2CDownload/ http://www.yqcomputer.com/ :zSzzSzftp.learn.fplf.org.brzSzpubzSzpublicacoeszSzps.gzzSzalenex99.ps.gz/milidiu99efficient.ps.gz>

[4] Improved GNU gzip Data Compression, by Roger Sayle

< http://www.yqcomputer.com/ >

See

[1] Lawrence L. Larmore, Daniel S. Hirschberg "A Fast Algorithm for

Optimal Length-Limited Huffman Codes"

[2] Andrew Turpin, Alistair Moffat "Practical Length-Limited Codes

for Large Alphabets"

In [1] an optimal O(nL) time algorithm is presented where n=symbol

count and L=maximum code word length. This algorithm is further

optimized in [2] in terms of memory demand / speed.

The problem "OLBPC" (optimal length bounded prefix code) can be

reduced to CCP (Coin Collector's Problem) which is in turn similar

to the knapsack problem but can be solved much faster.

The optimized algorithm presented in [2] is nearly as fast as an

optimized Huffman algorithm (still O(nL) time complexity, but much

faster than the first version of Larmore and Hirschberg).

This optimized algorithm looks surprisingly simple, though you won't

understand why and how it works by looking at the pseudo code instead

of studying these papers and their proofs.
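For reference, the reduction to the Coin Collector's Problem fits in a few lines (a compact sketch, not the memory-optimized version from [2]): at each of L levels, pair up the cheapest items into packages and merge them with a fresh copy of the leaves; after L levels, keep the 2n-2 cheapest items, and each symbol's code length is the number of kept items that contain it.

```python
def package_merge(weights, L):
    """Optimal length-limited code lengths; requires len(weights) <= 2**L."""
    n = len(weights)
    assert n <= 1 << L
    leaves = sorted((w, (i,)) for i, w in enumerate(weights))
    level = []                  # item: (total weight, tuple of leaf indices)
    for _ in range(L):
        # package adjacent pairs, dropping an unpaired most-expensive item
        packages = [(a[0] + b[0], a[1] + b[1])
                    for a, b in zip(level[::2], level[1::2])]
        level = sorted(leaves + packages)
    lengths = [0] * n
    for _, item in level[:2 * n - 2]:   # keep the 2n-2 cheapest items
        for i in item:
            lengths[i] += 1             # each containing item adds one bit
    return lengths
```

On the example from the start of the thread this yields the 44-bit optimal lengths for L = 3, and the unconstrained Huffman lengths for L = 4.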

HTH,

Sebastian

Oops... sorry, I didn't see this message.

Yeah, the package-merge algorithm is the way to go. :)

You might be interested in the optimization of Turpin and Moffat as

well. It's a very fast version of the package-merge algo.

Ghis!

Sebastian

"Kenneth Almquist" < XXXX@XXXXX.COM > wrote in message news:WKr6d.3195$OX.2510@trndny07...

[snip]

In the context of this question, several kinds of criteria for building Huffman trees can be considered.

Here are some of them:

1) Minimum cost (weighted external path length of the tree) - the criterion of the canonical Huffman algorithm;

2) Limit on the height of the tree;

3) Limit on the sum of codeword sizes.

Minimum cost is a Data Transmission criterion;

limits on height and sum are Data Storage criteria.

These criteria can be combined into a triple-criterion.

Here is a table of possible triple-criteria for Huffman codes.

Table 1. Huffman algorithm triple-criteria

----------------------------------------------------------------------------------

| Criterion : Triple-criterion : Note |

| : priorities P[i] : |

| :--------------------------------------: |

| : P[1] : P[2] : P[3] : |

|--------------------------------------------------------------------------------|

| TC-0 : Cost : No : No : Canonical Huffman algorithm |

| TC-1 : Cost : Height : Code-Sizes : Canonical Huffman algorithm |

| TC-2 : Cost : Code-Sizes : Height : Canonical Huffman algorithm |

| TC-3 : Height : Cost : Code-Sizes : Limited Length |

| TC-4 : Height : Code-Sizes : Cost : Limited Length & Code-Sizes |

| TC-5 : Code-Sizes : Cost : Height : Limited Code-Sizes Code |

| TC-6 : Code-Sizes : Height : Cost : Limited Code-Sizes & Length |

----------------------------------------------------------------------------------

Note. 'Cost' means 'Huffman tree cost'

'Code-Sizes' means 'Sum of codeword sizes'

For instance, TC-3 means: while building a Huffman tree,

one should first of all limit the height of the tree,

after that minimize the cost of the tree,

after that minimize the sum of codeword sizes.

Of course, the Huffman trees for different triple-criteria are usually quite different.

How does one build Huffman trees for the different triple-criteria?

The n-ary Huffman Template Algorithm was used to do that.

That algorithm can be seen at

* http://alexvn.freeservers.com/s1/huffman_template_algorithm.html

* http://sourceforge.net/projects/huffman-ta/

The Huffman Template Algorithm is not aimed at _efficiently_ generating Huffman trees.

It is aimed

* to use the _canonical_ Huffman algorithm with non-numerical weights as well

and

* to produce _comprehensive_ output.

To build Huffman trees for the different triple-criteria, an extended weight was used.

The extended weight is a pair <actual weight, delta weight>.

The actual weight is the ordinary weight (of a symbol); the delta weight is generated during a loop.

Here is a test set of symbols and their (actual) weights.

A 1

B 1

C 1

D 2

E 3

F 7

G 10

H 13

I 15

J 17

K 26

L 35

Table 2. Several Huffman codes for the set above.

-----------------------------------------------------------------

| Method : Cri- : Tree : Tree cost : Code- : Note |

| : te- : Height :----------------: Sizes : |

| : rion : : Actual : Delta : : |

|---------------

[snip]


XXXX@XXXXX.COM (Kenneth Almquist) wrote in

Actually this is not true. There are several ways to store the tree

if one must. The length is not the best way, not even close. As a simple

example, suppose you wish to store the tree for a file that has the

basic 256 symbols, where roughly half the symbols are used and only three

lengths are in use, say 10 bits, 8 bits, and 4 bits. You need only store

00 for not in use, 01 for the shortest length, 10 for the next length, and 11 for

the longest length. This would carry the exact same information but would

require less space than the actual lengths; there would be only one way

to construct the tree, so it would be a waste to use the actual lengths.

You could also do it as a fixed-length permutation table, or a variable-

length permutation table if one wanted, which would save more space. You

could even do it completely bijectively to the file space, though that would

be harder to code, and if it were needed I suspect an adaptive approach would be

better most of the time.

As computers get faster, and since memory only seems to get cheaper,

this should seldom be a concern.
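The 2-bit scheme described above can be sketched as follows (hypothetical helper; assumes every code length is one of the listed classes and the symbol count is a multiple of 4):

```python
def pack_length_classes(lengths, classes=(0, 4, 8, 10)):
    """Pack each symbol's code length as a 2-bit class index, 4 symbols
    per byte: 00 = unused, 01/10/11 = the three lengths in use."""
    idx = {ln: i for i, ln in enumerate(classes)}
    out = bytearray()
    for group in range(0, len(lengths), 4):
        b = 0
        for ln in lengths[group:group + 4]:
            b = (b << 2) | idx[ln]    # pack four 2-bit classes per byte
        out.append(b)
    return bytes(out)
```

For 256 symbols this is 64 bytes, against 128 bytes for storing the raw lengths at 4 bits each.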

David A. Scott


Thanks. The search results are very interesting.

Are there any available program implementations of building length-limited Huffman codes?

--

Alex Vinokur

email: alex DOT vinokur AT gmail DOT com

http://www.yqcomputer.com/

http://www.yqcomputer.com/

[snip]

Do those algorithms (Larmore and Hirschberg's; Turpin and Moffat's) produce optimal results?

--

Alex Vinokur


On Mon, 8 Nov 2004 15:55:07 +0200, "Alex Vinokur"

Although we know that standard Huffman coding is optimal, the

strict optimality of achieving the Shannon entropy holds, in general,

if and only if the length of the source sequence is infinite.

Thus, if you describe your term "optimal results" in more detail, it

will probably be easier for him to explain his statement again.

- James ( XXXX@XXXXX.COM )


Given: a set of symbols and their costs; the size of the set equals 'n'.

The size of the tree is the number of its leaf nodes.

Definition 1. A binary tree T of size 'n' is optimal if T has minimum weighted path length over all possible trees of size 'n' with

the same set of leaf nodes.

The canonical Huffman algorithm builds an optimal binary tree.

Definition 2. A length-limited binary tree T of size 'n' and maximum height 'm' is optimal if T has minimum weighted path length

over all possible trees of size 'n' and maximum height 'm' with the same set of leaf nodes.

The question is: do Larmore and Hirschberg's algorithm and Turpin and Moffat's algorithm produce optimal binary trees (meaning

optimality according to Definition 2)?

--

Alex Vinokur


Yes. These algorithms' solutions of code word lengths l=(l_1, l_2,

..., l_n) minimize the weighted code word length

\sum_{i=1}^{n} w_i l_i // w_i being positive weights

while satisfying

1) \sum_i 2^{-l_i} <= 1 // you can build a prefix free code

2) \forall i l_i <= L // code word lengths are bounded

Sometimes there is more than one optimal solution (with the same

weighted code word length), but there is never a better solution. So

it's an optimal solution.
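Applied to the example from the start of the thread (weights 8, 7, 4, 1, 1 and L = 3), both conditions are easy to check directly:

```python
weights = [8, 7, 4, 1, 1]
lengths = [2, 2, 2, 3, 3]   # the optimal length-limited solution
L = 3
assert sum(2.0 ** -l for l in lengths) <= 1.0   # 1) Kraft: a prefix code exists
assert all(l <= L for l in lengths)             # 2) code word lengths are bounded
cost = sum(w * l for w, l in zip(weights, lengths))
print(cost)   # weighted code word length -> 44
```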

Ghis!

Sebastian
