Order-preserving hashes?

Post by Ben Rudiak » Wed, 26 Jul 2006 04:00:58

[added comp.compression]

This is actually a data compression question. Suppose you take a large
statistical sample of input strings and group them into equivalence classes
by hash value. In order to minimize the number of collisions, each
equivalence class should have about the same number of elements. This is the
same property a good data compression algorithm has if you group its outputs
by a fixed-length substring of the compressed data. Many compression algorithms
violate your ordering constraint, but order-preserving compression is not a
hard problem. The best way to construct your hash function (ignoring
performance issues) is to take the first N bits of the output of a good
order-preserving compressor applied to the input string.
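A minimal sketch of that construction, assuming a uniform static order-0 model over printable ASCII (the model, alphabet, and `op_hash` name are illustrative, not from the post). An arithmetic coder maps each string to a subinterval of [0,1), and because the symbol subintervals are laid out in alphabetical order, the intervals, and hence the truncated lower bounds, compare the same way the strings do:

```python
from fractions import Fraction

# Illustrative assumptions: printable-ASCII alphabet, uniform probabilities.
# A real hash would use measured symbol frequencies instead.
ALPHABET = [chr(c) for c in range(32, 127)]   # printable ASCII, in order
P = Fraction(1, len(ALPHABET))                # uniform symbol probability

def op_hash(s: str, nbits: int = 16) -> int:
    """Narrow [0,1) to the interval an order-0 arithmetic coder would
    assign to s, then return the first nbits of the lower bound."""
    low, width = Fraction(0), Fraction(1)
    for ch in s:
        i = ALPHABET.index(ch)
        low += width * i * P                  # enter the i-th subinterval
        width *= P
    return int(low * (1 << nbits))            # first nbits of the expansion

# Lexicographic order of inputs is reflected in the hash values
# (ties are possible where truncation discards the distinguishing bits):
assert op_hash("app") <= op_hash("apple") < op_hash("banana")
```

Exact rationals keep the sketch simple; a production coder would renormalize with fixed-precision integers instead.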

Note that it's unavoidable that your hash values will only depend on the
first few bytes of the input for most inputs. E.g. for English text you
can't do better than about 8:1 compression, which means that an N-bit hash
will typically depend on only the first 8N bits of the input.

If your input strings are something like ASCII text, and you want a
reasonably fast hash function, you could try a static order-0 Huffman model.
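One caveat, and a hedged sketch: the usual Huffman codeword assignment orders codewords by frequency, not by symbol, so preserving order additionally requires an alphabetic code (e.g. Hu-Tucker). The illustration below sidesteps that by building the code with a Shannon-Fano-style split over symbols kept in alphabetical order, which is order-preserving by construction; `build_code` and the toy frequencies are hypothetical, not from the post:

```python
# Hedged sketch: an order-preserving (alphabetic) prefix code built by
# Shannon-Fano-style splitting, with symbols kept in alphabetical order.
# This is a simpler stand-in for the alphabetic-Huffman idea in the post.
def build_code(symbols, freqs):
    """symbols: list in alphabetical order; freqs: symbol -> count."""
    code = {}
    def split(lo, hi, prefix):
        if hi - lo == 1:                      # single symbol: emit codeword
            code[symbols[lo]] = prefix or "0"
            return
        total = sum(freqs[s] for s in symbols[lo:hi])
        acc, best, best_k = 0, None, lo + 1
        for k in range(lo + 1, hi):           # split nearest to half weight
            acc += freqs[symbols[k - 1]]
            d = abs(total - 2 * acc)
            if best is None or d < best:
                best, best_k = d, k
        split(lo, best_k, prefix + "0")       # left half under 0...
        split(best_k, hi, prefix + "1")       # ...right half under 1
    split(0, len(symbols), "")
    return code

code = build_code(list("abcd"), {"a": 5, "b": 1, "c": 1, "d": 2})
encode = lambda s: "".join(code[c] for c in s)

# Encoded bit strings compare like the inputs; the hash is the first N bits.
assert encode("ab") < encode("ba") and encode("a") < encode("ab")
```

Because the codewords are prefix-free and assigned in symbol order, comparing encoded bit strings agrees with comparing the original strings, so truncating the encoding to N bits gives an order-preserving hash.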

Question for comp.compression people (since I'm curious): do any of the
algorithms mentioned at maximumcompression.com preserve lexicographic order?

-- Ben

Order-preserving hashes?

Post by Matt Mahoney » Thu, 27 Jul 2006 01:50:00

Most of the PAQ algorithms, when used without preprocessing transforms,
preserve lexicographic order, or in some versions, reverse
lexicographic order. FPAQ0 does also, using a simpler model.

-- Matt Mahoney