## Pre-Transmitting Length versus EOF signals.

The whole question comes down to whether it is better to first transmit
N, the number of symbols, and then the data, or to include an EOF
signal in-band.

There are costs to doing both.

Imagine there is a probability model for the data itself,
p(s|past), available at any point. The data consists of the symbols
s_1, ..., s_N.

First case: transmitting N ahead of time.

The cost for the data is L_data = sum(i=1..N) -log p(s_i | past), plus
the cost to transmit N. The Elias delta code provides a code for the
natural integers with length L(N) <= log N + 2 log log (2 N).

L = L_data + log N + 2 log log (2N)

Second case: transmitting an EOF signal in-band.

Here, we will have to model the probability of an EOF signal.
A common choice is p(EOF) = 1/(i+1), with i the number of symbols
processed so far. Thus at each step the codelength for the actual symbol
is -log [ (1-p(EOF)) * p(s_i | past) ] (since an EOF did not occur), and then
at the end, one EOF costing -log [ p(EOF) ] = log (N+1).

This gives L = L_data + \sum_(i=1..N) ( -log( i/(i+1) ) ) + log (N+1)
= L_data + 2*log(N+1),
since the sum telescopes: \sum_(i=1..N) -log( i/(i+1) ) = -log( 1/(N+1) ) = log (N+1).
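A quick numerical check of that telescoping sum, as a Python sketch (the function name is mine, not from the thread):

```python
import math

def eof_overhead(N):
    """Total length overhead of the in-band EOF scheme, in bits,
    under the model p(EOF) = 1/(i+1) after i symbols."""
    # Each of the N symbol steps pays -log2(1 - p(EOF)) = -log2(i/(i+1)) bits,
    per_step = sum(-math.log2(i / (i + 1)) for i in range(1, N + 1))
    # plus the final EOF itself, -log2(1/(N+1)) = log2(N+1) bits.
    return per_step + math.log2(N + 1)

# The product of the i/(i+1) factors telescopes to 1/(N+1),
# so the total overhead is exactly 2*log2(N+1).
for N in (1, 10, 100, 1000):
    assert abs(eof_overhead(N) - 2 * math.log2(N + 1)) < 1e-9
```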

Conclusion: the length information costs

Lelias = log N + 2 log log (2N) for pre-transmitting the length.

Leof = 2 log(N+1) for an in-band EOF.

(all logarithms are base 2)

For N >= 37, the Elias code wins, providing a shorter codelength.

So, at least with this EOF model, there is an advantage to
pre-transmitting the length for streams longer than 37 symbols.
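The crossover can be verified numerically; a small Python sketch (the function names are mine, and for very small N the comparison flips back and forth, so the check only looks at N = 36 and up):

```python
import math

def L_elias_bound(N):
    # The continuous Elias delta bound used above: log2(N) + 2*log2(log2(2N)).
    return math.log2(N) + 2 * math.log2(math.log2(2 * N))

def L_eof(N):
    # In-band EOF overhead with p(EOF) = 1/(i+1): 2*log2(N+1).
    return 2 * math.log2(N + 1)

# At N = 36 the EOF scheme is still (barely) shorter; from N = 37 on,
# the Elias bound wins and the gap keeps growing.
assert L_elias_bound(36) > L_eof(36)
assert all(L_elias_bound(N) < L_eof(N) for N in range(37, 100_000))
```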

Note that you can achieve a length cost of

1 + min( Lelias, Leof )

by pre-transmitting one bit saying which scheme you will be using.

The difference between Lelias and Leof is small, being less than 6
bits for N=10000.

PS: That Elias codelength is an upper bound; the exact count is

L(elias) = 1 + floor(log(N)) + 2*floor(log(1 + floor(log(N)))),

which gives the concrete integer number of bits.
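This exact count is easy to compute with integer bit tricks; a hypothetical Python helper, spot-checked against the continuous bound plus one extra bit (the raw bound without the extra bit can dip below the exact count for some N):

```python
import math

def elias_delta_len(N):
    """Exact bit count of the Elias delta code for a positive integer N:
    1 + floor(lg N) + 2*floor(lg(1 + floor(lg N)))."""
    k = N.bit_length() - 1              # floor(log2 N)
    return 1 + k + 2 * ((k + 1).bit_length() - 1)

# Spot-check against the bound 1 + log2(N) + 2*log2(log2(2N)).
for N in (1, 2, 3, 37, 10_000):
    bound = 1 + math.log2(N) + 2 * math.log2(math.log2(2 * N))
    assert elias_delta_len(N) <= bound + 1e-9
```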

### Pre-Transmitting Length versus EOF signals.

"Dr Chaos" <XXXX@XXXXX.COM> wrote in message
news:XXXX@XXXXX.COM...

I'm not sure where you got this limit as I can't find any references with
this limit and it doesn't quite seem to agree with reality. The actual
length of an Elias delta code (which you noted later) is:

floor[lg x] + 2 floor[lg (1 + floor[lg x] ) ] + 1

And your formula is sometimes less than this. I think you need to add some
floor or ceilings to make it correct.


I certainly agree that for larger N the Elias code takes fewer bits than
encoding the EOF as a symbol. But one cannot say that this means it is
"better". You cannot say one is better than another without making some
assumption of what the probability distribution is for the inputs. If the
probability distribution function is heavily weighted to small values of N
then the EOF symbol wins.

The key thing is that you can only say that one code is better for some set
of probability distribution functions. You cannot say one code is
universally better than the other.

For example, let's compare the Elias delta code with a unary code
that encodes an integer N as N-1 zeroes followed by a 1, thus
L(unary) = N. L(unary) < L(elias) only for N <= 4. Is Elias better than
Unary? It depends on what the probability distribution is for the value of
N. If all values of N are equally likely, then yes it is.

But let's consider if the value of N we were encoding was the number of coin
flips we made until it came up heads. There p( N = x ) = 2^-x, and
p( N <= x ) = 1 - 2^-x. Here the unary code is the optimal code.
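This can be checked numerically; a Python sketch (the truncation at x = 200 is mine, and the tail it drops is negligible):

```python
def elias_delta_len(N):
    # Exact Elias delta bit count: 1 + floor(lg N) + 2*floor(lg(1 + floor(lg N))).
    k = N.bit_length() - 1
    return 1 + k + 2 * ((k + 1).bit_length() - 1)

# Under the coin-flip model p(N = x) = 2^-x, unary's expected length is
# E[N] = 2 bits, and it is optimal because -log2 p(x) = x matches unary exactly.
E_unary = sum(2.0 ** -x * x for x in range(1, 200))
E_elias = sum(2.0 ** -x * elias_delta_len(x) for x in range(1, 200))
assert abs(E_unary - 2.0) < 1e-9
assert E_unary < E_elias   # unary beats Elias delta in expectation here
```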

### Pre-Transmitting Length versus EOF signals.

Add one bit to the formula.

L(N) <= 1+ log N + 2 log log (2 N ).

For certain needs I wanted a formula continuous in N.

Right. For N less than 37, the EOF symbol wins.

Of course.

I would guess that in practice, the sizes of streams which need to be
compressed follow a mysterious power law distribution, like so many in
Nature.

In fact, if it were P(N) ~ 1/N then you can't even normalize it to
make a proper probability distribution, and thus it can't serve as
the optimal prior for coding integers.