Online at http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/ with hyperlinks to other resources.
(from first principles)
An experiment with two outcomes (states), e.g. H (head) or T (tail), is called a Bernoulli trial.
n! k n-k P(X=k) = --------- p (1-p) k! (n-k)!Here we estimate p.
Given the result of a sequence of trails,
Assume a uniform prior on h.
Transmit h costs log(n+1)
.
n! --------- h! (n-h)!all are equally likely. Thus it takes
log(n! / (h! (n-h)!))
log((n+1)! / (h! (n-h)!))
Keep counters of #H and #T transmitted so far. Initialise counters to 1 (cannot be 0, but we see why 1 elsewhere)
e.g. H T T H T T T H T T #H : 1 2 2 2 3 3 3 3 4 4 #T : 1 1 2 3 3 4 5 6 6 7 sum: 2 3 4 5 6 7 8 9 10 11 # 1 1 2 2 3 4 5 3 6 7 P= - = - - - - - - - - -- -- sum 2 3 4 5 6 7 8 9 10 11 P = h!.(n-h)!/(n+1)! msgLen = log( (n+1)! / (h! (n-h)!))
-h.log(P'(H))-t.log(P'(T))
r = h / n
Let r(H) = P'(H)+a(H) etc. where a(H)+a(T)=0.
There is a penalty, D, to pay for using a code based on P'(H) rather than on r. But on average we win because of the saving in stating the estimate.
D= -n{ SUM (P'(m) + a(m))log P'(m) m:HT -SUM (P'(m) + a(m))log(P'(m)+a(m))} m:HT
= n.SUM (P'(m)+a(m))log(1 + a(m)/P'(m)) m:HTassume a(m)/P'(m) is small, expand log(1+a(m)/P'(m)) to 2nd term using log(1+x)=1.x-x^{2}/2+x^{3}/3-...
a(m) a(m)^{2} ~ n.SUM (P'(m)+a(m)).{----- - ----- } m:HT P'(m) 2P'(m)^{2}multiply out...
a(m)^{2} a(m)^{2} a(m)^{3} n.SUM {a(m)- ------- + ----- - --------} m:HT 2 P'(m) P'(m) 2 P'(m)^{2}Simplify, drop cubic term. NB.
SUM a(m) = 0
.n 2 D ~ -.SUM { a(m) /P'(m) } 2 m:HT
The extra cost in part 2 from using P'(H) is
D = (n/2).{a(H)^{2}/P'(H) + a(T)^{2}/P'(T)} = (n/2). a(H)^{2} / (P'(H).(1-P'(H)))because a(T) = -a(H) Note, D is symmetric, and is larger for P'(H) nearer to 0 or 1.
We are going to state P'(H) to some optimal, finite accuracy,
±x/2, i.e. over the range
[P'(H)-x/2, P'(H)+x/2]
.
The average value of y^{2}=a(H)^{2}
over the range is the integral:
1 - . x |
|x/2 | | -x/2| |
y^{2} dy |
(1/x) . { y^{3} / 3}_{[-x/2,+x/2]} = (1/x) . { 2 x^{3} / 3 / 8 } = x^{2} / 12
We can choose x to optimise the total message length, part 1 plus part 2.
The uniform prior implies that the cost of stating the estimate is
-log(x)
.
-log(x)+E(D)
, i.e.
x^{2} - log(x) + n.------------------- --see #09 (24 P'(H) (1-P'(H))differentiate w.r.t. x and set to zero...
-1/x + n.x/(12 P'(H) (1-P'(H))) = 0 x^{2} = 12 . P'(H) . (1-P'(H)) / nsubstitute into
-log(x)+E(D)
, the min' average increase is
1/2 log(n/12) -1/2 {log(P'(H)) - log(1-P'(H))} +n.12.P'(H).(1-P'(H))/n /(24.P'(H).(1-P'(H))) = 1/2{log(n/12) -log(P'(H)) -log(1-P'(H)) +1}after some cancellation.
1/2 {log(n/12) + 1} - SUM_{[m:HT]}(n.r(m) + 1/2).log(r(m))equivalently
1/2 {log(n/12) + 1 - SUM_{[m:HT]}log(r(m))} - h.log(h/n) - t.log(t/n)Note, the last two terms are the msgLen of D under an "optimal" code.
log( (n+1)! /( h! t! ) ) = log(n+1) + log( n! / (h! t!) ) = log(n+1) +n log n -n +1/2 log n +1/2 log(2pi) -h log h +h -1/2 log h -1/2 log(2pi) -t log t +t -1/2 log t -1/2 log(2pi)by Stirling's approximation
= log(n+1) -h.log(h/n) -t.log(t/n) + 1/2{log n - log h - log t - log(2 pi)} = 1/2{log((n+1)^{2}/n) -log(h/n)-log(t/n)-log(2pi)} - h.log(h/n) -t.log(t/n)[recall
r(H) = h/n, r(T) = t/n
]= method_3 - 1/2{log(n/12) + 1} + 1/2{log((n+1)^{2}/n) -log(2pi)} = method_3 + 1/2{2.log(1+1/n) +log(12) -1 -log(2pi)} = method_3 - 1/2 log(pi.e/6) -- for large n = method_3 - 0.176 nits
The message lengths of methods 1 and 2 are equal, and that of method 3 is a very little more.
The small extra cost of method 3 has been explained as its being the only method to state an estimate of p, i.e. give an opinion.
Before (and sometimes after) W+B 1968, different expressions for "the" information content of a discrete sequence were used with very different results - clearly unsatisfactory.
A general form for a two-part message was derived later and the above turns out to be a special case of it.