[01]

# 2-State and Binomial Distributions

Online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   with hyperlinks to other resources.

(from first principles)

An experiment with two outcomes (states), e.g. H (head) or T (tail), is called a Bernoulli trial.

• Consider `n` independent trials.
• A random variable, X, equals the number of times H occurs.
• X has a binomial distribution:
```
P(X=k) = n! / (k! (n-k)!) . p^k . (1-p)^(n-k)
```
Here we estimate p.
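As a sanity check, the distribution can be computed directly (a small Python sketch; the name `binomial_pmf` is illustrative, not from the original):

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = n!/(k!(n-k)!) . p^k . (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

# ten tosses of a fair coin
print(binomial_pmf(5, 10, 0.5))                            # 0.24609375
print(sum(binomial_pmf(k, 10, 0.5) for k in range(11)))   # probabilities sum to 1.0
```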

[02]

Given the result of a sequence of trials, e.g. H H T H T ..., with h=#H and t=#T=(n-h), there are three "obvious" ways to transmit the sequence:

1. Transmit h, the number of heads, then transmit the particular combination of h heads and (n-h) tails.

2. Use an adaptive code, based on the observed frequencies to the current point.

3. Transmit an estimate of p, and then transmit the sequence using a code based on the estimate.

Wallace & Boulton (1968, 1969) showed that these three methods give (nearly) equal message lengths.

[03]

# Method 1

1. Transmit h, 0 <= h <= n, and
2. transmit the particular combination of H's and T's.

Assume a uniform prior on h. Transmitting h costs `log(n+1)`.

Given h, the number of possible combinations is
```
n! / (h! (n-h)!)
```
and all are equally likely. Thus it takes `log(n! / (h! (n-h)!))` to transmit the data.
The total message length is therefore `log((n+1)! / (h! (n-h)!))`
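This message length is easy to evaluate numerically (a Python sketch; `log` is the natural log, so lengths are in nits, and the function names are illustrative):

```python
from math import lgamma, log

def log_fac(n):
    # log(n!) in nits, via the log-gamma function
    return lgamma(n + 1)

def method1_msglen(h, n):
    # log(n+1) to state h, plus log(n!/(h!(n-h)!)) for the combination
    return log(n + 1) + log_fac(n) - log_fac(h) - log_fac(n - h)

# h=3 heads in n=10 trials: log((n+1)!/(h!(n-h)!)) = log(1320)
print(method1_msglen(3, 10))
```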

[04]

# Method 2: Adaptive Code

Keep counters of #H and #T transmitted so far. Initialise the counters to 1 (they cannot be 0; the reason they should be 1 in particular is seen elsewhere).

```
e.g.   H    T    T    H    T    T    T    H    T    T
#H :   1    2    2    2    3    3    3    3    4    4
#T :   1    1    2    3    3    4    5    6    6    7
sum:   2    3    4    5    6    7    8    9   10   11

P  :  1/2  1/3  2/4  2/5  3/6  4/7  5/8  3/9  6/10 7/11

P = h! . (n-h)! / (n+1)!
msgLen = log( (n+1)! / (h! (n-h)!) )
```
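The running example can be replayed in code to confirm that the adaptive code's total length really equals `log((n+1)!/(h! t!))` (a Python sketch; names are illustrative):

```python
from math import lgamma, log

def adaptive_msglen(seq):
    # sum of -log P(symbol), with both counts initialised to 1
    counts = {'H': 1, 'T': 1}
    total = 0.0
    for s in seq:
        total += -log(counts[s] / (counts['H'] + counts['T']))
        counts[s] += 1
    return total

seq = "HTTHTTTHTT"                            # the example above: h=3, t=7, n=10
direct = lgamma(12) - lgamma(4) - lgamma(8)   # log((n+1)! / (h! t!))
print(adaptive_msglen(seq), direct)           # the two lengths agree
```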

[05]

# Method 3: Transmit Estimate of p=P(H)

Two-part message:
1. an estimate P'(H) of p, stated to some finite accuracy, and
2. the data given the estimate, costing `-h.log(P'(H)) - t.log(P'(T))`

[06]

r(H) = h/n,  r(T) = t/n

Let r(m) = P'(m) + a(m) for m in {H, T}, where a(H) + a(T) = 0.

There is a penalty, D, to pay for using a code based on P'(H) rather than on r. But on average we win because of the saving in stating the estimate.

```
D = -n.{ SUM[m:HT] (P'(m)+a(m)).log P'(m)
       - SUM[m:HT] (P'(m)+a(m)).log(P'(m)+a(m)) }
```

[07]
```
= n.SUM[m:HT] (P'(m)+a(m)).log(1 + a(m)/P'(m))
```
assume a(m)/P'(m) is small, and expand log(1+a(m)/P'(m)) to the 2nd-order term using log(1+x) = x - x^2/2 + x^3/3 - ...
```
~ n.SUM[m:HT] (P'(m)+a(m)).{ a(m)/P'(m) - a(m)^2/(2.P'(m)^2) }
```
multiply out...

[08]
...multiply out
```
n.SUM[m:HT] { a(m) - a(m)^2/(2.P'(m)) + a(m)^2/P'(m) - a(m)^3/(2.P'(m)^2) }
```
Simplify, drop cubic term. NB. `SUM a(m) = 0`.
```
D ~ (n/2).SUM[m:HT] { a(m)^2 / P'(m) }
```

[09]

The extra cost in part 2 from using P'(H) is

```
D = (n/2).{ a(H)^2/P'(H) + a(T)^2/P'(T) }

  = (n/2).a(H)^2 / (P'(H).(1-P'(H)))
```
because a(T) = -a(H)

Note, D is symmetric, and is larger for P'(H) nearer to 0 or 1.

[10]

We are going to state P'(H) to some optimal, finite accuracy, ±x/2, i.e. over the range `[P'(H)-x/2, P'(H)+x/2]`.

The average value of `y^2 = a(H)^2` over the range is the integral
```
(1/x) . INTEGRAL[-x/2, +x/2] y^2 dy

= (1/x) . { y^3 / 3 } evaluated over [-x/2, +x/2]

= (1/x) . 2 . (x/2)^3 / 3

= x^2 / 12
```
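The x^2/12 average can be confirmed numerically (a Python sketch using a midpoint Riemann sum; the name `mean_square` is illustrative):

```python
def mean_square(x, steps=100_000):
    # average of y^2 over [-x/2, +x/2] via a midpoint Riemann sum
    dy = x / steps
    return sum((-x/2 + (i + 0.5) * dy) ** 2 for i in range(steps)) * dy / x

# compare the numerical average with x^2 / 12
print(mean_square(0.3), 0.3**2 / 12)
```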

[11]

We can choose x to optimise the total message length, part 1 plus part 2.

The uniform prior implies that the cost of stating the estimate is `-log(x)`.

Minimise expected `-log(x)+E(D)`, i.e.
```
- log(x) + n.x^2 / (24.P'(H).(1-P'(H)))        -- see [09], [10]
```
differentiate w.r.t. x and set to zero...

[12]
```
-1/x + n.x / (12.P'(H).(1-P'(H))) = 0

x^2 = 12.P'(H).(1-P'(H)) / n
```
substitute into `-log(x)+E(D)`; the minimum average increase is
```
  1/2 log(n/12)
- 1/2 {log(P'(H)) + log(1-P'(H))}
+ n.(12.P'(H).(1-P'(H))/n) / (24.P'(H).(1-P'(H)))

= 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}
```
after some cancellation.
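These two results can be checked against each other in code: plugging the optimal x back into `-log(x)+E(D)` should reproduce the expression above (a Python sketch; function names are illustrative):

```python
from math import log, sqrt

def optimal_x(p, n):
    # x^2 = 12 p (1-p) / n
    return sqrt(12 * p * (1 - p) / n)

def excess(p, n):
    # 1/2 { log(n/12) - log p - log(1-p) + 1 }
    return 0.5 * (log(n / 12) - log(p) - log(1 - p) + 1)

p, n = 0.3, 100
x = optimal_x(p, n)
print(-log(x) + n * x * x / (24 * p * (1 - p)), excess(p, n))  # equal
```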

[13]
Strictly, the code must be designed before the data is available, i.e. before r(H) could be calculated.
But, if x is small, we can do the averaging around r(H) as an approximation. The message length under method 3 is
```1/2 {log(n/12) + 1}
- SUM[m:HT](n.r(m) + 1/2).log(r(m))
```
equivalently
```1/2 {log(n/12) + 1 - SUM[m:HT]log(r(m))}
- h.log(h/n) - t.log(t/n)
```
Note, the last two terms are the msgLen of the data under an "optimal" code.
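Both forms of the method-3 message length can be coded and checked for equality (a Python sketch; names are illustrative):

```python
from math import log

def method3_msglen(h, n):
    # 1/2{log(n/12) + 1} - SUM[m:HT] (n.r(m) + 1/2).log(r(m)), r(H)=h/n
    t = n - h
    rH, rT = h / n, t / n
    return (0.5 * (log(n / 12) + 1)
            - (h + 0.5) * log(rH) - (t + 0.5) * log(rT))

def method3_alt(h, n):
    # equivalent form: 1/2{log(n/12)+1 - SUM log r(m)} - h.log(h/n) - t.log(t/n)
    t = n - h
    rH, rT = h / n, t / n
    return (0.5 * (log(n / 12) + 1 - log(rH) - log(rT))
            - h * log(rH) - t * log(rT))

print(method3_msglen(3, 10), method3_alt(3, 10))  # identical
```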

[14]

# Go back to methods 1 and 2

```  log( (n+1)! /( h! t! ) )

= log(n+1) + log( n! / (h! t!) )

= log(n+1)
+n log n -n +1/2 log n +1/2 log(2pi)
-h log h +h -1/2 log h -1/2 log(2pi)
-t log t +t -1/2 log t -1/2 log(2pi)
```
by Stirling's approximation

[15]
```
=  log(n+1) - h.log(h/n) - t.log(t/n)
 + 1/2 {log n - log h - log t - log(2.pi)}

= 1/2 {log((n+1)^2/n) - log(h/n) - log(t/n) - log(2.pi)}
 - h.log(h/n) - t.log(t/n)
```
[recall   `r(H) = h/n,  r(T) = t/n`]

[16]
```
= method_3 - 1/2 {log(n/12) + 1}
          + 1/2 {log((n+1)^2/n) - log(2.pi)}

= method_3 + 1/2 {2.log(1+1/n) + log(12) - 1 - log(2.pi)}

= method_3 - 1/2 log(pi.e/6)        -- for large n

= method_3 - 0.176 nits
```
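A numerical check of the comparison (a Python sketch; names are illustrative): for large n the gap between methods 3 and 1 approaches 1/2 log(pi.e/6).

```python
from math import lgamma, log, pi, e

def method1_msglen(h, n):
    # log((n+1)! / (h! (n-h)!))
    return lgamma(n + 2) - lgamma(h + 1) - lgamma(n - h + 1)

def method3_msglen(h, n):
    t = n - h
    return (0.5 * (log(n / 12) + 1)
            - (h + 0.5) * log(h / n) - (t + 0.5) * log(t / n))

n, h = 100_000, 30_000
diff = method3_msglen(h, n) - method1_msglen(h, n)
print(diff, 0.5 * log(pi * e / 6))   # both close to 0.176 nits
```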

[17]

# Summary

The message lengths of methods 1 and 2 are equal, and that of method 3 is very slightly greater.

The small extra cost of method 3 can be explained by its being the only method to state an estimate of p, i.e. to give an opinion.

Before (and sometimes after) W+B 1968, different expressions for "the" information content of a discrete sequence were used with very different results - clearly unsatisfactory.

A general form for a two-part message was derived later and the above turns out to be a special case of it.

• C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal 11(2) pp185-194, Aug 1968.
• D. M. Boulton & C. S. Wallace. The Information Content of a Multistate Distribution. J. Theor. Biol. 23 pp269-278, 1969.

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.