Information and the multinomial probability distribution

How much information is there in a discrete sequence, e.g.,
 
. . . A C G A C T A C G T A C . . .
 
DNA = {A, C, G, T},  θ = <pr(A), pr(C), pr(G)>,  pr(T) = 1 - pr(A) - pr(C) - pr(G)
 
or, e.g.,
 
H H T H T H H T H ?
 
coin={H,T},   θ = pr(H) = 1 - pr(T)
. . . is an important question, for example,
 
is the red segment significantly more CG-rich than the rest?
 
A T A G C G G C G T C G T A G T A A A C T C C G G C T A C C A C T G G A G T G C T A C A A T G T C C C A A C G C T G C A
 
i.e., two (or three?) distributions v. one.
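As a rough illustration of this comparison, here is a minimal Python sketch (the cut position and the simple maximum-likelihood fits are assumptions for illustration only; a full MML comparison would also charge bits for stating each distribution and the cut point):

    from collections import Counter
    from math import log2

    def fit_bits(seq):
        """Bits to code seq under a multinomial fitted to seq itself
        (ignoring the cost of stating that fitted distribution)."""
        counts = Counter(seq)
        n = len(seq)
        return -sum(c * log2(c / n) for c in counts.values())

    dna = "ATAGCGGCGTCGTAGTAAACTCCGGCTACCACTGGAGTGCTACAATGTCCCAACGCTGCA"
    cut = 12  # hypothetical boundary, for illustration only
    one_model = fit_bits(dna)
    two_models = fit_bits(dna[:cut]) + fit_bits(dna[cut:])
    print(one_model, two_models)  # two models fit at least as well,
                                  # but cost extra bits to state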
 
 
 
 
There are at least three "obvious" ways to transmit multinomial data . . .
1. Combinatorial Code,
 
e.g., for binary data
 
n trials,   0 ≤ #H ≤ n,   n = #H + #T
 
n+1 possible values for #H   (all equally likely under uniform prior)
 
requires log( n + 1 ) bits to specify #H, and then
 
log( nC#H ) = log( n! / (#H! #T!) ) bits to specify the sequence | #H;  so the total
 
code length = log( (n+1)! / (#H! #T!) ) bits
(see log x!)
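For example, a minimal sketch of code 1's length on the coin data (Python chosen just for illustration; math.comb is the binomial coefficient nC#H):

    from math import comb, log2

    def combinatorial_bits(seq):
        """Code 1: log2(n+1) bits for #H, then log2(nC#H) bits
        for the sequence given #H."""
        n = len(seq)
        h = seq.count("H")
        return log2(n + 1) + log2(comb(n, h))

    print(combinatorial_bits("HHTHTHHTH"))  # log2(10!/(6! 3!)) ≈ 9.71 bits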
2. Adaptive Code
 
e.g.,
    trial          1    2    3    4    5    6    7    8    9     product
    data           H    H    T    H    T    H    H    T    H
    H counter      1    2    3    3    4    4    5    6    6
    T counter      1    1    1    2    2    3    3    3    4
    ∑ counters     2    3    4    5    6    7    8    9    10
    running est   1/2  2/3  1/4  3/5  2/6  4/7  5/8  3/9  6/10   #H! #T! / (n+1)!

(the counters shown are their values before each trial; the running estimate is the probability given to the symbol actually seen, and the product of the running estimates is #H! #T! / (n+1)!)
counters initialised to 1 (cannot be 0)
 
code length = log( (n+1)! / (#H! #T!) ) bits
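A minimal sketch of the same calculation (Python for illustration), checking the adaptive total against the closed form:

    from math import factorial, log2

    def adaptive_bits(seq, symbols="HT"):
        """Code 2: counters start at 1; each symbol costs
        -log2(counter/total), then its counter is incremented."""
        counts = dict.fromkeys(symbols, 1)
        bits = 0.0
        for x in seq:
            bits += -log2(counts[x] / sum(counts.values()))
            counts[x] += 1
        return bits

    seq = "HHTHTHHTH"
    n, h = len(seq), seq.count("H")
    print(adaptive_bits(seq))  # ≈ 9.71 bits
    print(log2(factorial(n + 1) // (factorial(h) * factorial(n - h))))  # same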
3. Estimate-based Code
 
(i)  state an estimate, est(θ), of θ, to optimum precision ±δ/2 (about -log( prior(est(θ)) · δ ) bits), and
 
(ii) state data | est(θ)
 
 
It turns out that
 
est_MML(θ) is given by pr(value_i) = (#value_i + 1/2) / (n + M/2), i = 1, ..., M, for M-state data, and
 
the "penalty", over codes 1 and 2, for stating (i) and (ii), that is for including an opinion, (i), is only a fraction of a bit per parameter.
Summary
 
Codes 1 and 2 give equal code lengths; code 3 is very slightly longer.
 
(Non-uniform prior: Pretend there were some previous observations.)
 
Demo: http://www.allisons.org/ll/MML/Discrete/2state/#demo
 
 
Sources
 
D. M. Boulton & C. S. Wallace, The information content of a multistate distribution, J. Theor. Biol., 23, pp.269-278, 1969 (also see [WB68]).
 
C. S. Wallace & P. R. Freeman, Estimation and inference by compact coding, J. of the Royal Statistical Society, Series B, 49(3), pp.240-265, 1987.
 
Also see Strict MML (SMML) and the Binomial.


© Lloyd Allison Faculty of Information Technology (Clayton), Monash University, Australia 3800.