^up^ [01] >>

# Multistate and Multinomial Distributions

Online at   http://www.csse.monash.edu.au/~lloyd/Archive/2005-09-Mstate/index.shtml   with hyper-links to other resources.

[Recall] general message length form & Fisher:

Model complexity, -log(H), depends on precision parameter(s) are stated to.

F(T) = Ex det'( d2/d Ti Tj { -ln f(x|T)})     (the Fisher).

msgLen ~
```                       1
-ln h(T) - ln f(x|T) + - ln F(T)
2

1         1
- - ln 12 + -
2         2
```
T estimate a.k.a. theta,   x data,   Ex expectation,   h() prior on param(s),   f(x|T) likelihood.

<< [02] >>

# e.g. 3-States, M=3

• two parameters, T1 and T2 in [0,1]; i.e. T=<T1,T2>.

• also define T3, also in [0,1], where T3=1-T1-T2,

• observe ni of statei, i=1..3, where N=n1+n2+n3.

• The likelihood, LH = T1n1.T2n2.T3n3, so ...

<< [03] >>

<< [04] >>

Negative log likelihood

```-log LH =

-n1 log T1 - n2 log T2 - n3 log(1-T1-T2)
```

The Fisher information is the expected value of the determinant of the matrix of 2nd derivatives of the -log likelihood.

<< [05] >>

1st derivates

```d/d T1 {-log LH}  = -n1/T1 + n3/(1-T1-T2)

d/d T2 {-log LH}  = -n2/T2 + n3/(1-T1-T2)
```

2nd derivates

```d2/d T12 {-log LH} = n1/T12 + n3/(1-T1-T2)2

d2/d T22 {-log LH} = n1/T22 + n3/(1-T1-T2)2
```

<< [06] >>

off-diagonal entries

```  d2/d T1 d T2 {-log LH} = n3/(1-T1-T2)2

= d2/d T2 d T1 {-log LH}
```

symmetric, as it happens.

The expection of n1 over the data space is N.T1, similarly for n2 and n3, so the Fisher is ...

<< [07] >>

Fisher:

 ``` | | | | ``` ```N/T1+N/T3 N/T3 N/T3 N/T2+N/T3 ``` ```| | | | ```

<< [08] >>
 ``` N2 = -- T32 ``` ```| | | | ``` ```(1-T2)/T1 1 1 (1-T1)/T2 ``` ```| | | | ```

<< [09] >>
```   N2  (1-T2) (1-T1)
= ---.(------.------ - 1)
T32    T1     T2

= (N2/T32).((1-T1-T2)/(T1 T2)

= N2/(T1.T2.T3)
```

-- Fisher for 3-states --

Putting this back in the general form gives our 2-part message length

<< [10] >>
L.Allison

# M-States

[It can be shown] that for M-states,

i.e. M-1 parameters, and for probabilities

T1, T2, ..., TM-1, & TM = 1-T1-...-TM-1,

T = <T1,...,TM-1>, that

F(T) = NM-1/(T1.T2...TM).
Putting this back in the general form gives our 2-part message length.

<< [11] >>

# The MML estimator

Assuming a uniform prior on T, the only parts of the general form that depend on the estimate for T are shown in red:
```                       1
-ln h(T) - ln f(x|T) + - ln F(T)
2

1         1
- - ln 12 + -
2         2
```
differentiate these parts w.r.t. Ti   . . .

<< [12] >>

don't forget, TM = 1 - T1 - ... - TM-1

 ``` d msgLen --- =  d Ti ``` ``` ni - --   Ti ``` ``` nM + --   TM ``` ``` 1 - ---   2Ti ``` ``` 1 + --- = 0 2TM ```

 ``` ni + 1/2 -------- =   Ti ``` ``` nM + 1/2 --------- TM ``` ``` for i=1..m-1 ```
So
```Ti = (ni + 1/2) / (N + M/2),  i=1..m
```

<< [13]

# Summary

For an M-state multistate distribution:

• F(T) = NM-1/(T1.T2...TM).

• The MML estimator is Ti=(ni+1/2)/(N+M/2),   i=1..M.
NB.
The no-inference, adaptive-code uses frequencies+1, efficient code, conservative "estimator".
The MML estimator uses frequences+1/2, good code, good estimator.
The maximum likelihood uses raw frequencies, no code specified, too aggressive estimator.

## References

• C. S. Wallace & D. M. Boulton. An information measure for classification. Comp. Jrnl. V11 No2 pp185-194 Aug' 1968 [paper]
• C. S. Wallace & P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society series B. V49 No3 pp240-265 1987 [paper]

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.
Created with "vi (IRIX)",   charset=iso-8859-1