[01] >>

# Clustering + a Markov Model

### © 2001 Lloyd Allison, Computer Science and Software Engineering, Monash University, Victoria, Australia 3800

This document is online at  users.monash.edu/~lloyd/Seminars/Mixture-MM/index.shtml   and contains hyper-links to more detailed notes on certain topics and to some of the references.

<< [02] >>

# The Problem: Clustering, (unsupervised) Classification, Mixture Modelling, Numerical Taxonomy, in a Time-Series.

Data: A sequence of multi-variate Observations.

• Algorithm based on [Snob] which works on independent data.
See C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal, 11, pp.185-195, 1968, and more recently the special issue of the Computer Journal 42(4) 1999

• Extended: Observation_k can depend on Observation_{k-1}

• i.e. Hidden Markov Model of order 1

<< [03] >>

# The Model

(a) The number of classes (N)

(b) The relative abundance of each class

(c) For each class
• Distribution parameters for each significant attribute  (e.g. mu, sigma)
• Relative abundance of next class conditional on current class

(d) For each Observation
• An assigned class (but see twist later)
• Attribute values given this class
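The parameter list (a)-(d) above can be sketched as a small data structure. This is a hypothetical sketch, not the talk's implementation; all names are illustrative, and Gaussian attributes are assumed as in the mu, sigma example:

```python
from dataclasses import dataclass

@dataclass
class ClassParams:
    mu: list          # (c) per-attribute means
    sigma: list       # (c) per-attribute standard deviations
    next_class: list  # (c) P(next class | this class), length N, sums to 1

@dataclass
class Model:
    n_classes: int    # (a) N
    abundance: list   # (b) relative abundance of each class, sums to 1
    classes: list     # (c) one ClassParams per class

# (d) per observation: an assigned class, and attribute values given that class.
m = Model(n_classes=2,
          abundance=[0.7, 0.3],
          classes=[ClassParams([0.0], [1.0], [0.8, 0.2]),
                   ClassParams([5.0], [2.0], [0.3, 0.7])])
```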

<< [04] >>

# Model Complexity

• Bayes (1763)
• P(H|D) = P(H) . P(D|H) / P(D)

• Fisher
• Shannon (c1948)
• msgLen(E) = -log2(P(E)) in optimal code, so  . . .
• msgLen(H&D) = msgLen(H) + msgLen(D|H)
= msgLen(D) + msgLen(H|D)
• msgLen(H) = model complexity
• msgLen(D|H) = data complexity | H
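The two factorisations of msgLen(H&D) can be checked numerically for any joint distribution; a toy sketch (the probabilities are illustrative numbers only):

```python
import math

def msg_len(p):
    """Length in bits of an optimal (Shannon) code word for an event of probability p."""
    return -math.log2(p)

# Toy numbers: two rival hypotheses and one data set D.
p_h = 0.25                            # P(H)
p_d_given_h = 0.1                     # P(D|H)
p_d = 0.25 * 0.1 + 0.75 * 0.4         # P(D), marginalising over both hypotheses
p_h_given_d = p_h * p_d_given_h / p_d # Bayes (1763)

# msgLen(H & D) = msgLen(H) + msgLen(D|H) = msgLen(D) + msgLen(H|D)
lhs = msg_len(p_h) + msg_len(p_d_given_h)
rhs = msg_len(p_d) + msg_len(p_h_given_d)
```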

<< [05] >>

# Model Transitions for 2 Classes

<< [06] >>

# Estimation (1)

• Given an optimal path of transitions through the model, we can use frequencies to estimate the transition probabilities etc.
(Stated to the optimum degree of accuracy.)

• The new estimates may change the optimal path through the model.

• Re-estimate, loop, . . . this converges.

• (Still have to consider splitting/merging classes.)
But this gives biased estimates of the model parameters!
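The frequency step can be written out directly; a minimal sketch of plain maximum-likelihood counting along one hard-assigned path (ignoring, for brevity, the precision question and the bias noted above):

```python
from collections import Counter

def estimate_transitions(path, n_classes):
    """Estimate P(next class | current class) by counting transitions along a path."""
    counts = Counter(zip(path, path[1:]))
    matrix = []
    for i in range(n_classes):
        row_total = sum(counts[(i, j)] for j in range(n_classes))
        if row_total == 0:
            matrix.append([1.0 / n_classes] * n_classes)  # class never left: uniform
        else:
            matrix.append([counts[(i, j)] / row_total for j in range(n_classes)])
    return matrix

# e.g. a path through 2 classes
t = estimate_transitions([0, 0, 1, 0, 0, 1], 2)
```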

<< [07] >>

# Estimation (2) - the twist

• The class memberships of the Observations are nuisance parameters if we are interested only in the best class description.

• Solution: Sum the probabilities of all paths through the model.

• Legitimate: different paths are mutually exclusive sub-hypotheses.


<< [08] >>

# Forward:

Consider all paths leading to   o_k in c_i:

F(o_1 in c_i) = P(c_i) · P(o_1 | c_i)

F(o_k in c_i) = SUM_{j=1..N} F(o_{k-1} in c_j) · P(c_i | c_j) · P(o_k | c_i),   1 < k <= K

P(D|H) = SUM_{i=1..N} F(o_K in c_i)
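The forward recurrence can be checked against brute-force enumeration of all N^K paths; a sketch with discrete emissions (the model numbers are illustrative, not from the talk):

```python
from itertools import product

# Toy 2-class model over 2 emission symbols (illustrative numbers only).
p_init  = [0.6, 0.4]                 # P(c_i)
p_trans = [[0.8, 0.2], [0.3, 0.7]]   # p_trans[j][i] = P(c_i | c_j)
p_emit  = [[0.9, 0.1], [0.2, 0.8]]   # p_emit[i][o]  = P(o | c_i)
obs = [0, 0, 1, 1]                   # o_1 .. o_K

def forward(obs):
    """F[k][i] = P(o_1..o_{k+1}, observation k+1 in class i)."""
    N = len(p_init)
    F = [[p_init[i] * p_emit[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = F[-1]
        F.append([sum(prev[j] * p_trans[j][i] for j in range(N)) * p_emit[i][o]
                  for i in range(N)])
    return F

F = forward(obs)
p_data = sum(F[-1])                  # P(D|H) = SUM_i F(o_K in c_i)

# Brute force: sum the probability of every one of the N^K class paths.
brute = 0.0
for path in product(range(2), repeat=len(obs)):
    p = p_init[path[0]] * p_emit[path[0]][obs[0]]
    for k in range(1, len(obs)):
        p *= p_trans[path[k-1]][path[k]] * p_emit[path[k]][obs[k]]
    brute += p
```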

<< [09] >>

# Backwards

Consider all paths leading from   o_k in c_i:

B(o_K in c_i) = 1

B(o_k in c_i) = SUM_{j=1..N} P(c_j | c_i) · P(o_{k+1} | c_j) · B(o_{k+1} in c_j),   1 <= k < K

Now
P(D & o_k in c_i | H) = F(o_k in c_i) · B(o_k in c_i)
and
P(D|H) = SUM_{i=1..N} P(D & o_k in c_i | H)
for any k, 1 <= k <= K
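A sketch of the backward pass, checking that SUM_i F·B gives the same P(D|H) at every position k, and forming the fractional memberships F·B / P(D|H) (toy model, illustrative numbers only):

```python
# Toy 2-class model over 2 emission symbols (illustrative numbers only).
p_init  = [0.6, 0.4]
p_trans = [[0.8, 0.2], [0.3, 0.7]]   # p_trans[i][j] = P(c_j | c_i)
p_emit  = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 0, 1, 1]
N, K = len(p_init), len(obs)

# Forward: F[k][i] = P(o_1..o_{k+1}, observation k+1 in class i)
F = [[p_init[i] * p_emit[i][obs[0]] for i in range(N)]]
for k in range(1, K):
    F.append([sum(F[k-1][j] * p_trans[j][i] for j in range(N)) * p_emit[i][obs[k]]
              for i in range(N)])

# Backward: B[k][i] = P(o_{k+2}..o_K | observation k+1 in class i), B at K is all 1s
B = [[1.0] * N for _ in range(K)]
for k in range(K - 2, -1, -1):
    B[k] = [sum(p_trans[i][j] * p_emit[j][obs[k + 1]] * B[k + 1][j] for j in range(N))
            for i in range(N)]

# P(D|H) is the same whichever position k we sum F·B over.
totals = [sum(F[k][i] * B[k][i] for i in range(N)) for k in range(K)]

# Fractional class memberships: P(o_k in c_i | D, H) = F·B / P(D|H)
post = [[F[k][i] * B[k][i] / totals[k] for i in range(N)] for k in range(K)]
```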

<< [10] >>

. . . dividing by P(D|H) gives the probability that o_k is in c_i, i.e. we can deal with fractional membership of classes.
This results in unbiased estimates of the class parameters and transition probabilities.

<< [11] >>

# How Many Classes?

Can split class c_i into c_i' and c_i'' in the ratio w : 1-w, e.g. w = 0.5. The new transition probabilities are initialised:

    curr \ next   . . .   c_i'              c_i''             c_j        . . .
    c_i'                  P(c_i|c_i)·w      P(c_i|c_i)·(1-w)  P(c_j|c_i)
    c_i''                 P(c_i|c_i)·w      P(c_i|c_i)·(1-w)  P(c_j|c_i)
    c_j                   P(c_i|c_j)·w      P(c_i|c_j)·(1-w)
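The split initialisation above can be written out directly; a sketch (the function name and index convention are illustrative; i is the class being split):

```python
def split_class(trans, i, w=0.5):
    """Split class i into i' and i'' in the ratio w : 1-w.
    trans[a][b] = P(next class b | current class a).  Class i's row is
    duplicated for i' and i''; every class's probability of moving *to*
    class i is shared out w : 1-w between i' and i''.
    Returns an (N+1) x (N+1) matrix with i' at index i and i'' appended last."""
    N = len(trans)
    new = []
    for a in range(N):
        row = list(trans[a])
        p_to_i = row[i]
        row[i] = p_to_i * w            # to c_i'
        row.append(p_to_i * (1 - w))   # to c_i''
        new.append(row)
    new.append(list(new[i]))           # c_i'' starts with the same row as c_i'
    return new

# e.g. split class 0 of a 2-class model
new = split_class([[0.8, 0.2], [0.3, 0.7]], i=0, w=0.5)
```

Each new row still sums to 1, so the result is a valid transition matrix to start re-estimation from.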

<< [12] >>

Can also merge two classes.

• Splitting or merging takes place if there is a nett reduction in message length.

• A good heuristic, and fast.

• Must converge.

• Could get stuck in a local optimum, but this is unlikely.

<< [13] >>
# Tests

e.g. 2 classes: varying the class self-transition probability, t, from 0.0 to 1.0

<< [14] >>
Varying the data-set size, with t = 0.8.
1C: Snob, 1 class;   2C: Snob, 2 classes;   M2C: Markov model, 2 classes.
M2C wins if log10(K) > 2.3.

<< [15] >>
e.g. protein dihedral angles:
 Left: 17 classes [big pic'] [paper] [Bioinformatics]

<< [16] >>

# Conclusion

• If there is sequential correlation, the algorithm can better infer class structure from less data because it models the correlation.

• Time complexity ~ length of sequence × (#classes)²

• T. Edgoose, L. Allison & D. L. Dowe. An MML Classification of Protein Sequences that knows about Angles and Sequences. Pacific Symposium on Biocomputing, PSB98, pp.585-596, 1998 [paper]

• T. Edgoose & L. Allison. Minimum Message Length Hidden Markov Modelling. Data Compression Conf. (DCC), pp.169-178, 1998 [paper]

• T. Edgoose & L. Allison. MML Markov classification of sequential data. Statistics and Computing 9, pp.269-278, 1999 [paper]
