
# Mixture Modelling

#### On work by Chris Wallace, David Boulton, Rohan Baxter, David Dowe, Jon Oliver and others,   Lloyd Allison 24/10/2001.

This document is online at   http://www.csse.monash.edu.au/~lloyd/Archive/2005-06-Mixture/index.shtml   and contains hyper-links to other resources.

Unsupervised classification = clustering = mixture modelling = . . .

Given multivariate data, i.e. S things, each having D attributes, infer:

• The number of classes (clusters),

• the parameters of each class, and

• for each thing, the class that it (probably) belongs to, but see later!   [more]
For a simple univariate example . . .

[Interactive Java applet: plots the density of a two-component Normal mixture, p·N(mu1, sigma1) + (1-p)·N(mu2, sigma2), with adjustable parameters and vertical range.]


Use the form above to experiment. Try

• separation greater than, equal to, and less than, 2 standard deviations,

• equal & unequal proportions of the two Normals,

• equal means but differing variances,

• etc.
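The density the applet plots can be sketched directly; the parameter names mu1, sigma1, p, mu2, sigma2 follow the applet's form fields, and the values below are only illustrative:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, mu1, sigma1, p, mu2, sigma2):
    """Two-component mixture: p * N(mu1, sigma1) + (1 - p) * N(mu2, sigma2)."""
    return p * normal_pdf(x, mu1, sigma1) + (1 - p) * normal_pdf(x, mu2, sigma2)

# separation of exactly 2 standard deviations: the mixture is still unimodal
density_at_midpoint = mixture_pdf(0.0, -1.0, 1.0, 0.5, 1.0, 1.0)
```

Evaluating this over a grid of x values and varying the separation, proportions and variances reproduces the experiments suggested above.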


Can have mixtures of

• any appropriate distribution - [Normal], [Poisson], [Multi-state], etc.,

• any number of classes, i.e. component distributions,

• heterogeneous distributions (over the same data space), e.g. Poisson + Geometric

But the more general the model, the more difficult the search problem.
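As a sketch of the heterogeneous case, a mixture over the non-negative integers combining a Poisson and a Geometric component; the weight and parameters here are arbitrary illustrations:

```python
import math

def poisson_pmf(k, lam):
    """P(k) = e^-lam * lam^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def geometric_pmf(k, q):
    """P(k) = (1 - q)^k * q, for k = 0, 1, 2, ..."""
    return (1 - q) ** k * q

def hetero_mixture_pmf(k, w, lam, q):
    """w * Poisson(lam) + (1 - w) * Geometric(q), both over the same data space."""
    return w * poisson_pmf(k, lam) + (1 - w) * geometric_pmf(k, q)

# probabilities still sum to (essentially) 1 over the shared data space
total = sum(hetero_mixture_pmf(k, 0.7, 3.0, 0.25) for k in range(100))
```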


# Bi-variate example

The two attributes, phi and psi, are the two free dihedral-angles per amino-acid on a protein backbone. The peaks represent the centres of classes found.


# Encoding the Hypothesis

• Number of classes using some suitable [Integer] distribution,

• the relative abundances of each class; this is a [Multi-state] distribution, [1 . . #classes],

• the class parameters for each class - e.g. mu, sigma for a [Normal] or whatever.

• and then the data . . .
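The first three items can be sketched as a toy hypothesis cost in bits, under assumed priors: a geometric prior on the number of classes, and a fixed, made-up precision per stated parameter (real MML derives the optimal parameter precision from the Fisher information):

```python
import math

BITS_PER_PARAM = 8  # made-up fixed precision; MML derives this from the Fisher information

def msglen_num_classes(k, p_stop=0.5):
    """Code length in bits for stating k >= 1 classes, under a geometric prior."""
    return -math.log2(p_stop * (1 - p_stop) ** (k - 1))

def msglen_proportions(weights):
    """Cost of the relative abundances; they sum to 1, so k - 1 free values."""
    return BITS_PER_PARAM * (len(weights) - 1)

def msglen_class_params(num_classes, params_per_class=2):
    """e.g. mu and sigma for each Normal class."""
    return BITS_PER_PARAM * params_per_class * num_classes

hypothesis_bits = (msglen_num_classes(2)
                   + msglen_proportions([0.5, 0.5])
                   + msglen_class_params(2))
```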


# Data given Hypothesis

e.g.   x | 2-classes:
x has a certain measurement accuracy, an interval.
NB. If the interval is small, the probability density can be treated as constant over it, so the accuracy simply "passes through" the Maths.
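So, for a datum x of measurement accuracy eps, the code length under the mixture is approximately -log2(density(x) × eps). A sketch, with illustrative two-class parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def datum_msglen(x, eps, classes):
    """Code length (bits) of x given the mixture; classes = [(weight, mu, sigma), ...].
    For small measurement accuracy eps, P(x) ~ density(x) * eps."""
    density = sum(w * normal_pdf(x, mu, sigma) for w, mu, sigma in classes)
    return -math.log2(density * eps)

two_classes = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]  # illustrative parameters
bits = datum_msglen(0.9, 0.01, two_classes)        # about 8.76 bits
```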


# Total assignment (of x)

x is most likely in the right-hand class; code length = -log2(shaded)


Membership of the left-hand class is less likely, but conceivable.


Fractional assignment, to both classes, loses no probability:


The probability contributions of two classes, green and shaded:
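Comparing the two codings numerically (illustrative class parameters again): committing to the best class costs -log2 of its contribution alone, while fractional assignment codes from the sum of the contributions:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

classes = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]  # illustrative parameters
x = 0.9

contribs = [w * normal_pdf(x, mu, sigma) for w, mu, sigma in classes]

total_bits = -math.log2(max(contribs))       # commit to the most likely class
fractional_bits = -math.log2(sum(contribs))  # keep both contributions

# sum >= max, so fractional assignment never costs more bits
```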


Total assignment leads to biased estimates of parameters (means, variances):

by effectively transferring areas (unequal in general) between classes. Means move apart, variances reduce.
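A small simulation, not Snob itself, makes the bias visible: assigning each point wholly to the nearer of the two true means pushes the per-class sample means apart and shrinks the sample variances:

```python
import random, statistics

random.seed(1)
MU1, MU2, SIGMA = -1.0, 1.0, 1.0  # true, overlapping classes
data = ([random.gauss(MU1, SIGMA) for _ in range(5000)] +
        [random.gauss(MU2, SIGMA) for _ in range(5000)])

# total (hard) assignment: each point goes wholly to the nearer true mean
left = [x for x in data if abs(x - MU1) <= abs(x - MU2)]
right = [x for x in data if abs(x - MU1) > abs(x - MU2)]

m_left, m_right = statistics.mean(left), statistics.mean(right)
v_left, v_right = statistics.variance(left), statistics.variance(right)
# m_left < MU1 and m_right > MU2 (means pushed apart);
# v_left and v_right both < SIGMA**2 (variances shrunk)
```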


# Coding Data

There are two ways of viewing/implementing fractional assignment:

1. Borrowed bits, Wallace (1986).

e.g. consider a case of 50:50 membership,

• work out code for rest of data,
• "borrow" 1st bit from rest to decide which membership choice to make,
• remove borrowed bit from rest, don't transmit it, receiver can deduce it!

Generalizes to unequal ratios and >2 classes.

2. Directly from a code-book based on the combined, i.e. mixed, distribution.
Note that class memberships are nuisance parameters if we want (only) the class descriptions and, as is typical of nuisance parameters, their number grows in proportion to |input|. Fractional assignment = no nuisance parameters.
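The equivalence of the two views can be checked numerically: whichever class is chosen, the two-part cost minus the bits the receiver can deduce back comes to exactly -log2 P(x), the mixture code length. The weights and likelihoods below are arbitrary illustrations:

```python
import math

# arbitrary illustration: two classes, weights w, and likelihoods P(x | class)
w = [0.7, 0.3]
px_given = [0.10, 0.40]

joint = [wi * pi for wi, pi in zip(w, px_given)]
px = sum(joint)  # mixture probability of x, here 0.19

for c in range(2):
    two_part = -math.log2(w[c]) - math.log2(px_given[c])  # state class c, then x
    bits_back = -math.log2(joint[c] / px)                 # receiver deduces these
    assert abs((two_part - bits_back) - (-math.log2(px))) < 1e-9
```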


# Search

The methods described allow two hypotheses to be compared and the better one chosen, but there remains the search problem - to find the best mixture model. `Snob' does the following:

• Have a current working hypothesis,

• iterate, re-estimating memberships & class parameters.

• Periodically, consider merging two classes, or splitting a class, accept if total message length is reduced.

• Repeat until   MsgLen(Hypothesis) + MsgLen(Data|Hypothesis)   stops reducing.

The search must converge, though possibly only to a local optimum; it works well in practice.
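The loop above can be sketched as follows; the re-estimation step uses fractional memberships, and the split/merge moves are only indicated in a comment (this is a stand-in, not Snob's actual code):

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def data_msglen(data, classes, eps=0.01):
    """MsgLen(Data | Hypothesis) in bits, with fractional assignment."""
    return sum(-math.log2(eps * sum(w * normal_pdf(x, m, s) for w, m, s in classes))
               for x in data)

def reestimate(data, classes):
    """One pass: fractional memberships, then weights, means, sigmas."""
    resp = []
    for x in data:
        contribs = [w * normal_pdf(x, m, s) for w, m, s in classes]
        tot = sum(contribs)
        resp.append([c / tot for c in contribs])
    new_classes = []
    for j in range(len(classes)):
        nj = sum(r[j] for r in resp)
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        new_classes.append((nj / len(data), mu, math.sqrt(var)))
    return new_classes

random.seed(2)
data = ([random.gauss(-2.0, 1.0) for _ in range(300)] +
        [random.gauss(2.0, 1.0) for _ in range(300)])

classes = [(0.5, -1.0, 2.0), (0.5, 1.0, 2.0)]  # crude starting hypothesis
lengths = []
for _ in range(20):
    lengths.append(data_msglen(data, classes))
    classes = reestimate(data, classes)
# Snob would also periodically try merging or splitting classes,
# keeping the change only if the total message length drops.
```

Each pass cannot increase the data part of the message length, so the loop settles, here close to the true means of -2 and 2.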


# Conclusion

• Minimum Message Length (MML) shows how to trade-off model complexity (# of classes) against fit to data.

• Fractional assignment gives unbiased estimates of class parameters and of # of classes (clusters) and,

• given enough data, can find the true model regardless of how close the classes are, even with equal means (& unequal variances).

### References

• C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal 11(2), pp. 185-194, August 1968.
• C. S. Wallace. An improved program for classification. Proc. Australian Comp. Sci. Conf. ACSC9, 8(1), pp. 357-366, February 1986.
- DLD: First description of the bit-borrowing coding technique; also see ICCI '90. Later called "bits-back" by G. E. Hinton & D. van Camp at COLT '93.
• C. S. Wallace & P. R. Freeman. Estimation and Inference by Compact Encoding. J. R. Stat. Soc. B 49, pp. 240-265, 1987. [paper]
• C. S. Wallace. Classification by minimum-message-length encoding. Advances in Computing and Information - ICCI '90, Springer-Verlag, LNCS 468, pp. 72-81, May 1990.
• On the protein example:

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.