
# Unsupervised Classification

#### L. Allison, Computer Science & Software Engineering, Monash University, Australia 3800

Problem:   Given multivariate data, i.e. S things, each having D attributes, place the things in T classes (groups, kinds, species, ...) in a way that best reflects the similarities and differences between the things.

|     | 1    | 2    | ... | D    |
|-----|------|------|-----|------|
| 1   | x1,1 | x1,2 | ... | x1,D |
| 2   | x2,1 | x2,2 | ... | x2,D |
| ... | ...  | ...  | ... | ...  |
| S   | xS,1 | xS,2 | ... | xS,D |
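In code, such data is naturally an S×D array. A tiny hypothetical example (the numbers are invented) in Python with NumPy:

```python
import numpy as np

# Hypothetical example: S = 4 things, each with D = 3 attributes.
# Row i holds thing i's attribute values x[i,1], ..., x[i,D].
data = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.7, 3.1, 4.7],
    [6.3, 3.3, 4.6],
])

S, D = data.shape   # S = 4 things, D = 3 attributes
```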

This document is online at   http://www.csse.monash.edu.au/~lloyd/Archive/2005-04-un-class/index.shtml   and contains hyper-links to other resources.

## Must infer

• the number of classes (clusters),

• the relative abundances of the classes,

• the parameters of each class, and

• for each thing, the class that it (probably) belongs to (but see later!).
N.B. Class memberships are nuisance parameters if we are interested in the optimal class structure!
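For a fixed number of classes T, one common way to estimate the abundances, the class parameters, and the (soft) memberships is the expectation-maximisation (EM) algorithm. The sketch below, for a one-dimensional Gaussian mixture on invented data, is illustrative only and is not necessarily the method developed later; the "responsibilities" `r` are exactly the soft class memberships, i.e. the nuisance parameters just noted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 1-D data: 100 things from each of two Gaussian classes.
x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(5.0, 1.0, 100)])

T = 2                                  # number of classes (assumed known here)
w = np.full(T, 1.0 / T)                # relative abundances of the classes
mu = np.array([x.min(), x.max()])      # class means (crude initialisation)
sd = np.full(T, x.std())               # class standard deviations

def normal_pdf(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

for _ in range(50):
    # E-step: responsibilities r[i, t] = P(class t | thing i),
    # i.e. the soft class memberships.
    p = w * normal_pdf(x[:, None], mu, sd)      # shape (S, T)
    r = p / p.sum(axis=1, keepdims=True)
    # M-step: re-estimate abundances and class parameters from r.
    n = r.sum(axis=0)
    w = n / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n
    sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / n)
```

With classes this well separated, the estimated means approach 0 and 5 and the abundances approach 0.5 each.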

## Variations and related problems

• Hierarchical classification: the classes form a tree structure of classes and sub-classes.

• Phylogenetic (evolutionary) trees: form a tree (often a binary tree) in which each leaf represents one thing and each fork-node represents a (hypothetical) ancestral thing.

• Supervised classification: the classes and their memberships are given, and we must infer a classification function from input attributes to output attributes; see decision (classification) trees.
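To contrast with the unsupervised problem, here is a minimal supervised sketch on invented data: the class memberships are given, and we fit the simplest possible decision tree, a one-split "stump" on a single attribute.

```python
# Invented data: one attribute per thing, with class memberships GIVEN.
xs = [1.0, 1.2, 0.9, 3.8, 4.1, 4.4]
ys = ['a', 'a', 'a', 'b', 'b', 'b']

def majority(labels):
    return max(set(labels), key=labels.count)

# Try every midpoint between adjacent sorted values; keep the split
# that misclassifies the fewest things.
pairs = sorted(zip(xs, ys))
best = None
for i in range(len(pairs) - 1):
    t = (pairs[i][0] + pairs[i + 1][0]) / 2.0
    left = [y for v, y in pairs if v <= t]
    right = [y for v, y in pairs if v > t]
    errors = (sum(y != majority(left) for y in left)
              + sum(y != majority(right) for y in right))
    if best is None or errors < best[0]:
        best = (errors, t, majority(left), majority(right))

errors, threshold, left_class, right_class = best

def classify(v):
    # The inferred classification function: attribute value -> class.
    return left_class if v <= threshold else right_class
```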

## Back to basic Unsupervised Classification: must infer

• the number of classes (clusters),
• the relative abundances of the classes,
• the parameters of each class, and
• (with a qualification) for each thing, the class that it (probably) belongs to.

Simplest model: One class.

Most complex model: One class per thing.

Almost always, neither is sensible.

## The required tools: the abilities to state

• The number of classes, i.e. a positive integer T.

• The class proportions P(classi), i = 1..T, where Σi P(classi) = 1, i.e. a multi-state probability distribution, stated to a suitable accuracy.
And then, for each thing, state its class.

• For each class, for each relevant attribute, the parameters of a suitable probability distribution, stated to a suitable accuracy:
  • e.g. a multi-state distribution for a discrete, unordered attribute;
  • e.g. a normal distribution N(m, s), with mean m and standard deviation s, for a continuous-valued attribute;
  • others?
And then, for each thing in the class, state the thing's attribute values.
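The statements above amount to a two-part message. Below is a heavily simplified sketch of its length in bits, assuming hard class assignments, a single discrete attribute, and invented data, and omitting the cost of stating T and the parameters "to a suitable accuracy".

```python
import math
from collections import Counter

# Invented data: one discrete attribute per thing, plus an assumed
# HARD assignment of each thing to one of T = 2 classes.
things = ['x', 'x', 'y', 'y', 'y', 'z']   # attribute values
classes = [0, 0, 0, 1, 1, 1]              # assumed class of each thing
T = 2
S = len(things)

# Part 1: class proportions give the cost of stating each thing's class.
abundance = Counter(classes)
part1 = -sum(math.log2(abundance[c] / S) for c in classes)

# Part 2: within each class, a multi-state distribution over the
# attribute gives the cost of stating each thing's attribute value.
part2 = 0.0
for t in range(T):
    members = [v for v, c in zip(things, classes) if c == t]
    freq = Counter(members)
    part2 += -sum(math.log2(freq[v] / len(members)) for v in members)

message_bits = part1 + part2
```

On this principle, one candidate classification is preferred over another if its total message is shorter; comparing `message_bits` across candidate classifications, and across values of T, is what drives the model selection.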


Now see "mixture modelling".

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.