CSE455 Learning and Prediction II: MML Data Mining

Lecturer: Dr David L. Dowe.

**Plan**

The half-semester course is to follow on the heels of its main pre-requisite,
CSE454.
(CSE454 plan and CSE454's progress in 2002.)

(A plan for CSE454 and an initial plan for CSE455.)

- Foundations of inductive inference from (algorithmic) information theory;
- intermediate to advanced Minimum Message Length (MML) inference;
- details of Fisher information and uncertainty regions;
- angular/circular models (von Mises, Wrapped Normal or trigonometric);
- the Poisson distribution;
- MML of specific models such as decision graphs, hidden Markov models (HMMs, also known as probabilistic finite state automata, or PFSAs), linear and polynomial regression, causal models, Bayesian nets, time series, sequences, segmentation and trends;
- probabilistic prediction and Kullback-Leibler distance;
- statistical invariance and statistical consistency;
- data mining.

Additional models may include factor analysis, and additional theory may include the Neyman-Scott problem (1948).
Applications to be considered may include: models of protein folding and protein structure prediction, bushfire prediction, text and image analysis, DNA alignment and the Human Genome Project, authorship identification for texts, etc. Further typical applications may be described.
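As a taste of MML model selection for the regression topics above, the two-part trade-off (cost of stating the model plus cost of the data given the model) can be sketched in a few lines. The score below is a crude BIC-style stand-in, not the Wallace-Freeman coding or the Viswanathan-Wallace method taught in the course; the data, degree range and function names are illustrative only.

```python
import numpy as np

def two_part_score(x, y, degree):
    """Crude two-part 'message length' proxy for a polynomial fit:
    data cost = (n/2) * log(RSS/n)   (Gaussian negative log-likelihood,
    up to constants, with ML noise variance), plus a model cost of
    ((degree + 2)/2) * log(n) for the coefficients and noise parameter.
    This is a BIC-style stand-in, NOT the exact MML coding."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    return 0.5 * n * np.log(rss / n) + 0.5 * (degree + 2) * np.log(n)

# Noisy quadratic data (deterministic 'noise' so the example is reproducible):
# a polynomial cannot track sin(7x), so extra degrees buy little fit.
x = np.linspace(-3.0, 3.0, 61)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + 0.1 * np.sin(7.0 * x)
best = min(range(1, 7), key=lambda d: two_part_score(x, y, d))
print("selected degree:", best)
```

The shape of the trade-off is the point: under-fitting (degree 1) is punished through the data cost, over-fitting (degrees 3-6) through the model cost.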

We might also manage to fit in some or all of: polygon modelling; DNA pattern discovery and alignment; evolutionary trees; Lempel-Ziv text compression, C.S. Wallace's improvement upon it (1989, 1996) and approximate repeats; HMMs (PFSAs) in mixture modelling; Markov random fields and images.
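The Lempel-Ziv connection can be made concrete with a minimal LZ78-style parse (the dictionary-building variant); Wallace's improvement and approximate repeats are not sketched here.

```python
def lz78_parse(text):
    """LZ78 parse: split text into phrases, each a previously seen phrase
    extended by one new character.  Returns the phrase list; a compressor
    would transmit the (dictionary index, new character) pairs."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch         # keep extending a known phrase
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                   # flush a trailing phrase with no new character
        phrases.append((dictionary[current], ""))
    return phrases

print(lz78_parse("aababcbc"))    # [(0, 'a'), (1, 'b'), (2, 'c'), (0, 'b'), (0, 'c')]
```

Because phrases grow as the dictionary fills, repetitive text parses into few, long phrases, which is why the code length relates to the data's compressibility.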

**Progress**

The half-semester course followed on the heels of its main pre-requisite,
CSE454,
whose 7th and last lecture was on Tuesday 23rd April 2002.

I gave an informal, non-examinable, 30 min. introduction to CSE455 at the end of the CSE454 revision lecture.

Admin., introduction to and outline of the course; Gaussian/Normal distribution (re-cap); von Mises distribution (outline and flaunting of empirical results); description of the Wrapped Normal (WN) distribution.
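A minimal pure-Python sketch of the two circular densities discussed. The power series for the modified Bessel function I0 and the truncated wrapping sum are standard; the helper names and truncation limits are our own choices.

```python
import math

def bessel_i0(kappa, terms=30):
    """Modified Bessel function I0 via its power series
    sum_k ((kappa/2)^(2k) / (k!)^2); adequate for moderate kappa."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (kappa / (2.0 * k)) ** 2
        total += term
    return total

def von_mises_pdf(x, mu, kappa):
    """von Mises density on the circle, mean direction mu, concentration kappa:
    f(x) = exp(kappa * cos(x - mu)) / (2*pi * I0(kappa))."""
    return math.exp(kappa * math.cos(x - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def wrapped_normal_pdf(x, mu, sigma, wraps=10):
    """Wrapped Normal density: the N(mu, sigma^2) density summed over
    shifts of 2*pi (truncated to +/- `wraps` turns)."""
    return sum(
        math.exp(-((x - mu + 2.0 * math.pi * k) ** 2) / (2.0 * sigma ** 2))
        / (sigma * math.sqrt(2.0 * math.pi))
        for k in range(-wraps, wraps + 1)
    )
```

At kappa = 0 the von Mises density reduces to the uniform density 1/(2*pi); as kappa grows (or sigma shrinks), both distributions approach a Normal on the circle.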

Handed out Ass't 1 (due at G.O., 12:00 noon, Wedn. 22/5); Fisher information (in some detail) and the Wallace-Freeman (1987) message length approximation; introduction to decision graphs.
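The Wallace-Freeman (1987) approximation can be illustrated on a binomial sample with a uniform prior, where the Fisher information is F(p) = n/(p(1-p)) and the one-dimensional lattice constant is kappa_1 = 1/12. This sketch minimises the message length over a grid rather than quoting the closed-form estimate p = (s + 1/2)/(n + 1).

```python
import math

def wf87_message_length(s, n):
    """Wallace-Freeman (1987) message length (in nits) for s successes in
    n binomial trials, uniform prior h(p) = 1 on (0, 1):
        MessLen = -log h(p) + 0.5*log F(p) - log f(s|p) + 0.5*(1 + log(1/12)),
    with Fisher information F(p) = n / (p(1-p)).  The estimate p-hat is the
    value minimising the total message length."""
    def messlen(p):
        fisher = n / (p * (1.0 - p))
        log_like = (math.log(math.comb(n, s))
                    + s * math.log(p) + (n - s) * math.log(1.0 - p))
        return (0.5 * math.log(fisher) - log_like
                + 0.5 * (1.0 + math.log(1.0 / 12.0)))
    # Minimise over a fine grid; the optimum is p = (s + 1/2) / (n + 1).
    grid = [i / 10000.0 for i in range(1, 10000)]
    p_hat = min(grid, key=messlen)
    return p_hat, messlen(p_hat)

p_hat, L = wf87_message_length(s=30, n=100)
print(p_hat, L)
```

Note how the 0.5*log F(p) term pulls the estimate slightly away from the maximum likelihood value s/n = 0.30, towards 1/2.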

Discussion of Ass't 1 (still due at G.O., 12:00 noon, Wedn. 22/5); cut-points for (binary) cuts in decision trees; lookahead and search heuristics in decision trees; sketch of (Oliver-Wallace binary-join) decision graphs; application of dtrees to bushfire prediction; application of dgraphs to protein secondary structure (E/H/O) prediction; re-cap of Kullback-Leibler distance (probabilistic and Gaussian footy-tipping competitions, probabilistic scoring and Gaussian scoring).
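The Kullback-Leibler distance and the probabilistic-scoring idea behind the footy-tipping competition can be sketched as follows. The 1 + log2(p) reward is one common form of the binary log-score; treat the competition's exact rules as unspecified here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler distance (in bits) from distribution p to q,
    both given as lists of probabilities over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def log_score(prob_of_outcome):
    """Probabilistic-competition reward: 1 + log2 of the probability the
    tipper assigned to the outcome that actually occurred.  Tipping 0.5
    scores 0; a confident correct tip approaches 1; a confident wrong
    tip goes steeply negative."""
    return 1.0 + math.log2(prob_of_outcome)
```

The expected log-score penalty for tipping q when the truth is p is exactly the KL distance from p to q, which is why honest, well-calibrated probabilities maximise a tipper's long-run score.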

Further discussion of Ass't 1 (now due at G.O., 12:00 noon, Fri. 24/5); statistical invariance; statistical consistency (the Neyman-Scott problem and the likelihood function, statistical inconsistency in mixture modelling, loose outline of statistical inconsistency in factor analysis); outline of two-part Kolmogorov complexity and MML (Wallace and Dowe, Comp. J., Vol. 42, No. 4 (1999a), pp. 270-283); (hurried through) univariate linear and polynomial regression (Viswanathan-Wallace (1999); Fitzgibbon-Dowe-Allison, ICML'2002).
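The Neyman-Scott (1948) inconsistency is easy to demonstrate by simulation: with only two observations per nuisance mean, the maximum likelihood variance estimate converges to half the true value, no matter how much data arrives. The parameter values below are illustrative only.

```python
import random

def neyman_scott_ml_variance(num_pairs, mu_spread=5.0, sigma=2.0, seed=1):
    """Neyman-Scott setting: each 'problem' contributes just two observations
    N(mu_i, sigma^2) with its own nuisance mean mu_i.  Maximum likelihood
    estimates each mu_i by the pair average, and its pooled variance estimate
    converges to sigma^2 / 2 -- not sigma^2 -- as num_pairs grows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_pairs):
        mu_i = rng.uniform(-mu_spread, mu_spread)
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        m = (x1 + x2) / 2.0                      # ML estimate of mu_i
        total += (x1 - m) ** 2 + (x2 - m) ** 2   # residual sum of squares
    return total / (2.0 * num_pairs)             # ML estimate of sigma^2

print(neyman_scott_ml_variance(200000))  # close to sigma^2 / 2 = 2.0, not 4.0
```

The number of nuisance parameters grows with the data, so the usual consistency argument for maximum likelihood fails; this is one of the motivating examples for MML's treatment of all parameters in the message.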

Handed out Ass't 2 (due at G.O., 12:00 noon, Wedn. 12/6) and discussed it at length (including more on Kullback-Leibler distance(s)); thorough re-cap on univariate polynomial regression (Viswanathan-Wallace (1999); Fitzgibbon-Dowe-Allison, ICML'2002); briefly outlined MML Bayesian/"causal" networks; briefly began on Probabilistic Finite State Automata (PFSAs), a.k.a. Hidden Markov Models (HMMs).
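A PFSA can be sketched as a per-state table of symbol probabilities and next states; the probability of a string is the product of the transition probabilities along its (unique, deterministic) path. The two-state machine below is hypothetical, for illustration only.

```python
def pfsa_probability(string, transitions, start=0):
    """Probability that a Probabilistic Finite State Automaton emits `string`.
    `transitions` maps state -> symbol -> (probability, next_state), with each
    state's outgoing probabilities summing to 1."""
    prob, state = 1.0, start
    for symbol in string:
        if symbol not in transitions[state]:
            return 0.0
        p, state = transitions[state][symbol]
        prob *= p
    return prob

# A hypothetical two-state PFSA over the alphabet {a, b}.
pfsa = {
    0: {"a": (0.7, 0), "b": (0.3, 1)},
    1: {"a": (0.4, 0), "b": (0.6, 1)},
}
print(pfsa_probability("aab", pfsa))   # 0.7 * 0.7 * 0.3
```

Because each state's probabilities sum to 1, the machine defines a proper distribution over strings of each length, and -log2 of the probability is the (idealised) code length an MML coding would charge for the data given this machine.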

Reminder re measurement accuracy. Discussion of Ass't 2 (now due at G.O., 12:00 noon, Fri. 14/6). Detailed continuation of Probabilistic Finite State Automata (PFSAs), a.k.a. Hidden Markov Models (HMMs). Short mention of time-series binary cut-point problem. Hand-out of practice exam.
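The time-series binary cut-point problem mentioned above can be sketched as a two-part trade-off: pay roughly log(n-1) nits to state where the cut falls, plus an adaptive code for each segment, and accept the cut only if the total beats encoding the sequence whole. The segment code below (the Laplace / marginal-likelihood code) is an illustrative choice, not necessarily the lecture's exact scheme.

```python
import math

def code_length(seg):
    """Cost (nits) of adaptively encoding a binary segment: -log of its
    marginal probability under a uniform prior on the bias, which is
    s!(n-s)!/(n+1)! for s ones in n bits (computed via lgamma)."""
    n, s = len(seg), sum(seg)
    return -(math.lgamma(s + 1) + math.lgamma(n - s + 1) - math.lgamma(n + 2))

def best_cut(bits):
    """Two-part choice of a single cut-point: log(n-1) nits to state the cut,
    plus the two segment costs, compared against the no-cut cost.  Returns
    (cut_index, cost), with cut_index = None if no cut is worth stating."""
    n = len(bits)
    no_cut = code_length(bits)
    if n < 2:
        return None, no_cut
    cut_costs = {
        t: math.log(n - 1) + code_length(bits[:t]) + code_length(bits[t:])
        for t in range(1, n)
    }
    t_best = min(cut_costs, key=cut_costs.get)
    if cut_costs[t_best] < no_cut:
        return t_best, cut_costs[t_best]
    return None, no_cut

print(best_cut([0] * 20 + [1] * 20))   # the cut at position 20 pays for itself
```

On a sequence whose bias genuinely changes, the two short segment codes more than repay the log(n-1) cost of the cut; on a homogeneous (or rapidly alternating) sequence, no cut is declared.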

Final exam scheduled for Wed. 19/6 a.m., lecture theatre S5 (bldg. 24), 9:30- either 11:00 (90 mins) or 11:30 (120 mins).

Revision of practice exam and other questions, if desired. I ended up doing some consultation in my office.


**Consultation**

Tuesdays 12-1; Thursdays 12-1.

**Reference(s)**

C.S. Wallace and D.L. Dowe (1999a), "Minimum Message Length and Kolmogorov complexity", Computer Journal, Vol. 42, No. 4, pp. 270-283.

**Field of education code(s)**

020119 Artificial Intelligence, 010103 Statistics.

**Links**

Minimum Message Length (MML), Occam's razor, and data collections.

I taught CSC423 Learning and Prediction on MML from 1997-2001 (4 points, then 6 points) before it was split into CSE454 (3 points) and CSE455 Learning and Prediction II: MML Data Mining (3 points).

I am co-ordinating CSE50DM: Statistics of data and data mining, part of the Graduate Certificate in Computational Science.

MML talks and CSSE Clayton seminars.

MML Bayes Nets with Decision Trees.

MML Decision Trees.

Ockham's razor.

Pi = 3.141592653589793....

Probabilistic football-tipping competition (and outline of probabilistic prediction), free, with prizes for some secondary students. The world's longest-running, and Australia's first.

Snob program for MML mixture modelling and clustering.

Short course slides.

Semester dates.

Some useful links: TheBreastCancerSite, TheHungerSite, TheRainforestSite, www.GiveWater.org, "do-goody"/"improving the world".

Copyright Dr David L. Dowe, Monash University, Australia, 9th May 2002, etc.

Link to 2006 CSE455 courseware notes.

Copying is not permitted without express permission from David L. Dowe.