CSE455 Learning and Prediction II: MML Data Mining


CSE455 Learning and Prediction II: MML Data Mining
Lecturer: Dr David L. Dowe.

Plan
The half-semester course is to follow on the heels of its main pre-requisite, CSE454. (CSE454 plan and CSE454's progress in 2002.)
(A plan for CSE454 and an initial plan for CSE455.)
Foundations of inductive inference from (algorithmic) information theory; intermediate to advanced Minimum Message Length (MML) inference; details of Fisher information and uncertainty regions; angular/circular models (von Mises, Wrapped Normal or trigonometric); Poisson distribution; MML of specific models such as decision graphs, hidden Markov models (or HMMs, also known as probabilistic finite state automata, or PFSAs), linear and polynomial regression, causal models, Bayesian nets, time series, sequences, segmentation, trends; probabilistic prediction and Kullback-Leibler distance. Statistical invariance, statistical consistency. Data mining. Additional models may include factor analysis and additional theory may include the Neyman-Scott problem (1948). Applications to be considered may include: models of protein folding and protein structure prediction, bushfire prediction, text and image analysis, DNA alignment and the human genome project, authorship identification for texts, etc. Further typical applications may be described.

Might also manage to fit in some or all of: polygon modelling; DNA pattern discovery and alignment, evolutionary trees; Lempel-Ziv text compression, C.S. Wallace improvement (1989, 1996), approximate repeats; HMMs (PFSAs) in mixture modelling; Markov fields, images.


Progress
The half-semester course followed on the heels of its main pre-requisite, CSE454, whose 7th and last lecture was on Tuesday 23rd April 2002.

  • Week 0, Tuesday 23rd April 2002
    I gave an informal, non-examinable, 30 min. introduction to CSE455 at the end of the CSE454 revision lecture.

  • Week 1, Tuesday 30th April 2002
    Admin., introduction to and outline of course, Gaussian/Normal distribution (re-cap), von Mises distribution (outline and flaunting of empirical results),
    description of Wrapped Normal (WN) distribution.

  • Week 2, Tuesday 7th May 2002
    Handed out Ass't 1 (due at G.O., 12:00 noon, Wedn. 22/5), Fisher information (in some detail) and Wallace-Freeman (1987) message length approximation,
    introduction to decision graphs.

  • Week 3, Tuesday 14th May 2002
    Discussion of Ass't 1 (still due at G.O., 12:00 noon, Wedn. 22/5), cut-points for (binary) cuts in decision trees, lookahead and search heuristics in decision trees,
    sketch of (Oliver-Wallace binary-join) decision graphs, application of dtree to bushfire prediction, application of dgraph to protein secondary structure (E/H/O) prediction,
    re-capping on Kullback-Leibler distance (probabilistic and Gaussian footy-tipping competitions, probabilistic scoring and Gaussian scoring).

  • Week 4, Tuesday 21st May 2002
    Further discussion of Ass't 1 (now due at G.O., 12:00 noon, Fri. 24/5); statistical invariance,
    statistical consistency (Neyman-Scott problem and likelihood function, statistical inconsistency in mixture modelling, loose outline of statistical inconsistency in factor analysis);
    outline of two-part Kolmogorov complexity and MML (Wallace and Dowe, Comp. J., Vol 42, No. 4 (1999a), pp270-283);
    (hurried through) univariate linear and polynomial regression (Viswanathan-Wallace (1999); Fitzgibbon-Dowe-Allison, ICML'2002).

  • Week 5, Tuesday 28th May 2002
    Handed out Ass't 2 (due at G.O., 12:00 noon, Wedn. 12/6) and discussed at length (including more on Kullback-Leibler distance(s));
    thorough re-cap on univariate polynomial regression (Viswanathan-Wallace (1999); Fitzgibbon-Dowe-Allison, ICML'2002);
    briefly outlined MML Bayesian/``causal'' networks; briefly began on Probabilistic Finite State Automata (PFSAs), a.k.a. Hidden Markov Models (HMMs).

  • Week 6, Tuesday 4th June 2002 (still to come)
    Reminder re measurement accuracy. Discussion of Ass't 2 (now due at G.O., 12:00 noon, Fri. 14/6). Detailed continuation of Probabilistic Finite State Automata (PFSAs), a.k.a. Hidden Markov Models (HMMs). Short mention of time-series binary cut-point problem. Hand-out of practice exam.
    Final exam scheduled for Wed. 19/6 a.m., lecture theatre S5 (bldg. 24), 9:30- either 11:00 (90 mins) or 11:30 (120 mins).

  • Week 7, Tuesday 11th June 2002
    Revision of practice exam and other questions, if desired. I ended up doing some consulation in my office.

  • Final exam, Wednesday 19th June 2002, 9:30a.m., Lecture theatre S5 (bldg. 24) (still to come)
    Final exam scheduled for Wed. 19/6 a.m., lecture theatre S5 (bldg. 24), 9:30- either 11:00 (90 mins) or 11:30 (120 mins).

    Consultation
    Tuesdays 12-1; Thursdays 12-1.

    Reference(s)
    C.S. Wallace and D.L. Dowe (1999a), "Minimum Message Length and Kolmogorov complexity", Comp. J., Vol. 42, No. 4, pp270-283.

    Field of education code(s)
    020119 Artificial Intelligence, 010103 Statistics.

    Links
    Minimum Message Length (MML), Occam's razor, and data collections.
    I taught CSC423 Learning and Prediction on MML from 1997-2001 (4 points, then 6 points) before it was split into CSE454 (3 points) and CSE455 Learning and Prediction II: MML Data Mining (3 points).
    I am co-ordinating CSE50DM: Statistics of data and data mining, part of the Graduate Certificate in Computational Science.
    MML talks and CSSE Clayton seminars.
    MML Bayes Nets with Decision Trees.
    MML Decision Trees.
    Ockham's razor.
    Pi = 3.141592653589793....
    Probabilistic football-tipping competition (and outline of probabilistic prediction), free with prizes for some secondary students. World longest's running and Australia's first.
    Snob program for MML mixture modelling and clustering.
    Short course slides.
    Semester dates.
    Some useful links, TheBreastCancerSite, TheHungerSite, TheRainforestSite, www.GiveWater.org, "do-goody"/"improving the world".

  • This page, http://www.csse.monash.edu.au/~dld/tilde/CSE4/CSE455/2002, was last updated no earlier than 9th May 2002.

    Copyright Dr David L. Dowe, Monash University, Australia, 9th May 2002, etc.

    Link to 2006 CSE455 courseware notes.

    Copying is not permitted without expressed permission from David L. Dowe.