Introduction to MML

home1 home2
and the

 Ind. Inf.

Inductive Inference is detective work. It is the business of trying to put order on a mass of data, i.e. coming up with an explanation for it. The data are often contradictory and contain measurement errors and other uncertainties. People do inference all of the time; it is one of the hall-marks of intelligence. Scientists do it when they form theories from experimental data. The trouble is that an unlimited number of theories could explain some given facts. Which is the best one?

William of Ockham (1285-1349) is often credited with saying something like, "if two theories explain the facts equally well then the simpler theory is to be preferred". This principle is known as Ockham's razor. The razor is rather vague, but it can be made precise and Minimum Message Length encoding (MML) does just that: 

P(H&D) = P(H).P(D|H) = P(D).P(H|D) - Bayes
msgLen(E) = -log2(P(E)) - Shannon
hence msgLen(H&D) = msgLen(H) + msgLen(D|H)
Bayes' theorem shows that the hypothesis, H, that best explains the data, D, i.e. the hypothesis with the largest posterior probability, P(H|D), is the one maximising P(H&D), and so minimising msgLen(H)+msgLen(D|H), hence the terminology `Minimum Message Length' after Shannon's `mathematical theory of communication' (1948). Note that prediction is a related but different problem to explanation and requires correctly weighting the predictions of many hypotheses to give optimal results. Techniques for producing practical machine learning programs that apply MML to many inference problems have been developed in Computer Science at Monash University over 30+ years. Some of these developments are described below.

[se al img gif (12K)]


[se al img gif (15K)]

Snob is a classification program. It relies on MML to form the best set of classes (groups, clusters, species, types, ...) to describe a given set of "things" (individuals, items, measurements, observations, ...).

The first application of Snob was to a collection of measurements taken from 217 seal skulls. Snob found seven classes which correspond well to the male and female groups of those species that were represented in reasonable numbers - remember that Snob knows nothing about biology and was not told the sex of the seals.

Since then, Snob has been applied to data from earth sciences, medicine, molecular biology, psychology and many other areas.

C. S. Wallace & D. M. Boulton. An information measure for classification. The Computer Journal, 11(2), pp.185-194, August 1968.
C. S. Wallace & P. R. Freeman. Estimation and inference by compact coding. Jrnl. Royal Stat. Soc., 49(3), pp.240-265, 1987

NB. The problem addressed is also known as unsupervised classification, (numerical) taxonomy, mixture modelling (in statistics), and clustering (in machine learning).

Sequences and Time Series

change point,
looks familiar,
what's next,

Sequence or (time-) series data may contain hidden structure (correlations). The MML classification method above has been extended to allow the `class' of a thing (item, measurement, observation, ...) to be influenced by other things, e.g. in a Markov model or a "repeats" model. The new methods allow the true class structure to be recovered on less data than before and indicates where and why class membership (probably) changed. (A thing can have many attributes; an attribute can be drawn from normal (Gaussian), Poisson, von-Mises, etc. distributions.)

See: T. Edgoose & L. Allison. MML Markov classification of sequential data. Stats. and Comp., 9(4), pp.269-278, 1999.

The question of what is a good model for sequential data from an unknown source is relevant to data compression, inductive inference, and prediction. Lempel-Ziv based models (1976) asymptotically converge (in performance) to a wide class of models and this gives them a claim to be a good model to use on an unknown source; however, the convergence is slow and can require a large amount of data. The approximate-repeats model converges more rapidly on DNA and other difficult data and can be used to find better explanations of the data:

See: L.Allison, T.Edgoose & T. I. Dix. Compression of strings with approximate repeats. Intell. Sys. in Mol. Biol, pp.8-16, 1998.

There are many measures for the similarity of two sequences: longest common subsequence (LCS), edit-distance, time-warps, etc.. These measures generally assume that each sequence (series) is random, except possibly for a bias in the distribution of observations, i.e. each observation is assumed to be independent of the others. It has been shown how to incorporate a wide class of "models" of sequences into similarity measures:

D. R. Powell, L. Allison, T. I. Dix, D. L. Dowe. Alignment of low information sequences. Aust. Comp. Sci. Theory Symp. '98, pp.215-230, Springer Verlag, 1998.
L.Allison, D. Powell & T. I. Dix. Compression and Approximate Matching. Computer Jrnl. 42(1), pp.1-10, 1999.
D. R. Powell, L. Allison & T. I. Dix. Modelling alignment for non-random sequences. 17th ACS Australian Joint Conf. on AI (AI2004), Springer-Verlag, LNCS/LNAI 3339, pp.203-214, 2004.

Decision Trees and Graphs

expert system,
rule extraction,
simple v. complex,
supervised learning,

A decision tree represents a certain type of decision-function, based on tests on the attributes (properties, variables, ...) of a thing (case, item, observation, ...). The tree structure can be drawn and examined, and it represents a kind of explanation of the data. Decision trees are used in expert systems to "learn" how a human expert does what she or he does well, given a training set of examples.

There is a very large number of possible decision trees for a given data set, and MML can be used to choose a best one, balancing the size or complexity of the tree against the accuracy of its predictions and thus avoiding the well-known problem of over-fitting (i.e. fitting the noise in the data). Wallace and Patrick showed how to improve on previous attempts to judge this trade-off.

See: C. S. Wallace & J. D. Patrick. Coding decision trees. Machine Learning, 11, pp.7-22, 1993.

Decision Graphs are generalisations of decision trees. They model the same class of decision functions, but can express disjunctive conditions ("or") much more economically. This often means that a better inference can be made given less data, and that the resulting graph is easier to visualise than the corresponding tree.

See: J. Oliver. Decision graphs - an extension of decision trees. 4th Int. Conf on Artificial Intelligence, pp.343-359, 1993.

NB. A decision tree (graph) is more correctly known as a classification tree (graph) and the problem addressed is then called supervised classification.

Causal Models, Bayesian Networks

cause & effect,
why is it so?

Causal models (Bayesian networks) are used by medical, biological and social scientists to explain a large variety of phenomena, e.g. does smoking cause lung cancer, or does increased public spending on education lead to a long-term increase in national wealth?

Bayesian networks represent the probabilistic relationships between variables (attributes) in graph-based models. They can be used to represent any joint probability function. They are particularly useful when the direct (causal) relationships between variables are sparse, because the update procedures for accomodating new observations are efficient. Approximation inference techniques are available for complex, non-sparse domains. Dynamic Bayesian networks can be used to model relationships that are dependent on previous states of the model. Example applications include diagnosis of telecommunication faults, pipeline monitoring and control, and medical diagnosis. Bayesian networks are beginning to be widely used in expert systems as an alternative to neural networks.

MML is used to cost the models and perform a trade-off between model complexity and accuracy of fit to the observational data. Both real-valued and discrete-valued models can be discovered using one of the following search strategies: greedy algorithms, genetic algorithms, stochastic sampling.

See: C. S. Wallace & K. B. Korb. Learning causal models by MML sampling. In Causal Models and Intelligent Data Management, A. Gammerman (ed), Springer Verlag, 1998.
R. T. O'Donnell, L. Allison & K. B. Korb. Learning Hybrid Bayesian Networks by MML. Springer Verlag, LNCS Vol.4304, pp.192-203, 2006.

[st0ne circle, jpg (11K)] Megalithic Stone Circles

There are dozens of stone circles in the British Isles. It turns out that most are not perfectly circular and elaborate theories have been proposed to explain the deviations from circularity. The question is, are these elaborate theories correct, or are the "circles" simply inaccurate - made by ancient peoples with rough measuring instruments on rough ground? MML comes down firmly in favour of the latter - they are just rough circles.

J. D. Patrick. An Information Measure Comparative Analysis of Megalithic Geometries. PhD Thesis, Computer Science, Monash University, 1979.
J. D. Patrick & C. S. Wallace. Stone circle geometries: an information theory approach. in Archaeoastronomy in the Old World, D.Heggie (ed), C.U.P., 1982.

Hand of cards gif (10K) Bayesian Poker Player

Poker is a good Inductive Inference problem. You only have partial information and your opponents are actively trying to deceive you. Are they bluffing? It is easy to calculate the probability that a good hand or a bad hand was dealt to someone. But your opponents know that you can do that. You know that they know that you can do that, and ......

The Poker project is a test-bed for machine learning methods such as Bayesian networks.

(Bayes published `An Essay Towards Solving a Problem in the Doctrine of Chances' in 1763.)

Molecular Biology

Molecular-Biology data contain experimental error and random "noise" which make analysis difficult. MML can deal with error and noise and, for example, has been applied to classifying dihedral angles in proteins to throw light on typical protein structures. Understanding protein structure is important in medicine and drug design.

[phi v. psi graph]

Phi and psi are the two free dihedral-angles per amino-acid on a protein backbone. The peaks represent the centres of classes found by the program.

See: T. Edgoose, L. Allison & D. L. Dowe. An MML classification of protein structure that knows about angles and sequences. Pacific Symposium on Biocomputing '98, pp.585-596, January 1998.

MML has been used in other computer programs for Molecular Biology.

Computer Science

The MML research group carries out research into Inductive Inference. Computer Science units in Computer Programming, Algorithms and Data Structures, Foundations of C.S. and Artificial Intelligence give a good background for this work.

Also see C.S.Wallace's publications.

-- LA 1995--2005
Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

free op. sys.
free office suite
~ free photoshop
web browser

Also see:

© L. Allison   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Wednesday, 26-Jun-2024 08:28:21 AEST.