Minimum Message Length, MML

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

MML
Basics
Glossary
Discrete
Continuous
Structured
SMML
KL dist'
"Art"
Ideal Gas

also see:
II
  ACSC03
  JFP05
  ACSC06
AI06
ActaOe.07
CSW
History
Bioinf.

Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.

Also see models & parameters, William of Ockham, Thomas Bayes, R. A. Fisher, and C. S. Wallace.
There is a non-Technical Introduction [here].

Introduction

See also:

C. S. Wallace and M. P. Georgeff. A General Objective for Inductive Inference. TR 32, Dept. Computer Science, Monash University, March 1983.
J. J. Oliver and D. J. Hand, Introduction to Minimum Encoding Inference, TR 4-94, Dept. Stats. Open Univ. and also TR 94/205 Dept. Comp. Sci. Monash Univ.
J. J. Oliver and R. A. Baxter, MML and Bayesianism: Similarities and Differences, TR 94/206.
R. A. Baxter and J. J. Oliver, MDL and MML: Similarities and Differences, TR 94/207
The Computer Journal, special issue 42(4), 1999, includes:
- C. S. Wallace & D. L. Dowe, Minimum Message Length and Kolmogorov Complexity, pp.270-283, also ...
- Refinements of MDL and MML Coding, pp.330-337
- and discussion on MML and MDL by various authors.
G. E. Farr & C. S. Wallace, The complexity of strict minimum message length inference, Computer Journal 45(3) pp.285-292 2002 [link].
C. S. Wallace, A Brief History of MML, 20/11/2003.
L. Allison, Models for machine learning and data mining in functional programming, J. Functional Programming (JFP), 15(1), pp.15-32, doi:10.1017/S0956796804005301, Jan. 2005.
C. S. Wallace, Statistical & Inductive Inference by MML, Springer, Information Science and Statistics, isbn:038723795X, 2005.
L. Allison, Coding Ockham's Razor, Springer, isbn13:978-3319764320, doi:10.1007/978-3-319-76433-7, 2018.

For a hypothesis H and data D we have from Bayes:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)

P(H), prior probability of hypothesis H
P(H|D), posterior probability of hypothesis H
P(D|H), likelihood of the hypothesis, actually a function of the data given H.

From Shannon's Mathematical Theory of Communication (1949) we know that in an optimal code, the message length of an event E, MsgLen(E), where E has probability P(E), is given by MsgLen(E) = -log₂(P(E)):

MsgLen(H&D) = MsgLen(H)+MsgLen(D|H) = MsgLen(D)+MsgLen(H|D)

Now in inductive inference one often wants the hypothesis H with the largest posterior probability. MsgLen(H) can usually be estimated well, for some reasonable prior on hypotheses. MsgLen(D|H) can also usually be calculated. Unfortunately it is often impractical to estimate P(D) which is a pity because it would yield P(H|D). However, for two rival hypotheses, H and H'

MsgLen(H|D)-MsgLen(H'|D) = MsgLen(H)+MsgLen(D|H) - MsgLen(H')-MsgLen(D|H') = posterior -log odds ratio

Consider a transmitter T and a receiver R connected by one of Shannon's communication channels. T must transmit some data D to R. T and R may have previously agreed on a code book for hypotheses, using common knowledge and prior expectations. If T can find a good hypothesis, H, (theory, structure, pattern, ...) to fit the data then she may be able to transmit the data economically. An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and (ii) transmit D given H taking MsgLen(D|H) bits.

The message paradigm keeps us "honest": Any information that is not common knowledge must be included in the message for it to be decipherable by the receiver; there can be no hidden parameters.
This issue extends to inferring (and stating) real-valued parameters to the "appropriate" level of precision.
The method is "safe": If we use an inefficient code it can only make the hypothesis look less attractive than otherwise.
There is a natural hypothesis test: The null-theory corresponds to transmitting the data "as is". (That does not necessarily mean in 8-bit ascii; the language must be efficient.) If a hypothesis cannot better the null-theory then it is not acceptable.

A more complex hypothesis fits the data better than a simpler model, in general. We see that MML encoding gives a trade-off between hypothesis complexity, MsgLen(H), and the goodness of fit to the data, MsgLen(D|H). The MML principle is one way to justify and realise Occam's razor.

Continuous Real-Valued Parameters

When a model has one or more continuous, real-valued parameters they must be stated to an "appropriate" level of precision. The parameter must be stated in the explanation, and only a finite number of bits can be used for the purpose, as part of MsgLen(H). The stated value will often be close to the maximum-likelihood value which minimises MsgLen(D|H). If the -log likelihood, MsgLen(D|H), varies rapidly for small changes in the parameter, the parameter should be stated to high precision. If the -log likelihood varies only slowly with changes in the parameter, the parameter should be stated to low precision.

The simplest case is the multi-state or multinomial distribution where the data is a sequence of independent values from such a distribution. The hypothesis, H, is an estimate of the probabilities of the various states (eg. the bias of a coin or a dice). The estimate must be stated to an "appropriate" precision, ie. in an appropriate number of bits.

Applications and Related Areas

Basics: Probability, and Information, and Coding.
Bayesian inference.
Discrete: Including Multinomial and binomial, or multi-state distributions, eg., in strings, sequences and finite state automata. Integers.
Continuous: Including the Normal probability distribution, factor analysis, and von Mises (circular) probability distribution.
Structured: Including HMMs (Hidden Markov Models), Decision-Trees and -Graphs
(also [Bib] search for Dtree, or Dgraph, or Megalithic stone circle).

Classification, clustering or mixture modelling, including a short note on Snob.
(Also J.D.Patrick, SNOB: A program for discriminating between classes, [TR 91/151] with [abstract].)

Bayesian Networks:
Bayesian networks,
Mixed Bayesian Networks ACSC2006 on structured models, BNs, continuous and discrete variables,
CaMML Hybrid Bayesian networks (2006) on local structure.
Causal modelling.
Linear regression, curve and line fitting.
Machine learning.
Bioinformatics -- Molecular Biology applications, eg. string or sequence analysis & alignment, multiple alignment and evolutionary trees.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

Statistical and Inductive Inference by Minimum Message Length, C. S. Wallace, Springer

CSE454

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Saturday, 27-Jul-2024 16:03:44 AEST.