Minimum Message Length (MML)


Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.

Also see models & parameters, William of Ockham, Thomas Bayes, R. A. Fisher, and C. S. Wallace.
There is a non-technical introduction [here].


See also:
  • C. S. Wallace and M. P. Georgeff. A General Objective for Inductive Inference. TR 32, Dept. Computer Science, Monash University, March 1983.
  • J. J. Oliver and D. J. Hand, Introduction to Minimum Encoding Inference, TR 4-94, Dept. Stats. Open Univ. and also TR 94/205 Dept. Comp. Sci. Monash Univ.
  • J. J. Oliver and R. A. Baxter, MML and Bayesianism: Similarities and Differences, TR 94/206.
  • R. A. Baxter and J. J. Oliver, MDL and MML: Similarities and Differences, TR 94/207.
  • The Computer Journal, special issue 42(4), 1999, includes:
    • C. S. Wallace & D. L. Dowe, Minimum Message Length and Kolmogorov Complexity, pp.270-283, also ...
    • Refinements of MDL and MML Coding, pp.330-337
    • and discussion on MML and MDL by various authors.
  • G. E. Farr & C. S. Wallace, The complexity of strict minimum message length inference, Computer Journal 45(3) pp.285-292 2002 [link].
  • C. S. Wallace, A Brief History of MML, 20/11/2003.
  • L. Allison, Models for machine learning and data mining in functional programming, J. Functional Programming (JFP), 15(1), pp.15-32, doi:10.1017/S0956796804005301, Jan. 2005.
  • C. S. Wallace, Statistical & Inductive Inference by MML, Springer, Information Science and Statistics, isbn:038723795X, 2005.
  • L. Allison, Coding Ockham's Razor, Springer, isbn13:978-3319764320, doi:10.1007/978-3-319-76433-7, 2018.

For a hypothesis H and data D we have from Bayes:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)

  • P(H), the prior probability of hypothesis H
  • P(H|D), the posterior probability of H given the data D
  • P(D|H), the likelihood of hypothesis H, i.e. the probability of the data D given H

From Shannon's Mathematical Theory of Communication (1949) we know that in an optimal code, the message length of an event E, MsgLen(E), where E has probability P(E), is given by MsgLen(E) = -log2(P(E)):

MsgLen(H&D) = MsgLen(H)+MsgLen(D|H) = MsgLen(D)+MsgLen(H|D)
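The correspondence can be checked numerically. A minimal sketch with made-up probabilities for H and D (the names `msg_len`, `p_H` etc. are illustrative, not from any library):

```python
import math

def msg_len(p):
    """Optimal code length, in bits, for an event of probability p."""
    return -math.log2(p)

# Illustrative (made-up) probabilities for one hypothesis H and data D.
p_H = 0.25         # prior P(H)
p_D_given_H = 0.1  # likelihood P(D|H)
p_HD = p_H * p_D_given_H  # joint P(H&D)

# The two-part length equals the length of the joint event:
two_part = msg_len(p_H) + msg_len(p_D_given_H)
joint = msg_len(p_HD)
print(two_part, joint)  # both ≈ 5.32 bits
```

The additivity of message lengths is just Bayes's product rule after taking -log2 of each factor.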

Now in inductive inference one often wants the hypothesis H with the largest posterior probability. MsgLen(H) can usually be estimated well, given a reasonable prior on hypotheses, and MsgLen(D|H) can also usually be calculated. Unfortunately it is often impractical to compute P(D), which would otherwise give P(H|D) directly. However, for two rival hypotheses, H and H':

MsgLen(H|D) - MsgLen(H'|D) = MsgLen(H) + MsgLen(D|H) - MsgLen(H') - MsgLen(D|H')

which is the posterior -log odds ratio; the MsgLen(D) terms cancel, so P(D) is not needed to compare hypotheses.
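Since MsgLen(D) cancels, rival hypotheses can be compared directly by their two-part message lengths. A sketch for a coin-tossing example (the hypotheses and their equal priors are invented for illustration):

```python
import math
from math import comb

def binomial_likelihood(p, heads, n):
    """P(D|H) for `heads` heads in n tosses of a coin with P(head) = p."""
    return comb(n, heads) * p**heads * (1 - p)**(n - heads)

def two_part_length(prior, likelihood):
    """MsgLen(H) + MsgLen(D|H), in bits."""
    return -math.log2(prior) - math.log2(likelihood)

# Data: 8 heads in 10 tosses.  Two rival hypotheses, given equal
# (made-up) priors of 0.5 each:
#   H : a fair coin,   P(head) = 0.5
#   H': a biased coin, P(head) = 0.8
len_H  = two_part_length(0.5, binomial_likelihood(0.5, 8, 10))
len_H2 = two_part_length(0.5, binomial_likelihood(0.8, 8, 10))

# The difference of the two-part lengths is the posterior -log odds ratio;
# here the biased-coin hypothesis gives the shorter explanation.
print(len_H - len_H2)  # ≈ 2.78 bits in favour of H'
```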

Consider a transmitter T and a receiver R connected by one of Shannon's communication channels. T must transmit some data D to R. T and R may have previously agreed on a code book for hypotheses, using common knowledge and prior expectations. If T can find a good hypothesis, H, (theory, structure, pattern, ...) to fit the data then she may be able to transmit the data economically. An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and (ii) transmit D given H taking MsgLen(D|H) bits.

The message paradigm keeps us "honest": Any information that is not common knowledge must be included in the message for it to be decipherable by the receiver; there can be no hidden parameters.
This issue extends to inferring (and stating) real-valued parameters to the "appropriate" level of precision.
The method is "safe": If we use an inefficient code it can only make the hypothesis look less attractive than otherwise.
There is a natural hypothesis test: the null-theory corresponds to transmitting the data "as is". (That does not necessarily mean in 8-bit ASCII; the encoding must be efficient.) If a hypothesis cannot better the null-theory then it is not acceptable.

In general, a more complex hypothesis fits the data better than a simpler one. MML encoding thus gives a trade-off between hypothesis complexity, MsgLen(H), and goodness of fit to the data, MsgLen(D|H). The MML principle is one way to justify and realise Ockham's razor.
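The null-theory test and the complexity/fit trade-off can be sketched together for binary data. In this invented example the null theory sends each outcome in one bit, while the explanation pays an assumed, fixed cost of 6 bits for stating an estimated bias (real MML would choose that precision optimally):

```python
import math

def null_length(n):
    """Null theory: transmit n binary outcomes 'as is', 1 bit each."""
    return float(n)

def explanation_length(heads, n, param_bits=6):
    """Two-part message: state an estimated bias to `param_bits` bits of
    precision (an assumed, fixed cost here), then encode the data with it."""
    p = max(min(heads / n, 1 - 2**-param_bits), 2**-param_bits)  # avoid log(0)
    data_bits = -(heads * math.log2(p) + (n - heads) * math.log2(1 - p))
    return param_bits + data_bits

# A near-fair sequence: the hypothesis cannot beat the null theory ...
print(explanation_length(26, 50) > null_length(50))   # True
# ... but a strongly biased sequence is compressed despite the parameter cost.
print(explanation_length(45, 50) < null_length(50))   # True
```

A bias hypothesis is only "acceptable" when the data bits it saves exceed the bits spent stating it.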

Continuous Real-Valued Parameters

When a model has one or more continuous, real-valued parameters they must be stated to an "appropriate" level of precision. The parameter must be stated in the explanation, and only a finite number of bits can be used for the purpose, as part of MsgLen(H). The stated value will often be close to the maximum-likelihood value which minimises MsgLen(D|H). If the -log likelihood, MsgLen(D|H), varies rapidly for small changes in the parameter, the parameter should be stated to high precision. If the -log likelihood varies only slowly with changes in the parameter, the parameter should be stated to low precision.

The simplest case is the multi-state or multinomial distribution, where the data is a sequence of independent values from such a distribution. The hypothesis, H, is an estimate of the probabilities of the various states (e.g. the bias of a coin or a die). The estimate must be stated to an "appropriate" precision, i.e. in an appropriate number of bits.
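A crude numerical sketch of choosing that precision, for the two-state (coin) case: the unit interval is cut into 2^k equal cells under a uniform prior, so stating the estimate costs k bits, and the data is then coded with the cell-centre probability. (Real MML derives the optimal precision analytically, via the Fisher information; this brute-force sweep is only illustrative.)

```python
import math

def total_length(heads, n, k):
    """Two-part length, in bits, when the coin's bias is stated to k bits:
    2**k equal cells under a uniform prior (so the statement costs k bits),
    the data coded with the cell-centre probability nearest the
    maximum-likelihood estimate heads/n."""
    cells = 2 ** k
    cell = min(int(heads / n * cells), cells - 1)
    p = (cell + 0.5) / cells
    data_bits = -(heads * math.log2(p) + (n - heads) * math.log2(1 - p))
    return k + data_bits

# 63 heads in 100 tosses: sweep the precision of the stated estimate.
lengths = {k: total_length(63, 100, k) for k in range(1, 16)}
best = min(lengths, key=lengths.get)
print(best, round(lengths[best], 2))
# Too coarse a statement wastes data bits; too fine wastes statement bits.
```

The total message length is minimised at an intermediate precision, exactly the "appropriate" level described above.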


© L. Allison   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)