Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.
models & parameters,
William of Ockham,
R. A. Fisher, and
C. S. Wallace.
There is a non-Technical Introduction
- C. S. Wallace and M. P. Georgeff.
A General Objective for Inductive Inference.
Dept. Computer Science,
- J. J. Oliver and D. J. Hand,
Introduction to Minimum Encoding Inference,
Dept. Stats. Open Univ. and also
TR 94/205 Dept. Comp. Sci. Monash Univ.
- J. J. Oliver and R. A. Baxter,
MML and Bayesianism: Similarities and Differences,
- R. A. Baxter and J. J. Oliver,
MDL and MML: Similarities and Differences,
- The Computer Journal, special issue 42(4), 1999, includes:
- C. S. Wallace & D. L. Dowe,
Minimum Message Length and Kolmogorov Complexity,
- Refinements of MDL and MML Coding, pp.330-337
- and discussion on MML and MDL by various authors.
- G. E. Farr & C. S. Wallace,
The complexity of strict minimum message length inference,
Computer Journal 45(3) pp.285-292 2002
- C. S. Wallace,
A Brief History of MML, 20/11/2003.
- L. Allison,
Models for machine learning and data mining in functional programming,
J. Functional Programming (JFP), 15(1), pp.15-32,
- C. S. Wallace,
Statistical & Inductive Inference by MML,
Springer, Information Science and Statistics,
- L. Allison,
Coding Ockham's Razor,
Springer, isbn13:978-3319764320, 2018.
For a hypothesis H and data D we have from Bayes:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)
- P(H), prior probability of hypothesis H
- P(H|D), posterior probability of hypothesis H
- P(D|H), likelihood of the hypothesis,
actually a function of the data given H.
From Shannon's Mathematical Theory of Communication (1949) we know
that in an optimal code, the message length of an event E, MsgLen(E),
where E has probability P(E), is given by
MsgLen(E) = -log2(P(E)):
Now in inductive inference one often wants the hypothesis H
with the largest posterior probability.
MsgLen(H) can usually be estimated well,
for some reasonable prior on hypotheses.
MsgLen(D|H) can also usually be calculated.
Unfortunately it is often impractical to estimate P(D)
which is a pity because it would yield P(H|D).
However, for two rival hypotheses, H and H'
= posterior -log odds ratio
a transmitter T and a receiver R
connected by one of Shannon's communication channels.
T must transmit some data D to R.
T and R may have previously agreed on a code book for hypotheses,
using common knowledge and prior expectations.
If T can find a good hypothesis, H, (theory, structure, pattern, ...) to fit
the data then she may be able to transmit the data economically.
An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and
(ii) transmit D given H taking MsgLen(D|H) bits.
The message paradigm keeps us "honest":
Any information that is not common knowledge
must be included in the message for it to be decipherable by the receiver;
there can be no hidden parameters.
This issue extends to inferring (and stating) real-valued parameters
to the "appropriate" level of precision.
The method is "safe":
If we use an inefficient code it can only make the hypothesis look less
attractive than otherwise.
There is a natural hypothesis test:
The null-theory corresponds to transmitting the data "as is".
(That does not necessarily mean in 8-bit ascii;
the language must be efficient.)
If a hypothesis cannot better the null-theory then it is not acceptable.
A more complex hypothesis fits the data better than a simpler model,
We see that MML encoding gives a trade-off between hypothesis complexity,
MsgLen(H), and the goodness of fit to the data, MsgLen(D|H).
The MML principle is one way to justify and realise Occam's razor.
When a model has one or more continuous, real-valued parameters
they must be stated to an "appropriate" level of precision.
The parameter must be stated in the explanation, and only a finite number
of bits can be used for the purpose, as part of MsgLen(H).
The stated value will often be close to the maximum-likelihood value
which minimises MsgLen(D|H).
If the -log likelihood, MsgLen(D|H), varies rapidly for small changes
in the parameter, the parameter should be stated to high precision.
If the -log likelihood varies only slowly with changes in the parameter,
the parameter should be stated to low precision.
The simplest case is the
or multinomial distribution where the data is a sequence of independent
values from such a distribution.
The hypothesis, H, is an estimate of the probabilities of the various
states (eg. the bias of a coin or a dice).
The estimate must be stated to an "appropriate" precision,
ie. in an appropriate number of bits.
or multi-state distributions,
eg., in strings, sequences and finite state automata.
Continuous: Including the
Normal probability distribution,
factor analysis, and
(circular) probability distribution.
HMMs (Hidden Markov Models),
(also [Bib] search for Dtree, or Dgraph,
or Megalithic stone circle).
clustering or mixture modelling, including
a short note on Snob.
J.D.Patrick, SNOB: A program for discriminating between classes,
[TR 91/151] with
Mixed Bayesian Networks ACSC2006
on structured models, BNs, continuous and discrete variables,
CaMML Hybrid Bayesian networks (2006)
on local structure.
- Causal modelling.
- Linear regression, curve and line fitting.
- Machine learning.
- Bioinformatics --
Molecular Biology applications,
eg. string or sequence analysis & alignment,
multiple alignment and evolutionary trees.