Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.
Also see
models & parameters,
William of Ockham,
Thomas Bayes,
R. A. Fisher, and
C. S. Wallace.
There is a non-technical introduction [here].
See also:
- C. S. Wallace and M. P. Georgeff.
A General Objective for Inductive Inference.
TR 32,
Dept. Computer Science,
Monash University,
March 1983.
- J. J. Oliver and D. J. Hand,
Introduction to Minimum Encoding Inference,
TR 4-94, Dept. Statistics, Open University,
and also TR 94/205, Dept. Computer Science, Monash University.
- J. J. Oliver and R. A. Baxter,
MML and Bayesianism: Similarities and Differences,
TR 94/206.
- R. A. Baxter and J. J. Oliver,
MDL and MML: Similarities and Differences,
TR 94/207.
- The Computer Journal, special issue 42(4), 1999, includes:
- C. S. Wallace & D. L. Dowe,
Minimum Message Length and Kolmogorov Complexity,
pp.270-283,
- and, by the same authors, Refinements of MDL and MML Coding, pp.330-337,
- and discussion on MML and MDL by various authors.
- G. E. Farr & C. S. Wallace,
The complexity of strict minimum message length inference,
The Computer Journal, 45(3), pp.285-292, 2002
[link].
- C. S. Wallace,
A Brief History of MML, 20/11/2003.
- L. Allison,
Models for machine learning and data mining in functional programming,
J. Functional Programming (JFP), 15(1), pp.15-32,
doi:10.1017/S0956796804005301,
Jan. 2005.
- C. S. Wallace,
Statistical & Inductive Inference by MML,
Springer, Information Science and Statistics,
isbn:038723795X, 2005.
- L. Allison,
Coding Ockham's Razor,
Springer,
isbn13:978-3319764320,
doi:10.1007/978-3-319-76433-7,
2018.
For a hypothesis H and data D we have from Bayes:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)
- P(H), the prior probability of hypothesis H
- P(H|D), the posterior probability of hypothesis H given the data D
- P(D|H), the likelihood of the hypothesis H,
written as the probability of the data D given H.
From Shannon's Mathematical Theory of Communication (1949) we know
that in an optimal code, the message length of an event E, MsgLen(E),
where E has probability P(E), is given by
MsgLen(E) = -log2(P(E)):
MsgLen(H&D)
= MsgLen(H)+MsgLen(D|H)
= MsgLen(D)+MsgLen(H|D)
Now in inductive inference one often wants the hypothesis H
with the largest posterior probability.
MsgLen(H) can usually be estimated well,
for some reasonable prior on hypotheses.
MsgLen(D|H) can also usually be calculated.
Unfortunately it is often impractical to estimate P(D),
which is a pity because it would yield P(H|D).
However, for two rival hypotheses, H and H',
the unknown MsgLen(D) cancels out of the difference:
MsgLen(H|D) - MsgLen(H'|D)
= MsgLen(H) + MsgLen(D|H) - MsgLen(H') - MsgLen(D|H')
= -log2 of the posterior odds ratio, P(H|D) : P(H'|D)
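For example, here is a minimal Python sketch of that comparison;
the coin data, the two fully specified hypotheses and their equal priors
are invented for illustration:

    from math import log2

    def msglen(p):
        # optimal code length, in bits, of an event of probability p
        return -log2(p)

    heads, tails = 70, 30                 # hypothetical data: 100 coin flips

    # two rival, fully specified hypotheses with assumed prior probabilities
    hypotheses = {
        "fair coin,   p(head) = 0.5": (0.5, 0.5),   # (prior, p_head)
        "biased coin, p(head) = 0.7": (0.5, 0.7),
    }

    for name, (prior, p_head) in hypotheses.items():
        len_H = msglen(prior)                                        # MsgLen(H)
        len_D_given_H = heads * msglen(p_head) + tails * msglen(1 - p_head)
        print(f"{name}: {len_H:.1f} + {len_D_given_H:.1f}"
              f" = {len_H + len_D_given_H:.1f} bits")

    # The difference between the two totals is MsgLen(H|D) - MsgLen(H'|D),
    # i.e. the -log2 posterior odds ratio; here the biased coin wins by
    # about 12 bits, posterior odds of roughly 4000 : 1 in its favour.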
Consider
a transmitter T and a receiver R
connected by one of Shannon's communication channels.
T must transmit some data D to R.
T and R may have previously agreed on a code book for hypotheses,
using common knowledge and prior expectations.
If T can find a good hypothesis H (theory, structure, pattern, ...) to fit
the data, then she may be able to transmit the data economically.
An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and
(ii) transmit D given H taking MsgLen(D|H) bits.
The message paradigm keeps us "honest":
Any information that is not common knowledge
must be included in the message for it to be decipherable by the receiver;
there can be no hidden parameters.
This issue extends to inferring (and stating) real-valued parameters
to the "appropriate" level of precision.
The method is "safe":
If we use an inefficient code it can only make the hypothesis look less
attractive than otherwise.
There is a natural hypothesis test:
The null-theory corresponds to transmitting the data "as is".
(That does not necessarily mean in 8-bit ASCII;
the language must be efficient.)
If a hypothesis cannot better the null-theory then it is not acceptable.
In general, a more complex hypothesis fits the data better than a simpler one.
We see that MML encoding gives a trade-off between hypothesis complexity,
MsgLen(H), and the goodness of fit to the data, MsgLen(D|H).
The MML principle is one way to justify and realise Occam's razor.
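As a rough illustration of the null-theory test and of the complexity versus
fit trade-off, here is a Python sketch on made-up data; the four-symbol
alphabet, the sequence and the crude four bits per stated probability are all
assumptions of the example, not part of MML itself:

    from math import log2

    data = "AAABAACAAAABAAAA"            # hypothetical symbol sequence
    alphabet = "ABCD"                    # assumed common knowledge

    # null theory: transmit the data "as is" with a uniform code,
    # log2(4) = 2 bits per symbol
    null_len = len(data) * log2(len(alphabet))

    # hypothesis H: symbols are independent with stated probabilities;
    # crudely, each of the 3 free probabilities is stated in sixteenths,
    # costing log2(16) = 4 bits each (a real MML calculation would choose
    # this precision optimally)
    probs = {"A": 12/16, "B": 2/16, "C": 1/16, "D": 1/16}
    msglen_H = 3 * log2(16)
    msglen_D_given_H = sum(-log2(probs[symbol]) for symbol in data)

    print(f"null theory : {null_len:.1f} bits")
    print(f"explanation : {msglen_H + msglen_D_given_H:.1f} bits"
          f" ({msglen_H:.1f} for H + {msglen_D_given_H:.1f} for D|H)")

Here the explanation costs about 27 bits against 32 for the null theory, so
the hypothesis is acceptable; for a near-uniform sequence the 12 bits spent
stating the probabilities would not be recouped and the hypothesis would be
rejected.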
Continuous Real-Valued Parameters
When a model has one or more continuous, real-valued parameters,
they must be stated to an "appropriate" level of precision.
The parameter must be stated in the explanation, and only a finite number
of bits can be used for the purpose, as part of MsgLen(H).
The stated value will often be close to the maximum-likelihood value
which minimises MsgLen(D|H).
If the -log likelihood, MsgLen(D|H), varies rapidly for small changes
in the parameter, the parameter should be stated to high precision.
If the -log likelihood varies only slowly with changes in the parameter,
the parameter should be stated to low precision.
The simplest case is the
multi-state
or multinomial distribution where the data is a sequence of independent
values from such a distribution.
The hypothesis, H, is an estimate of the probabilities of the various
states (eg. the bias of a coin or a die).
The estimate must be stated to an "appropriate" precision,
ie. in an appropriate number of bits.
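For instance, the following Python sketch (with invented data, 90 heads in
100 coin flips) illustrates the precision trade-off for a single binomial
parameter: the bias p is stated to b bits, i.e. chosen from a grid of 2^b
values, and the total two-part message length is shortest at an intermediate
precision:

    from math import log2

    heads, tails = 90, 10                 # hypothetical data: 100 flips

    def neg_log_likelihood(p):
        # MsgLen(D|H) for a stated bias p
        return heads * -log2(p) + tails * -log2(1 - p)

    for b in range(1, 11):                               # precisions of 1..10 bits
        grid = [(i + 0.5) / 2**b for i in range(2**b)]   # 2^b candidate values of p
        best_fit = min(neg_log_likelihood(p) for p in grid)
        total = b + best_fit                             # MsgLen(H) + MsgLen(D|H)
        print(f"precision {b:2d} bits: total message = {total:.2f} bits")

    # too little precision forces a poor fit (large MsgLen(D|H)); too much
    # wastes bits on MsgLen(H); on this data the total is smallest at b = 2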
Basics:
Probability,
Information, and
Coding.
Bayesian inference.
Discrete: Including
Multinomial (multi-state) and
binomial distributions,
eg. in strings, sequences and finite state automata.
Integers.
Continuous: Including the
Normal probability distribution,
factor analysis, and
von Mises
(circular) probability distribution.
Structured: Including
HMMs (Hidden Markov Models),
Decision-Trees and
-Graphs
(also [Bib] search for Dtree, or Dgraph,
or Megalithic stone circle).
Classification,
clustering or mixture modelling, including
a short note on Snob.
(Also
J.D.Patrick, SNOB: A program for discriminating between classes,
[TR 91/151] with
[abstract].)
Bayesian Networks:
Bayesian networks;
Mixed Bayesian Networks (ACSC 2006),
on structured models, BNs, and continuous and discrete variables;
CaMML Hybrid Bayesian networks (2006),
on local structure.
- Causal modelling.
- Linear regression, curve and line fitting.
- Machine learning.
- Bioinformatics --
Molecular Biology applications,
eg. string or sequence analysis & alignment,
multiple alignment and evolutionary trees.