
Inductive Inference and Machine Learning by Minimum Message Length (MML) encoding.
Also see
models & parameters,
William of Ockham,
Thomas Bayes,
R. A. Fisher, and
C. S. Wallace.
There is a non-technical introduction
[here].
See also:
 C. S. Wallace and M. P. Georgeff.
A General Objective for Inductive Inference.
TR 32,
Dept. Computer Science,
Monash University,
March 1983.
 J. J. Oliver and D. J. Hand,
Introduction to Minimum Encoding Inference,
TR 494,
Dept. Stats. Open Univ. and also
TR 94/205 Dept. Comp. Sci. Monash Univ.
 J. J. Oliver and R. A. Baxter,
MML and Bayesianism: Similarities and Differences,
TR 94/206.
 R. A. Baxter and J. J. Oliver,
MDL and MML: Similarities and Differences,
TR 94/207.
 The Computer Journal, special issue 42(4), 1999, includes:
 C. S. Wallace & D. L. Dowe,
Minimum Message Length and Kolmogorov Complexity,
pp.270-283,
also ...
 Refinements of MDL and MML Coding, pp.330-337
 and discussion on MML and MDL by various authors.
 G. E. Farr & C. S. Wallace,
The complexity of strict minimum message length inference,
Computer Journal 45(3), pp.285-292, 2002
[link].
 C. S. Wallace,
A Brief History of MML, 20/11/2003.
 L. Allison,
Models for machine learning and data mining in functional programming,
J. Functional Programming (JFP), 15(1), pp.15-32,
doi:10.1017/S0956796804005301,
Jan. 2005.
 C. S. Wallace,
Statistical & Inductive Inference by MML,
Springer, Information Science and Statistics,
isbn:038723795X, 2005.
 L. Allison,
Coding Ockham's Razor,
Springer, isbn13:9783319764320, 2018.
For a hypothesis H and data D we have from Bayes:
P(H&D) = P(H).P(D|H) = P(D).P(H|D)
 P(H), prior probability of hypothesis H
 P(H|D), posterior probability of hypothesis H
 P(D|H), likelihood of the hypothesis,
actually a function of the data given H.
From Shannon's Mathematical Theory of Communication (1949) we know
that in an optimal code, the message length of an event E, MsgLen(E),
where E has probability P(E), is given by
MsgLen(E) = -log_{2}(P(E)):
MsgLen(H&D)
= MsgLen(H)+MsgLen(D|H)
= MsgLen(D)+MsgLen(H|D)
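As a small numerical check of these relations, the following Python sketch converts probabilities to message lengths in bits. The function name msg_len and the probabilities are made up for illustration only; they do not come from any of the works listed above.

```python
import math

def msg_len(p: float) -> float:
    """Length in bits of an optimal code word for an event of probability p."""
    return -math.log2(p)

# Hypothetical numbers, chosen only for illustration.
p_h = 0.25           # prior, P(H)
p_d_given_h = 0.5    # likelihood, P(D|H)
p_joint = p_h * p_d_given_h   # P(H&D) = P(H).P(D|H)

joint_bits = msg_len(p_joint)
two_part_bits = msg_len(p_h) + msg_len(p_d_given_h)

# Multiplying probabilities corresponds to adding message lengths.
print(joint_bits, two_part_bits)
```

Because the logarithm turns products into sums, the joint message length equals the sum of the two parts.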
Now in inductive inference one often wants the hypothesis H
with the largest posterior probability.
MsgLen(H) can usually be estimated well,
for some reasonable prior on hypotheses.
MsgLen(D|H) can also usually be calculated.
Unfortunately it is often impractical to estimate P(D)
which is a pity because it would yield P(H|D).
However, for two rival hypotheses, H and H',
MsgLen(H|D) - MsgLen(H'|D)
= MsgLen(H) + MsgLen(D|H)
 - MsgLen(H') - MsgLen(D|H')
= -log_{2} of the posterior odds ratio, P(H|D)/P(H'|D)
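The comparison of two rival hypotheses can be sketched in Python for coin tossing. Everything here is an assumption made for illustration: the two hypotheses (a fair coin and a coin with P(head) = 0.75), the equal priors of 0.5, and the data of 8 heads and 2 tails. P(D) is never needed because it cancels in the difference.

```python
import math

def msg_len(p: float) -> float:
    """Optimal code length in bits for an event of probability p."""
    return -math.log2(p)

def explanation_bits(prior: float, heads: int, tails: int, p_head: float) -> float:
    """MsgLen(H) + MsgLen(D|H) for one particular sequence of tosses."""
    data_bits = heads * msg_len(p_head) + tails * msg_len(1.0 - p_head)
    return msg_len(prior) + data_bits

heads, tails = 8, 2   # hypothetical data

bits_fair = explanation_bits(0.5, heads, tails, p_head=0.5)
bits_biased = explanation_bits(0.5, heads, tails, p_head=0.75)

# A difference in message lengths is a log posterior odds ratio in bits.
diff_bits = bits_fair - bits_biased
odds_biased_over_fair = 2.0 ** diff_bits

print(f"fair: {bits_fair:.2f} bits, biased: {bits_biased:.2f} bits")
print(f"posterior odds, biased over fair: {odds_biased_over_fair:.2f}")
```

The hypothesis with the shorter total message is the one with the larger posterior probability.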
Consider
a transmitter T and a receiver R
connected by one of Shannon's communication channels.
T must transmit some data D to R.
T and R may have previously agreed on a code book for hypotheses,
using common knowledge and prior expectations.
If T can find a good hypothesis, H, (theory, structure, pattern, ...) to fit
the data then she may be able to transmit the data economically.
An explanation is a two part message:
(i) transmit H taking MsgLen(H) bits, and
(ii) transmit D given H taking MsgLen(D|H) bits.
The message paradigm keeps us "honest":
Any information that is not common knowledge
must be included in the message for it to be decipherable by the receiver;
there can be no hidden parameters.
This issue extends to inferring (and stating) real-valued parameters
to the "appropriate" level of precision.
The method is "safe":
If we use an inefficient code it can only make the hypothesis look less
attractive than otherwise.
There is a natural hypothesis test:
The null-theory corresponds to transmitting the data "as is".
(That does not necessarily mean in 8-bit ASCII;
the language must be efficient.)
If a hypothesis cannot better the null-theory then it is not acceptable.
A more complex hypothesis fits the data better than a simpler model,
in general.
We see that MML encoding gives a trade-off between hypothesis complexity,
MsgLen(H), and the goodness of fit to the data, MsgLen(D|H).
The MML principle is one way to justify and realise Occam's razor.
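The hypothesis test against the null-theory can be illustrated with a rough Python sketch for coin tosses. The set-up is assumed, not taken from the sources: the null-theory sends one bit per toss, and the code book for hypotheses is 16 equally likely candidate biases, so that stating one costs 4 bits.

```python
import math

def msg_len(p: float) -> float:
    """Optimal code length in bits for an event of probability p."""
    return -math.log2(p)

def null_bits(heads: int, tails: int) -> float:
    """Null-theory (assumed here): each toss is sent "as is", one bit per toss."""
    return float(heads + tails)

def explanation_bits(heads: int, tails: int, k: int = 4) -> float:
    """Two-part message: k bits name one of 2**k candidate biases
    (a hypothetical code book), then the tosses are coded under that bias."""
    centres = [(i + 0.5) / 2 ** k for i in range(2 ** k)]
    data = min(heads * msg_len(p) + tails * msg_len(1.0 - p) for p in centres)
    return k + data

def acceptable(heads: int, tails: int, k: int = 4) -> bool:
    """A hypothesis is acceptable only if it beats the null-theory."""
    return explanation_bits(heads, tails, k) < null_bits(heads, tails)

print(acceptable(3, 1))     # very little data
print(acceptable(30, 10))   # same proportions, ten times the data
```

With only four tosses the cost of stating the bias outweighs the saving on the data, so the null-theory wins; with forty tosses in the same proportions the biased-coin hypothesis gives the shorter message.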
Continuous Real-Valued Parameters
When a model has one or more continuous, real-valued parameters
they must be stated to an "appropriate" level of precision.
The parameter must be stated in the explanation, and only a finite number
of bits can be used for the purpose, as part of MsgLen(H).
The stated value will often be close to the maximum-likelihood value
which minimises MsgLen(D|H).
If the negative log likelihood, MsgLen(D|H), varies rapidly for small changes
in the parameter, the parameter should be stated to high precision.
If the negative log likelihood varies only slowly with changes in the parameter,
the parameter should be stated to low precision.
The simplest case is the
multi-state
or multinomial distribution where the data is a sequence of independent
values from such a distribution.
The hypothesis, H, is an estimate of the probabilities of the various
states (e.g. the bias of a coin or a die).
The estimate must be stated to an "appropriate" precision,
i.e. in an appropriate number of bits.
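The effect of precision can be seen in a crude Python experiment for a coin. It is only a sketch under simplifying assumptions: the bias is stated as one of 2**k equally likely cell centres in (0,1), costing k bits, and the best centre is chosen for the data actually seen. It is not the calculation used in the MML literature, which treats the choice of precision more carefully.

```python
import math

def data_bits(heads: int, tails: int, p: float) -> float:
    """MsgLen(D|H) in bits for a sequence of tosses under stated bias p."""
    return -(heads * math.log2(p) + tails * math.log2(1.0 - p))

def total_bits(heads: int, tails: int, k: int) -> float:
    """k bits to state the bias to one of 2**k cell centres, plus the data."""
    centres = ((i + 0.5) / 2 ** k for i in range(2 ** k))
    return k + min(data_bits(heads, tails, p) for p in centres)

def best_precision(heads: int, tails: int, max_k: int = 12) -> int:
    """Number of bits of precision giving the shortest two-part message."""
    return min(range(max_k + 1), key=lambda k: total_bits(heads, tails, k))

few = best_precision(70, 30)        # 100 hypothetical tosses
many = best_precision(7000, 3000)   # 10000 tosses, same proportions

print(few, many)
```

With more data the negative log likelihood is more sharply peaked around its minimum, so it pays to spend more bits on the parameter; with little data a coarse statement is enough.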
Basics:
Probability, and
Information, and
Coding.
Bayesian inference.
Discrete: Including
Multinomial and
binomial,
or multi-state distributions,
e.g., in strings, sequences and finite state automata.
Integers.
Continuous: Including the
Normal probability distribution,
factor analysis, and
von Mises
(circular) probability distribution.
Structured: Including
HMMs (Hidden Markov Models),
Decision-Trees and
Graphs
(also [Bib] search for Dtree, or Dgraph,
or Megalithic stone circle).
Classification,
clustering or mixture modelling, including
a short note on Snob.
(Also
J.D.Patrick, SNOB: A program for discriminating between classes,
[TR 91/151] with
[abstract].)
Bayesian Networks:
Bayesian networks,
Mixed Bayesian Networks ACSC2006
on structured models, BNs, continuous and discrete variables,
CaMML Hybrid Bayesian networks (2006)
on local structure.
 Causal modelling.
 Linear regression, curve and line fitting.
 Machine learning.
 Bioinformatics:
Molecular Biology applications,
e.g. string or sequence analysis & alignment,
multiple alignment and evolutionary trees.

