Inductive Inference is detective work.
It is the business of trying to put order on a mass of data,
i.e. coming up with an explanation for it.
The data are often contradictory and contain measurement errors
and other uncertainties.
People do inference all of the time;
it is one of the hallmarks of intelligence.
Scientists do it when they form theories from experimental data.
The trouble is that an unlimited number of theories could explain
some given facts. Which is the best one?
William of Ockham (1285-1349)
is often credited with saying something like,
"if two theories explain the facts equally well then the simpler
theory is to be preferred".
This principle is known as Ockham's razor.
The razor is rather vague, but it can be made precise
and Minimum Message Length encoding (MML) does
just that:

  P(H&D) = P(H).P(D|H) = P(D).P(H|D)      - Bayes

  msgLen(E) = -log2(P(E))                 - Shannon

hence

  msgLen(H&D) = msgLen(H) + msgLen(D|H)

Bayes' theorem
shows that the hypothesis, H, that best explains the data, D,
i.e. the hypothesis with the largest posterior probability, P(H|D),
is the one maximising P(H&D) - since P(D) is fixed for the given data -
and so minimising msgLen(H)+msgLen(D|H);
hence the terminology `Minimum Message Length',
after Shannon's `mathematical theory of communication' (1948).
Note that prediction is a related but different problem from explanation:
optimal predictions require correctly weighting the predictions
of many competing hypotheses.
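The two-part message length can be illustrated with a toy calculation. All numbers below are invented for the example, including the assumed 5 bits to state the bias parameter; a full MML treatment chooses the parameter's stating precision optimally rather than fixing it.

```python
import math

def msg_len(p):
    """Shannon: an event of probability p costs -log2(p) bits."""
    return -math.log2(p)

# Invented example: 100 coin flips, 60 of them heads.
# H1: the coin is fair.  H2: the coin has bias 0.6 - a parameter that
# must itself be transmitted (5 bits is an assumed stating cost).
n, heads = 100, 60

def data_len(p_head):
    """msgLen(D|H): cost of the flips given the hypothesis."""
    return -(heads * math.log2(p_head) + (n - heads) * math.log2(1 - p_head))

len_h1 = msg_len(0.5) + data_len(0.5)        # prior 0.5, no parameter to state
len_h2 = msg_len(0.5) + 5.0 + data_len(0.6)  # prior 0.5, plus parameter bits
best = min((len_h1, "fair"), (len_h2, "biased"))
# H2 fits the 60/40 data slightly better, but its extra parameter bits
# cost more than the fit gains: the fair coin wins here
```

With 60 heads in 100 the simpler hypothesis still gives the shorter total message; with a more extreme count the biased hypothesis would pay for its extra parameter.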
Techniques for producing practical machine learning programs
that apply MML to many inference problems have been
developed in Computer Science at Monash University over 30+ years.
Some of these developments are described below.

Classification

Keywords: class, cluster, generation-X, group, species, type, variation

Snob is a classification program.
It relies on MML to form the best set of classes (groups, clusters,
species, types, ...) to describe a given set of "things"
(individuals, items, measurements, observations, ...).
The first application of Snob was to a collection of measurements
taken from 217 seal skulls.
Snob found seven classes which correspond well to the male and female groups
of those species that were represented in reasonable numbers -
remember that Snob knows nothing about biology and was not told
the sex of the seals.
Since then, Snob has been applied to data from
earth sciences, medicine, molecular biology, psychology and many other areas.
- See:
  - C. S. Wallace & D. M. Boulton.
    An information measure for classification.
    The Computer Journal, 11(2), pp.185-194, August 1968.
  - C. S. Wallace & P. R. Freeman.
    Estimation and inference by compact coding.
    Jrnl. Royal Stat. Soc., 49(3), pp.240-265, 1987.
NB. The problem addressed is also known as
unsupervised classification,
(numerical) taxonomy,
mixture modelling (in statistics), and
clustering (in machine learning).
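The core idea - trading the cost of describing classes against the cost of describing the data given those classes - can be sketched in a few lines. This is not Snob's actual coding scheme: the fixed 8 bits per parameter, the floor on sigma, and the ignored measurement precision are all simplifications of this sketch.

```python
import math

BITS_PER_PARAM = 8.0  # assumed cost of stating one class parameter; a crude
                      # stand-in for Snob's precision-based parameter costs

def gauss_bits(xs, mu, sigma):
    """Data part: -log2 likelihood of xs under N(mu, sigma),
    ignoring the measurement-precision constant."""
    c = math.log2(sigma * math.sqrt(2 * math.pi))
    return sum(c + ((x - mu) ** 2) / (2 * sigma ** 2) / math.log(2) for x in xs)

def message_bits(groups):
    """Two-part cost: class labels + class parameters + data given classes."""
    n = sum(len(g) for g in groups)
    bits = n * math.log2(len(groups)) if len(groups) > 1 else 0.0
    for xs in groups:
        mu = sum(xs) / len(xs)
        sigma = max(0.5, (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5)
        bits += 2 * BITS_PER_PARAM + gauss_bits(xs, mu, sigma)
    return bits

data = [1.0, 1.2, 0.9, 1.1, 9.8, 10.1, 10.0, 9.9]
one_class = message_bits([data])                 # one broad Gaussian
two_class = message_bits([data[:4], data[4:]])   # two tight Gaussians
# the two-class message is shorter on this clearly bimodal data,
# even though it must pay for extra parameters and the class labels
```

Snob additionally searches over the number of classes and the assignment of things to classes, rather than being handed the split.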

Sequences and Time Series

Keywords: change point, looks familiar, segment, series, sequence, what's next, 0,1,1,0,1,1,?

Sequence or (time-) series data
may contain hidden structure (correlations).
The MML classification method above has been extended to allow
the `class' of a thing (item, measurement, observation, ...)
to be influenced by other things,
e.g. in a Markov model or a "repeats" model.
The new methods allow the true class structure to be recovered
from less data than before and indicate where, and why,
class membership (probably) changed.
(A thing can have many attributes;
an attribute can be drawn from normal (Gaussian),
Poisson, von-Mises, etc. distributions.)
- See:
  T. Edgoose & L. Allison.
  MML Markov classification of sequential data.
  Stats. and Comp., 9(4), pp.269-278, 1999.
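The gain from modelling serial correlation can be sketched by coding the same sequence two ways. For brevity the model part (stating the probabilities themselves) is omitted here, and the Markov coder does not code the first symbol; a full MML treatment includes both.

```python
import math
from collections import Counter

def iid_bits(s):
    """Cost of s under an order-0 model using observed symbol frequencies."""
    counts, n = Counter(s), len(s)
    return -sum(c * math.log2(c / n) for c in counts.values())

def markov_bits(s):
    """Cost of s[1:] under a first-order Markov model:
    each symbol is coded given the previous symbol."""
    pairs = Counter(zip(s, s[1:]))
    ctx = Counter(s[:-1])
    return -sum(c * math.log2(c / ctx[a]) for (a, b), c in pairs.items())

s = "0110110110110110"
# the serial correlation (a '1' after '0' is certain, and '1' is followed
# by '0' or '1' equally often) lets the Markov model code s more cheaply
```

On this sequence the order-0 coder needs about 15.3 bits while the Markov coder needs 10, which is the kind of hidden structure the extended classification method exploits.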
The question of
what is a good model for sequential data from an unknown source
is relevant to data compression, inductive inference, and prediction.
Lempel-Ziv based models (1976) asymptotically converge (in performance)
to a wide class of models and this gives them a claim to be a good model
to use on an unknown source; however, the convergence is slow
and can require a large amount of data.
The approximate-repeats model converges more rapidly
on DNA and other difficult data and can be used to find better
explanations of the data:
- See:
  L. Allison, T. Edgoose & T. I. Dix.
  Compression of strings with approximate repeats.
  Intell. Sys. in Mol. Biol., pp.8-16, 1998.
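The link between repeats and compressibility can be demonstrated with the general-purpose zlib compressor, used here only as a rough stand-in for a repeats-aware model:

```python
import random
import zlib

random.seed(0)
repetitive = ("GATTACA" * 40).encode()                     # 280 bases, one motif repeated
noisy = bytes(random.choice(b"ACGT") for _ in range(280))  # 280 independent random bases

rep_len = len(zlib.compress(repetitive, 9))
noisy_len = len(zlib.compress(noisy, 9))
# the LZ-style coder exploits the repeated substring, so rep_len is
# far smaller; the random sequence cannot be compressed much below
# 2 bits per base
```

Real genomic repeats are approximate (mutated) rather than exact, which is why the approximate-repeats model outperforms exact-match coders on DNA.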
There are many measures of the similarity of two sequences:
longest common subsequence (LCS), edit-distance, time-warps, etc.
These measures generally assume that each sequence (series) is random,
except possibly for a bias in the distribution of observations,
i.e. each observation is assumed to be independent of the others.
It has been shown how to incorporate a wide class of "models"
of sequences into similarity measures:
- See:
  - D. R. Powell, L. Allison, T. I. Dix & D. L. Dowe.
    Alignment of low information sequences.
    Aust. Comp. Sci. Theory Symp. '98, pp.215-230, Springer Verlag, 1998.
  - L. Allison, D. Powell & T. I. Dix.
    Compression and Approximate Matching.
    Computer Jrnl., 42(1), pp.1-10, 1999.
  - D. R. Powell, L. Allison & T. I. Dix.
    Modelling alignment for non-random sequences.
    17th ACS Australian Joint Conf. on AI (AI2004),
    Springer-Verlag, LNCS/LNAI 3339, pp.203-214, 2004.
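For reference, this is the classic unit-cost edit-distance dynamic program that the papers above generalise; in the model-based versions the unit costs below are replaced by -log2 probabilities taken from a model of the sequences, so that alignments of low-information sequences are not over-valued.

```python
def edit_distance(a, b):
    """Classic O(mn) dynamic program for edit distance with unit costs."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # delete
                          d[i][j - 1] + 1,                           # insert
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # match/change
    return d[m][n]

# edit_distance("kitten", "sitting") == 3
```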

Decision Trees and Graphs

Keywords: expert system, noise, rule extraction, simple v. complex, supervised learning, uncertainty

A decision tree represents a certain type of decision-function,
based on tests on the attributes (properties, variables, ...)
of a thing (case, item, observation, ...).
The tree structure can be drawn and examined, and
it represents a kind of explanation of the data.
Decision trees are used in expert systems to "learn"
how a human expert does what she or he does well, given a training set
of examples.
There is a very large number of possible decision trees
for a given data set, and MML can be used to choose a best one,
balancing the size or complexity of the tree against
the accuracy of its predictions and thus avoiding the well-known
problem of over-fitting (i.e. fitting the noise in the data).
Wallace and Patrick showed how to improve on previous attempts
to judge this trade-off.
- See:
  C. S. Wallace & J. D. Patrick.
  Coding decision trees.
  Machine Learning, 11, pp.7-22, 1993.
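A toy version of the trade-off follows. The stating costs (4 bits per leaf distribution, 2 bits per test) are invented for the sketch and are not the Wallace-Patrick coding, but the behaviour is the same: a split is accepted only when it shortens the total message.

```python
import math

LEAF_BITS = 4.0   # assumed cost of stating a leaf's class distribution
SPLIT_BITS = 2.0  # assumed cost of describing one test (attribute, threshold)

def leaf_cost(pos, neg):
    """Leaf description plus its class labels coded at observed frequencies."""
    n = pos + neg
    bits = LEAF_BITS
    for c in (pos, neg):
        if c:
            bits -= c * math.log2(c / n)
    return bits

def split_cost(left, right):
    return SPLIT_BITS + leaf_cost(*left) + leaf_cost(*right)

# 20 training examples, 10 of each class:
no_split = leaf_cost(10, 10)               # one mixed leaf: ~24 bits
clean = split_cost((10, 0), (0, 10))       # a perfectly separating split
noisy = split_cost((6, 4), (4, 6))         # a split that mostly fits noise
# clean < no_split: the informative split pays for itself;
# noisy > no_split: MML rejects it, avoiding over-fitting
```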
Decision Graphs are generalisations of decision trees.
They model the same class of decision functions,
but can express disjunctive conditions ("or")
much more economically.
This often means that a better inference can be made given less data, and
that the resulting graph is easier to visualise than the corresponding tree.
- See:
  J. Oliver.
  Decision graphs - an extension of decision trees.
  4th Int. Conf. on Artificial Intelligence, pp.343-359, 1993.
NB. A decision tree (graph) is more correctly known as a
classification tree (graph) and the problem addressed is then called
supervised classification.

Causal Models, Bayesian Networks

Keywords: cause & effect, correlation, expert, graph, network, why is it so?

Causal models (Bayesian networks)
are used by medical, biological and social scientists
to explain a large variety of phenomena,
e.g. does smoking cause lung cancer, or
does increased public spending on
education
lead to a long-term increase in national wealth?
Bayesian networks represent the probabilistic relationships between
variables (attributes) in graph-based models.
They can be used to represent any joint probability function.
They are particularly useful when the direct (causal)
relationships between variables are sparse,
because the update procedures for accommodating new observations are efficient.
Approximate inference techniques are available for
complex, non-sparse domains.
Dynamic Bayesian networks can be used to model relationships
that are dependent on previous states of the model.
Example applications include
diagnosis of telecommunication faults,
pipeline monitoring and control,
and medical diagnosis.
Bayesian networks are beginning to be widely used
in expert systems as an alternative to neural networks.
MML is used to cost the models and perform a trade-off
between model complexity and accuracy of fit to the observational data.
Both real-valued and discrete-valued models can be discovered
using one of the following search strategies:
greedy algorithms, genetic algorithms, stochastic sampling.
- See:
  - C. S. Wallace & K. B. Korb.
    Learning causal models by MML sampling.
    In Causal Models and Intelligent Data Management,
    A. Gammerman (ed), Springer Verlag, 1998.
  - R. T. O'Donnell, L. Allison & K. B. Korb.
    Learning Hybrid Bayesian Networks by MML.
    Springer Verlag, LNCS Vol.4304, pp.192-203, 2006.
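A minimal sketch of why sparse networks help: the joint distribution factorises along the arcs, so few numbers are needed. The variable names and probabilities below are invented for illustration (echoing the smoking example above).

```python
# Hypothetical network: Smoking -> Cancer, Smoking -> Bronchitis.
# The sparse factorisation needs 5 probabilities, versus the 7 independent
# entries of a full joint table over three binary variables.
p_smoking = 0.3
p_cancer_given = {True: 0.05, False: 0.01}   # P(Cancer=true | Smoking)
p_bronch_given = {True: 0.25, False: 0.05}   # P(Bronchitis=true | Smoking)

def joint(s, c, b):
    """P(Smoking=s, Cancer=c, Bronchitis=b) via the network factorisation."""
    def term(p, x):
        return p if x else 1.0 - p
    return (term(p_smoking, s) *
            term(p_cancer_given[s], c) *
            term(p_bronch_given[s], b))

total = sum(joint(s, c, b)
            for s in (True, False)
            for c in (True, False)
            for b in (True, False))
# total == 1.0: the factorisation defines a valid joint distribution
```

MML's contribution is to choose among alternative network structures by trading the cost of stating the arcs and their probability tables against the fit to the observed data.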

Megalithic Stone Circles

There are many stone circles in the British Isles.
It turns out that most are not perfectly circular and elaborate theories
have been proposed to explain the deviations from circularity.
The question is, are these elaborate theories correct, or
are the "circles" simply inaccurate - made
by ancient peoples with rough measuring instruments on rough ground?
MML comes down firmly in favour of the latter - they are just rough circles.
- See:
  - J. D. Patrick.
    An Information Measure Comparative Analysis of Megalithic Geometries.
    PhD Thesis, Computer Science, Monash University, 1979.
  - J. D. Patrick & C. S. Wallace.
    Stone circle geometries: an information theory approach.
    In Archaeoastronomy in the Old World,
    D. Heggie (ed), C.U.P., 1982.

Bayesian Poker Player

Poker is a good Inductive Inference problem.
You only have partial information
and your opponents are actively trying to deceive you.
Are they bluffing?
It is easy to calculate the probability that a good hand or a bad hand
was dealt to someone.
But your opponents know that you can do that.
You know that they know that you can do that, and so on.
The Poker project is a test-bed for machine learning methods
such as Bayesian networks.
(Bayes published
`An Essay Towards Solving a Problem in the Doctrine of Chances'
in 1763.)

Molecular Biology

Molecular-Biology data contain experimental error and
random "noise" which make analysis difficult.
MML can deal with error and noise and, for example,
has been applied to classifying dihedral angles in proteins
to throw light on typical protein structures.
Understanding protein structure is important in medicine and drug design.
[Figure: phi v. psi graph (phipsi.gif)]
Phi and psi are the two free dihedral-angles per amino-acid
on a protein backbone.
The peaks represent the centres of classes found by the program.
- See:
T. Edgoose, L. Allison & D. L. Dowe.
An MML classification of protein structure that knows about angles and sequences.
Pacific Symposium on Biocomputing '98, pp.585-596, January 1998.
MML has been used in other computer programs for
Molecular Biology.

Computer Science

The MML research group
carries out research into Inductive Inference.
Computer Science units in
Computer Programming,
Algorithms and Data Structures,
Foundations of C.S. and
Artificial Intelligence
give a good background for this work.
Also see
C. S. Wallace's publications.
-- LA 1995--2005