The probability density function of
a normal distribution (Gaussian distribution),
N(μ, σ),
with mean μ and
standard deviationσ > 0,
for -∞ < x < ∞, is given below:
Above: N(0,1)
Probability density function:
f(x) =
1
- (x-μ)2 / 2.σ2
e
√(2.π) σ
and of course
-∞∫+∞ f(x) dx = 1
MML, Coding
This section takes extracts from, and paraphrases,
s4.2 in Wallace and Boulton (1968)
removing it from the classification context.
It does this to follow the historical development of MML
and because that particular paper is less widely known than it should be.
(Later, it was seen that the derivations
were a special case of a more general form using the
[Fisher]
information, and that the results could be obtained by
[using this general form].)
Data
Consider a sample
x1, x2, ..., xn
drawn from a normal distribution.
Each data measurement, xi, is given to an accuracy of
+/-epsilon/2.
Assuming that epsilon is small,
the probability density function of N(μ,σ)
can be approximated as a constant over
[xi-epsilon/2, xi+epsilon/2].
The message length for the sample is therefore:
where μ and σ are inferred values.
The message length is in nits, i.e. natural bits.
The crucial question is how accurately should μ and σ
be stated in the hypothesis, i.e. in the header before coding the sample data?
μ
The message length for the sample data (above) is minimised by setting
μ equal to the sample mean, m, but it requires an infinite number of
bits (or nits) to state this value in general.
μ can only be stated to a finite accuracy, +/-b/2, say.
Leaving σ fixed, for now,
the question now is, what is an optimum value for b?
We would like to be able to state μ as having the value m,
but in practice we can only state it within
[m-b/2 ... m+b/2].
It is also assumed that μ as known a priori (see below) to lie
in a range of size `a'.
It therefore takes -ln(b/a) nits to state μ to accuracy +/-b/2:
- ln(b/a)
σ
Similarly to μ, σ is supposed known a priori
to lie in [0 ... p].
and is stated to an accuracy of +/-c/2.
It takes -ln(c/p) nits to state σ to within +/-c/2:
- ln(c/p)
Note that the sample standard deviation, z, is given by
n 2
z = sqrt{ (SIGMA (xi - m) )/n }
i=1
This is a biased estimator for σ,
e.g. consider n equal to one or to two,
although there is little bias for large values of n.
Replacing the divisor, n, by (n-1) gives a less biased estimator.
Optimum Accuracy
The accuracy with which μ (+/-b/2) and σ (+/-c/2) should be stated
is found by adding together the message length contributions
from the data, μ & σ and choosing b and c to minimise the
expected total,
i.e. differentiate w.r.t. b (and c) and set the result to zero.
The total is:
b c n epsilon 1 |xi - mu|2
-ln{-} - ln{-} + SIGMA{-ln(----------------) + - |-------| }
a p i=1 sqrt(2 pi) sigma 2 | sigma |
bc epsilon n 1 |xi-m-(mu-m)|2
= -ln{--} - n ln(----------------) + SIGMA{ - |-----------| }
ap sqrt(2 pi) sigma i=1 2 | sigma |
and noting that
(xi-m-(μ-m))2
= (xi-m)2-2(xi-m)(μ-m)+(μ-m)2 and that
SIGMA1..n(xi-m) = 0
bc epsilon {z2 + (mu-m)2 }
= -ln{--} - n ln(----------------) + n -------------
ap sqrt(2 pi) sigma 2 sigma2
The expectation of
(μ-m)2 over [m-b/2, m+b/2],
and assuming a uniform distribution in this range,
is (1/b)[b3/3]-b/2..b/2
= (2 b3/24)/b = b2/12.
Drop this into the total, differentiate with respect to b and ...
1 b
- - + n --------- = 0
b 12 sigma2
b = sigma sqrt(12/n)
Substituting this value for b, the total message length is now:
c sigma 12 epsilon (z2 + sigma2/n)
-ln{-} - ln{-----sqrt(--)} - n ln{---------------} + n--------------
p a n sqrt(2 pi)sigma 2 sigma2
c sigma 12 epsilon n z2
= -ln{-} - ln{-----sqrt(--)} - n ln{---------------} + -------- + 1/2
p a n sqrt(2 pi)sigma 2 sigma2
To find the optimal value of σ, which interestingly turns out to differ from z,
differentiate with respect to σ and set to zero ...
n-1 n z2
----- = -------
sigma' sigma'3
sigma' = z sqrt(n/(n-1))
Now σ can only be stated to finite accuracy, i.e. as being in the range
[σ' - c/2 ... σ' + c/2].
Similarly to μ-m,
the expectation of (σ-σ')2 is c2/12.
Let σ = σ'(1+h) in
c sigma 12 epsilon n z2
-ln{-} - ln{-----sqrt(--)} - n ln{---------------} + -------- + 1/2
p a n sqrt(2 pi)sigma 2 sigma2
n z2 {terms }
= -ln(c) + (n-1) ln(1+h) + ---------------- + {independent}
2 sigma'2 (1+h)2 {of h }
(1-2h+3h2)
= -ln(c) + (n-1)(h-h2/2) + nz2---------- + O(h3) + {terms no h}
2sigma'2
assuming that the fractional error, h, in stating the standard deviation is small,
and always remembering that something else should be done if this is not the case.
Now, h = (σ - σ')/σ'
and the expectation,
E(h2) = c2/(12 σ'2)
and putting the latter into the message length gives an expected message length:
= -ln(c) + c2(n-1)/(12 sigma'2) + ...
Differentiate with respect to c and set to zero:
1 (n-1)
- = 2c ---------
c 12 sigma'2
c = sigma' sqrt(6/(n-1))
Demonstration
The HTML FORM below can be used to
generate a sample of N data items
at random from a normal distribution, N(μ, σ),
having a specified mean, μ, and standard deviation, σ.
The FORM also performs various calculations:
Maximum likelihood estimates of μ and σ.
Integration of N(μ, σ) over a limited range,
numerically and hence approximately.
Message lengths under fixed codes
where a given mean and standard deviation are
"common knowledge";
you can nominate estimates, μ' & σ',
for use in such a fixed code.
Two-part message length calculations from Wallace and Boulton (1968)
where inferred values of μ and σ
are stated to an optimum level of accuracy
and the data is then encoded using those stated values.
Warning: do not make N too large .... JavaScript
is interpreted after all.
L . A
NB. If μ' and σ' are set so that (some of) the values in the sample
are more than "a few" standard deviations away from μ',
then an infinite message length may be calculated
for fixed(μ', σ'),
given the limited accuracy of computer arithmetic.
On the other hand, it is relatively "safe"
to set σ' to be large compared to the sample S.D..
The form above forces the prior on μ
to be a region of at least a "few" standard
deviations around the mean.
Priors
It is necessary to have prior probability distributions
of some kind for σ and for μ in order to be able
to transmit them in the first part of the message.
In the HTML form above, it is initially assumed that μ is
in a range of 1000 and σ in the range [0..100];
these values can be changed in the form.
Uniform priors are used within the ranges,
although other distributions can be used in principle.
(The ranges for a class-attribute
in the original application to clustering by Wallace and Boulton (1968)
are functions of the statistics of the entire population.)
The use of priors is disliked by some,
so the following points should be made:
The computations are remarkably insensitive to the choice of prior,
although the latter can make some difference
particularly on small data-sets.
One can argue that if (expert) prior knowledge is available
then it should be used!
It is impossible to transmit μ and σ without some
kind of prior knowledge.
So-called `universal' codes imply priors nevertheless,
although sometimes the priors are implicit and hard to untangle.
Notes
All introductory books on probability and statistics will cover
the basic properties of the
normal distribution, e.g.:
P. L. Meyer.
Introductory Probability and Statistical Application.
Addison Wesley, 1970.
The study of the normal distribution (and the multi-state distribution)
in the context of (unsupervised) classification
(also known as clustering, numerical taxonomy, and mixture modelling)
by Wallace and Boulton is one of the first applications
of minimum message length (MML) encoding
to a practical machine-learning problem yielding a useful computer program:
See also the
Special Issue on Clustering and Classification,
the Computer Journal,
F. Murtagh (ed),41(8), 1998.
Contains several papers on classification,
marks the 30th anniversary of the Wallace & Boulton (1968)
paper, and also includes a new paper by Wallace on modelling
spatially correlated data.