^CSE454^ [01] >>

Information etc.


  • Information
  • Entropy
  • Kullback Leibler distance
  • Coding
  • Minimum Message Length (MML) inference
  • Estimators
  • Inference v. prediction

CSE454 2005 : This document is online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   and contains hyper-links to other resources - Lloyd Allison ©.

^CSE454^ << [02] >>


Defn: The information learned from an event `A' of probability P(A) is I(A) = -log2 P(A) bits.

  • P(A) = 1 => I(A) = 0
  • P(B) = 0 => I(B) = ∞ bits(!)
  • P(C) = 1/2 => I(C) = 1 bit
  • also measured in nits (aka nats):   -loge P(A)     NB. the maths is often easier in base e
    (Alan Turing used base 10, the `ban', ≈3.32 bits, and the deciban -- Hodges, Alan Turing: the Enigma, p197.)
  • Learning: Zero bits of information are learned if the event is already known, i.e. P = 1
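The definition can be sketched directly (a minimal illustration; the function name is ours):

```python
import math

def information(p, base=2):
    """Information learned from an event of probability p.
    base=2 gives bits; base=math.e gives nits (nats)."""
    if p == 0:
        return math.inf     # an impossible event would carry infinite information
    return -math.log(p, base)

print(information(1.0))          # 0.0 : nothing learned from a certain event
print(information(0.5))          # 1.0 : one bit
print(information(0.5, math.e))  # ~0.693 nits
```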
^CSE454^ << [03] >>


Entropy is the average information in a probability distribution over a sample (data) space S.

   H = - SUMv in S { P(v) log2 P(v) }
  • -log2 P(v) is the information in v
  • H is an average (weighted by P(v))
^CSE454^ << [04] >>

e.g. Fair coin: P(head) = P(tail) = 1/2;
H = -1/2 log2(1/2) - 1/2 log2(1/2)
= 1/2 + 1/2 = 1 bit

e.g. Biased coin: P(head) = 3/4, P(tail) = 1/4;
H = -3/4 log2(3/4) - 1/4 log2(1/4)
= 3/4{2 - log23} + 2/4
= 2 - 3/4 log23 < 1 bit
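Both coin calculations can be checked mechanically (a small sketch; `entropy` is our own helper):

```python
import math

def entropy(dist):
    """H = -SUM p log2 p over a discrete distribution (zero terms contribute 0)."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.75, 0.25]))  # biased coin: 2 - (3/4) log2 3, about 0.811 bits
```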

^CSE454^ << [05] >>

Theorem H1

If {pi}i=1..N and {qi}i=1..N are probability distributions then

  SUMi=1..N { - pi log2(qi) }
is minimised when qi = pi, for all i.
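Theorem H1 can be illustrated numerically: the quantity -SUM pi log2 qi is the average message length when data from {pi} is coded as if it came from {qi}, and it is smallest when q matches p (an illustration, not a proof; the names are ours):

```python
import math

def cross_entropy(p, q):
    """-SUM p_i log2 q_i : avg message length coding data from p with a code for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.75, 0.25]
for q in ([0.5, 0.5], [0.9, 0.1], [0.75, 0.25]):
    print(q, cross_entropy(p, q))
# minimised at q = p, where it equals the entropy of p
```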

^CSE454^ << [06] >>

Kullback Leibler distance

Defn: The K-L distance (also relative entropy) of prob' dist'n {qi} from {pi} is

     N         pi
  SUM pi log2( -- )  >= 0
   i=1         qi
NB. >=0 by H1. Not necessarily symmetric. (Use an integral for continuous prob' dist'ns.)
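A small numeric sketch of the definition, showing both properties noted above (the helper name is ours):

```python
import math

def kl(p, q):
    """K-L distance of {q_i} from {p_i}, in bits (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.75, 0.25], [0.5, 0.5]
print(kl(p, q), kl(q, p))  # both >= 0, but unequal: K-L is not symmetric
print(kl(p, p))            # 0 when the distributions coincide
```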
^CSE454^ << [07] >>


We work with prefix codes: No code word is a prefix of another code word.

      Sender     ----------->   Receiver

 source     ----->  (0|1)   ----->   source
 symbols*   encode          decode   symbols*

Average code-word length, i.e. expected message length, is SUMv in S { P(v) |code(v)| }.
It cannot be less than the entropy H; this lower bound is achieved if |code(v)| = -log2 P(v).
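For example, for the biased coin above (P(head) = 3/4), a prefix code over pairs of tosses brings the average length close to the entropy (the code here is hand-picked for illustration):

```python
import math

# Biased coin, P(H) = 3/4; encode pairs of tosses with a prefix code.
probs = {'HH': 9/16, 'HT': 3/16, 'TH': 3/16, 'TT': 1/16}
code  = {'HH': '0',  'HT': '10', 'TH': '110', 'TT': '111'}

avg_len = sum(probs[s] * len(code[s]) for s in probs)  # bits per pair
H = -sum(p * math.log2(p) for p in probs.values())     # entropy per pair
print(avg_len / 2, H / 2)  # about 0.844 v. 0.811 bits per single toss
```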

^CSE454^ << [08] >>
Shannon (1948) Mathematical Theory of Communication
1. Any uniquely decodable code for S has average length >= H.
2. There exists a prefix code for S with average length <= H+1.

Huffman (1952)
algorithm to construct code

Arithmetic coding
gets arbitrarily close to coding lower bound

compression = modelling + coding,   and coding is "solved"
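Huffman's algorithm can be sketched with a priority queue: repeatedly merge the two least probable subtrees (a minimal version; real implementations build an explicit tree and handle ties more carefully):

```python
import heapq

def huffman(probs):
    """Huffman (1952): an optimal prefix code from symbol probabilities."""
    # heap entries: (probability, tie-breaker, partial code-word table)
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two least probable subtrees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))  # ...become siblings
        count += 1
    return heap[0][2]

print(huffman({'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}))
```

With these dyadic probabilities the code lengths are exactly -log2 P(v), so the average length equals the entropy (1.75 bits).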
^CSE454^ << [09] >>
probability, message length and coding
^CSE454^ << [10] >>

Minimum Message Length (MML)

If we want to learn (infer) a hypothesis (theory | model) or parameter estimate(s), H, from data D,

devise a code giving the shortest expected two-part message:
(i) H, and then (ii) D|H.

Note the trade-off: the complexity of H v. the complexity of D|H, both measured in bits; it prevents over-fitting.

This includes the precision used to state continuous-valued parameters.
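The trade-off can be made concrete with a toy two-part code for n coin tosses: part (i) names one of g candidate values for P(head), costing log2 g bits, and part (ii) encodes the data using the named value. This is only a caricature: real MML derives the optimal precision from the prior and the Fisher information rather than searching a grid.

```python
import math

def two_part_length(k, n, g):
    """Toy two-part message length, in bits, for n tosses with k heads,
    stating P(head) as one of g equally spaced grid values."""
    part1 = math.log2(g)  # cost of naming the hypothesis
    part2 = min(-k * math.log2(p) - (n - k) * math.log2(1 - p)
                for j in range(g) for p in [(j + 0.5) / g])
    return part1 + part2

for g in (2, 8, 16, 64, 1024):
    print(g, round(two_part_length(50, 100, g), 2))
# a coarse grid makes part (ii) long, a fine grid makes part (i) long:
# the total is shortest at an intermediate precision
```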

^CSE454^ << [11] >> . . . MML

Sender-receiver paradigm keeps us "honest"

      Sender     ----------->   Receiver
  1. Sender and receiver get together and agree on priors, code-book, algorithms etc. Can discuss expected data, but have no "real" data.
  2. Sender and receiver are separated.
  3. Sender gets real data.
  4. Sender uses code-book, algorithms etc. to encode data.
  5. Receiver gets encoded data.
  6. Receiver uses code-book, algorithms etc. to decode data.

No hidden information passed "under the table". Safe: Cannot make a hypothesis look better than it really is.
Data-compression techniques can be used to design a "good" code if an optimal code is not known.

^CSE454^ << [12] >>


An estimator is a function from (a training set, i.e.) the sample (data) space, S, to the hypothesis (model, parameter) space H.

  e: S ---> H
  • maximum likelihood estimator
  • minimum message length (MML) estimator
  • minimum expected Kullback-Leibler distance (min EKL) estimator
  • maximum posterior estimator
  • etc. ...

Different estimators may give different results in different situations.
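For a Bernoulli parameter with a uniform prior this is easy to see on one data set; the formulas below are as commonly quoted, and the MML form should be taken as indicative rather than a derivation:

```python
def ml(k, n):         # maximum likelihood estimate of P(head)
    return k / n

def post_mean(k, n):  # posterior mean under a uniform prior (Laplace's rule)
    return (k + 1) / (n + 2)

def mml(k, n):        # MML estimate for a Bernoulli parameter, uniform prior
    return (k + 0.5) / (n + 1)

k, n = 0, 5           # five tosses, no heads
print(ml(k, n), post_mean(k, n), mml(k, n))
# ML says heads are impossible; the others keep some probability for heads
```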

^CSE454^ << [13] >>


Prediction is different from inference.

Inference (learning) is about finding an explanation (hypothesis, model, parameter estimate) for the data; prediction is about predicting future data, and an explanation might or might not help with that.

The best thing that can be done in prediction is to use an average over all hypotheses weighted by their posterior probabilities.
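A toy illustration with two candidate coins (the priors and data are assumed for the example): the predictive probability averages over the posterior rather than committing to the single best hypothesis.

```python
# Two hypotheses for a coin's P(head), with equal prior probabilities.
priors = {0.5: 0.5, 0.75: 0.5}
k, n = 7, 10  # observed data: 7 heads in 10 tosses

likelihood = {p: p ** k * (1 - p) ** (n - k) for p in priors}
evidence = sum(priors[p] * likelihood[p] for p in priors)
posterior = {p: priors[p] * likelihood[p] / evidence for p in priors}

predict = sum(posterior[p] * p for p in posterior)  # P(next toss is a head)
best = max(posterior, key=posterior.get)            # single best hypothesis
print(posterior)
print(predict, best)  # the averaged prediction lies between 0.5 and 0.75
```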

© 2005 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.