CSE454 2002 : This document is online at http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/ and contains hyper-links to other resources - Lloyd Allison ©.
Defn: The information in learning that an event `A' of
probability P(A) has occurred is I(A) = -log2(P(A)) bits.
Entropy is the average information in a probability
distribution over a sample (data) space S.
Defn:
H = - SUM_{v in S} P(v) log2(P(v))
e.g. Fair coin: P(head) = P(tail) = 1/2;
H = -1/2 log2(1/2) - 1/2 log2(1/2)
= 1/2 + 1/2
= 1 bit
e.g. Biased coin: P(head) = 3/4, P(tail) = 1/4;
H = -3/4 log2(3/4) - 1/4 log2(1/4)
  = 3/4 (2 - log2(3)) + 2/4
  = 2 - 3/4 log2(3)
  < 1 bit
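As a quick check (a minimal Python sketch, not part of the original notes), the entropy formula applied to the two coins above:

import math

def entropy(probs):
    # Shannon entropy, in bits, of a discrete probability distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))    # fair coin: 1.0 bit
print(entropy([3/4, 1/4]))    # biased coin: ~0.811 bits = 2 - (3/4)log2(3)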
H1: if {p_i}, i=1..N, and {q_i}, i=1..N, are probability distributions then
SUM_{i=1..N} { -p_i log2(q_i) } is minimised when q_i = p_i.
Defn: The K-L distance (also relative entropy) of prob' dist'n {q_i} from {p_i} is
D(p||q) = SUM_{i=1..N} p_i log2(p_i / q_i) >= 0.
NB. >= 0 by H1. Not necessarily symmetric.
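A small Python sketch of the K-L distance (an illustration, not from the notes), showing the asymmetry:

import math

def kl(p, q):
    # K-L distance D(p||q) in bits; >= 0 by H1, and not symmetric in general
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/2]
q = [3/4, 1/4]
print(kl(p, q))  # ~0.208 bits
print(kl(q, p))  # ~0.189 bits -- a different value, so not symmetric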
We work with prefix codes: No code word is a prefix of another code word.
Sender ----------------------------------------> Receiver
source symbols --encode--> (0|1)* --decode--> source symbols
Average code-word length, i.e. average message length, is
SUM_{i} p_i * |c_i|, where |c_i| is the length of the code word for symbol i.
It cannot be less than the entropy H.
This lower bound is achieved if |c_i| = -log2(p_i) for every i,
i.e. if each p_i is a power of 1/2.
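For example (an illustrative sketch; the three-symbol source and its code words are assumed, not from the notes), a prefix code over probabilities that are all powers of 1/2 achieves the entropy exactly:

import math

probs = [1/2, 1/4, 1/4]
code  = ['0', '10', '11']   # prefix property: no code word is a prefix of another

avg_len = sum(p * len(c) for p, c in zip(probs, code))
H = -sum(p * math.log2(p) for p in probs)
print(avg_len, H)  # both 1.5 bits: the bound is met because len(c_i) = -log2(p_i)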
If we want to learn (infer) a hypothesis (theory | model) or parameter estimate(s), H, from data D,
transmit the data as a two-part message: first state H, then state D assuming H is true, so
msgLen(H & D) = msgLen(H) + msgLen(D|H);
the best hypothesis is the one minimising this total message length.
Note the trade-off: the complexity of H v. the complexity of D|H, both measured in bits, prevents over-fitting;
this issue includes the precision used to state continuous-valued parameters.
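A hedged Python sketch of that trade-off (illustrative only: the data of 55 heads in 100 tosses, and a uniform prior over the stated parameter values, are assumptions, not from the notes). The hypothesis is a coin's bias stated to `bits` bits of precision; too coarse a statement makes D|H long, too fine a statement makes H itself long:

import math

def two_part_length(k, n, bits):
    # msgLen(H) + msgLen(D|H) for k heads in n tosses, with the bias p
    # stated to `bits` bits (assuming a uniform prior over the stated values)
    grid = [(i + 0.5) / 2 ** bits for i in range(2 ** bits)]
    p = min(grid, key=lambda g: abs(g - k / n))   # rounded estimate; msgLen(H) = bits
    return bits + (-k * math.log2(p) - (n - k) * math.log2(1 - p))

for b in range(1, 11):
    print(b, round(two_part_length(55, 100, b), 2))
# the total falls, bottoms out at a moderate precision, then rises again:
# exactly the complexity-of-H v. complexity-of-D|H trade-off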
Sender-receiver paradigm keeps us "honest"
Sender -----------> Receiver
No hidden information passed "under the table".
Safe: Cannot make a hypothesis look better than it really is.
Data-compression techniques can be used to design
a "good" code if an optimal code is not known.
An estimator is a function from the sample (data) space, S, to the hypothesis (model, parameter) space H.
e: S ---> H
e.g. different estimators may give different results
in different situations.
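For instance (a sketch with made-up data, assuming a Bernoulli/coin model), two reasonable estimators that map the same sample to different points of H:

def mle(data):
    # maximum-likelihood estimator e: S ---> H for a coin's bias
    return sum(data) / len(data)

def laplace(data):
    # Laplace's rule of succession: a different estimator e': S ---> H
    return (sum(data) + 1) / (len(data) + 2)

sample = [1, 1, 1]            # three heads, no tails
print(mle(sample))            # 1.0 -- would bet the coin *never* shows tails
print(laplace(sample))        # 0.8 -- more cautious on so little data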
Prediction is different from inference.
Inference (learning) is about finding an explanation (hypothesis, model, parameter est') for the data; prediction is about predicting future data, and an explanation may or may not help with that.
The best that can be done in prediction is to use an average over all hypotheses weighted by their posterior probabilities.
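A toy Python sketch of such posterior-weighted prediction (the two hypotheses and their posterior probabilities are assumed numbers, purely for illustration):

# two candidate coin models: (P(head | H_i), posterior P(H_i | data))
hyps = [(0.5, 0.6), (0.9, 0.4)]

best = max(hyps, key=lambda h: h[1])[0]            # predict with the single best H
averaged = sum(ph * post for ph, post in hyps)     # posterior-weighted average

print(best)      # 0.5
print(averaged)  # 0.66 -- generally a better predictor than committing to one H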