[01]
Information etc. -- Introduction
CSE454
2005

This document is online at
http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/
and contains hyperlinks to other resources.

[02]
Information
Defn: The information in learning of an event `A' of probability P(A) is
  I(A) = -log_{2} P(A) bits.
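As a quick check of the definition, a minimal sketch in Python (the function name `information` is my own, not from the notes):

```python
import math

def information(p):
    """Information, in bits, gained on learning that an event of probability p occurred."""
    return -math.log2(p)

# Rarer events carry more information.
print(information(0.5))    # 1.0 bit  (e.g. a fair coin toss)
print(information(0.25))   # 2.0 bits
print(information(0.125))  # 3.0 bits
```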

[03]
Entropy
Entropy is the average information in a probability distribution over a sample (data) space S:
  H = - SUM_{v in S} P(v) log_{2} P(v)

[04]
e.g. Fair coin: P(head) = P(tail) = 1/2, so H = 1 bit.
e.g. Biased coin: P(head) = 3/4, P(tail) = 1/4, so H ~ 0.81 bits.
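The two coin entropies can be checked with a short sketch (the `entropy` helper is mine, not from the notes):

```python
import math

def entropy(dist):
    """H = -sum of p * log2(p) over the distribution, in bits; 0*log(0) terms are skipped."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

fair   = [0.5, 0.5]    # fair coin
biased = [0.75, 0.25]  # biased coin
print(entropy(fair))    # 1.0 bit per toss
print(entropy(biased))  # ~0.811 bits per toss: the biased coin is less uncertain
```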
[05]
Theorem H1
If {p_{i}}_{i=1..N} and {q_{i}}_{i=1..N} are probability distributions, then
  - SUM_{i=1..N} p_{i} log_{2}(q_{i})
is minimised when q_{i} = p_{i}.
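A small numerical check of Theorem H1 (the names and the candidate distributions are my own choices for illustration):

```python
import math

def expected_code_len(p, q):
    """-sum of p_i * log2(q_i): expected message length when p-distributed
    data is coded using code lengths derived from q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.75, 0.25]
# Try several candidate q; the minimum occurs at q = p, as the theorem says.
candidates = [[0.5, 0.5], [0.6, 0.4], [0.75, 0.25], [0.9, 0.1]]
lengths = [expected_code_len(p, q) for q in candidates]
best = candidates[lengths.index(min(lengths))]
print(best)  # [0.75, 0.25]
```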
[06]
Kullback-Leibler distance
Defn: The KL distance (also relative entropy) of prob' dist'n {q_{i}} from {p_{i}} is
  SUM_{i=1..N} p_{i} log_{2}(p_{i} / q_{i}) >= 0
NB. >= 0 by H1. Not necessarily symmetric.
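Both properties (non-negativity and asymmetry) can be seen in a short sketch (the `kl` helper is my own name):

```python
import math

def kl(p, q):
    """KL distance (relative entropy) of q from p, in bits; >= 0, zero iff p == q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.75, 0.25]
q = [0.5, 0.5]
print(kl(p, q))  # ~0.189 bits
print(kl(q, p))  # ~0.208 bits -- not equal: KL is not symmetric
print(kl(p, p))  # 0.0
```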
[07]
Coding
We work with prefix codes: no code word is a prefix of another code word.
  Sender ---encode---> (0|1)* ---decode---> Receiver
  (source symbols*)                         (source symbols*)
Average codeword length, i.e. message length, is
  SUM_{v in S} P(v) |code(v)|
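The prefix property and the average codeword length can be illustrated with a toy code (the particular code, probabilities, and names are my own):

```python
# A prefix code: no codeword is a prefix of another, so the receiver can
# decode a concatenated bit stream without any separators.
code  = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

def encode(symbols):
    return ''.join(code[s] for s in symbols)

# Average codeword length: sum of P(v) * |code(v)|.
avg_len = sum(probs[s] * len(code[s]) for s in code)
print(encode('abad'))  # '0100111'
print(avg_len)         # 1.75 bits -- equals the entropy here, since every p is a power of 2
```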

[10]
Minimum Message Length (MML)
If we want to learn (infer) a hypothesis (theory, model) or parameter estimate(s), H, from data D, choose the H that minimises the length of a two-part message:
  msgLen(H) + msgLen(D|H) = -log_{2} P(H) - log_{2} P(D|H)
Note the tradeoff: complexity of H v. complexity of D|H, both measured in bits, prevents overfitting; this issue includes the precision used to state continuous-valued parameters.
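A minimal two-part-message sketch for inferring a coin's bias, assuming (my choice, for illustration only) a uniform prior over 16 bias values stated to 4-bit precision:

```python
import math

def two_part_length(h, data):
    """msgLen(H) + msgLen(D|H), in bits, for a coin-bias hypothesis h."""
    part1 = 4.0  # 4 bits to state one of 16 equally likely hypotheses
    part2 = -sum(math.log2(h if toss == 'H' else 1 - h) for toss in data)
    return part1 + part2

data = 'HHHTHHHHTH'  # 8 heads, 2 tails
hypotheses = [(i + 0.5) / 16 for i in range(16)]  # midpoints, avoiding p = 0 and p = 1
best = min(hypotheses, key=lambda h: two_part_length(h, data))
print(best)  # 0.78125 -- the stated-to-4-bits hypothesis nearest 8/10
```

With the hypothesis precision fixed, part 1 is constant here; in full MML the precision itself is part of the tradeoff.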
[11]
. . . MML
The sender-receiver paradigm keeps us "honest":
  Sender ---------------> Receiver
No hidden information is passed "under the table".
Safe: cannot make a hypothesis look better than it really is.
[12]
Estimators
An estimator is a function from (a training set, i.e.) the sample (data) space, S, to the hypothesis (model, parameter) space, H:
  e: S -> H
e.g. different estimators may give different results in different situations.
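Two concrete estimators for a coin's bias, sketched with my own names (the Laplace rule is one standard alternative to maximum likelihood); same data, different results:

```python
# Each estimator maps a sample (a string of 'H'/'T' tosses) to a hypothesis (a bias).

def ml_estimate(data):
    """Maximum-likelihood estimator: #heads / #tosses."""
    return data.count('H') / len(data)

def laplace_estimate(data):
    """Laplace estimator: (#heads + 1) / (#tosses + 2)."""
    return (data.count('H') + 1) / (len(data) + 2)

data = 'HHHT'
print(ml_estimate(data))       # 0.75
print(laplace_estimate(data))  # ~0.667 -- pulled toward 1/2 by the pseudo-counts
```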
[13]
Prediction
Prediction is different from inference. Inference (learning) is about finding an explanation (hypothesis, model, parameter est') for the data; prediction is about predicting future data, and an explanation might or might not help. The best that can be done in prediction is to use an average over all hypotheses, weighted by their posterior probabilities.

© 2005 L. Allison, School of Computer Science and Software Engineering,
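The posterior-weighted average can be sketched as follows (the three-hypothesis space, uniform prior, and names are assumed for illustration):

```python
def posterior_weights(hypotheses, data):
    """Posterior over hypotheses: uniform prior * likelihood, normalised."""
    heads, tails = data.count('H'), data.count('T')
    likelihoods = [h**heads * (1 - h)**tails for h in hypotheses]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

hypotheses = [0.25, 0.5, 0.75]  # candidate coin biases
data = 'HHHT'
weights = posterior_weights(hypotheses, data)

# Predict P(next toss is a head) by averaging over all hypotheses,
# weighted by posterior probability, rather than committing to the best one.
p_head = sum(w * h for w, h in zip(weights, hypotheses))
print(p_head)  # ~0.63, versus 0.75 from the single best hypothesis
```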