
# 1: Introduction

• Learning: Given data, D, learn (or fit, or estimate) a hypothesis, H (or (parameters of) a distribution, model, or theory), for D.

• Prediction: Given input attribute(s) (or variable(s)), possibly trivial, predict output attribute(s).

• Data: Observations (values) drawn from some (sample or data) space.

• Typically
• a statistical model is a formal mathematical model,
• a machine learning method may be based on a complex model & use approximations,
• data mining emphasises (very) large data sets, efficient & maybe ad hoc methods.
Much overlap, and different terminology!
Notation and terminology . . .

CSE454 2005 : This document is online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   and contains hyper-links to other resources - Lloyd Allison ©.

# Sample space

The sample space is the set of possible outcomes of some experiment.

• e.g. examine the 1st element of a gene; sample space = Base = {A, C, G, T}
• but e.g. examine the DNA sequence of HUMHBB; sample space = Base*

An event is a subset, possibly singular, of the sample space, e.g. purine = {A, G}.

NB. The term data space is often used in machine learning.
# Random variables

A random variable, X, takes values, with probabilities, from the sample space.

Write P(X=A), or just P(A), etc.

e.g. P(X=A) ≈ 0.4 for Plasmodium falciparum [*]
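As a minimal sketch of a random variable over Base and an event as a subset of the sample space — only the ≈0.4 figure for A comes from the notes; the other base frequencies below are illustrative assumptions chosen to sum to 1:

```python
# Sketch: a random variable X over the sample space Base = {A, C, G, T}.
# Only P(A) = 0.4 is taken from the notes; T, C, G are assumed values.
base_probs = {'A': 0.4, 'T': 0.3, 'C': 0.15, 'G': 0.15}

def p(event):
    """P(event), where an event is a subset of the sample space."""
    return sum(base_probs[b] for b in event)

purine = {'A', 'G'}
print(p({'A'}))   # P(X=A)
print(p(purine))  # P(purine) = P(A) + P(G)
```

An event's probability is just the sum over its outcomes, so the singleton event {A} recovers P(X=A) directly.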

# Inference

People often distinguish between

• selecting a model class,
• selecting a model from a class,
• estimating the parameters of a model.
e.g.
• model class = polynomials
• fully-parameterized model = 3x² − 4.5x + 7.2
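The class/model/parameter distinction can be sketched in code (the helper `make_poly` is hypothetical, not from the notes): the model class is "polynomials", and one coefficient vector picks out a fully-parameterized model from that class.

```python
# The model class "polynomials": each coefficient vector is one
# fully-parameterized model.  (Helper name is illustrative only.)
def make_poly(coeffs):
    # coeffs[k] is the coefficient of x**k
    return lambda x: sum(c * x**k for k, c in enumerate(coeffs))

model = make_poly([7.2, -4.5, 3.0])  # the model 3x^2 - 4.5x + 7.2
print(model(1.0))                    # 3 - 4.5 + 7.2
```

Estimating the parameters of a model would mean choosing the coefficient vector to fit data; selecting a model class would mean choosing, say, polynomials versus some other family.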

# Bayes

If B1, B2, ..., Bk is a partition of a set B (of causes) then

```
                   P(A|Bi).P(Bi)
P(Bi|A) = -------------------------------    i = 1, 2, ..., k
          P(A|B1).P(B1)+...+P(A|Bk).P(Bk)
```
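The formula translates directly into code; a minimal sketch (the function name is assumed, not from the notes):

```python
def posterior(i, priors, likelihoods):
    """P(Bi|A) = P(A|Bi).P(Bi) / sum over j of P(A|Bj).P(Bj)."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    return joint[i] / sum(joint)

# Two equally likely causes; A is twice as likely under B1 as under B2,
# so the posterior on B1 is 2/3.
print(posterior(0, [0.5, 0.5], [0.5, 0.25]))
```

Note that the denominator is the same for every i, so the posteriors over the partition always sum to 1.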
. . . applied to data D and hypotheses Hi:

```
P(D) = P(D|H1).P(H1) + ... + P(D|Hk).P(Hk)

P(Hi|D) = P(D|Hi).P(Hi) / P(D)          posterior

P(Hi|D)   P(D|Hi).P(Hi)
------- = -------------                 posterior odds-ratio
P(Hj|D)   P(D|Hj).P(Hj)
```
P(Hi)     prior probability of Hi
P(Hi|D)   posterior probability of Hi
P(D|Hi)   likelihood

NB. Can ignore the priors in the posterior odds-ratio if, and only if, P(Hi) = P(Hj). Maximum likelihood can cause problems when the priors are unequal.

# Example

C1, a fair coin, P(H) = P(T) = 0.5.

C2, a biased coin, P(H) = 2/3, P(T) = 1/3.

One of the coins is thrown 4 times, giving H, T, T, H.

Which coin was thrown? H1 : was C1.   H2 : was C2.

Prior: P(C1) = P(C2) = 0.5.

Likelihood: P(HTTH | C1) = 1/16   and   P(HTTH | C2) = 4/9 . 1/9 = 4/81.

Posterior odds-ratio: P(C1|HTTH) / P(C2|HTTH) = (1/16 . 1/2) / (4/81 . 1/2) = 81/64.
Now P(C1|HTTH) + P(C2|HTTH) = 1, and if x/(1-x) = 81/64 then 64.x = 81 - 81.x, so x = 81/145, i.e. P(C1|HTTH) = 81/145.

This case is simple because the model space is discrete, in fact finite (just two models).
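The arithmetic can be checked exactly with Python's fractions module (a sketch added here, not part of the original notes):

```python
from fractions import Fraction

half = Fraction(1, 2)
l1 = half ** 4                                   # P(HTTH | C1) = 1/16
l2 = Fraction(2, 3) ** 2 * Fraction(1, 3) ** 2   # P(HTTH | C2) = 4/81
evidence = l1 * half + l2 * half                 # P(HTTH)

print(l1 / l2)             # posterior odds-ratio, 81/64
print(l1 * half / evidence)  # P(C1 | HTTH), 81/145
```

Using exact rationals avoids any floating-point doubt about results like 81/145.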

# e.g. prediction

Know P(C1|HTTH) = 81/145,   P(C2|HTTH) = 64/145.

The more likely coin is C1.

If we assumed the coin really was C1, we would predict P(H) = 0.5 in future.

But the coin might be C2.

Should predict   P(H) = 81/145 . 1/2 + 64/145 . 2/3 = (243 + 256) / (145 . 6) = 499/870 ≈ 0.57

i.e. use a weighted average of the hypotheses.
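The weighted-average prediction can also be checked exactly (a sketch, using the posteriors stated above):

```python
from fractions import Fraction

p_c1 = Fraction(81, 145)   # posterior on the fair coin C1
p_c2 = Fraction(64, 145)   # posterior on the biased coin C2

# Weighted average of each hypothesis's prediction of heads.
p_head = p_c1 * Fraction(1, 2) + p_c2 * Fraction(2, 3)
print(p_head, float(p_head))  # 499/870, about 0.574
```

The point carries over generally: averaging predictions over the posterior beats committing to the single most probable hypothesis.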


# Conclusion

We have looked at
• data
• models, parameters
• priors, likelihood, posterior
• inference
• prediction
simple examples!

© 2005 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.