1. (a) Briefly,
define the following terms as they apply in
`Learning and Prediction I':
 Joint probability.
 Independent (as used with probability).
 Bayes's theorem.
 Information.
[10 marks]
(b) Four tetrahedral dice are placed in a cloth bag.
The dice are identical in appearance and in weight.
Three of the dice are unbiased,
Pr(1) = Pr(2) = Pr(3)
= Pr(4) = 1/4 = 0.25.
One of the dice is biased,
Pr(1) = 1/2 = 0.5,
Pr(2) = 1/4 = 0.25,
Pr(3) = Pr(4) = 1/8 = 0.125.
The bag is shaken and one dice is drawn out.
The selected dice is rolled four times with these results:
1, 4, 1, 2.
 What is the prior probability of a fair dice versus a biased dice?
 What is the likelihood of 1,4,1,2 for a fair dice, and
for a biased dice?
 What is the posterior probability of a fair dice versus a biased dice?
[10 marks]

>notes> 

2. (a) For a multistate probability distribution, P(.),
define how to calculate the `Entropy', and
give an explanation of what the quantity represents.
[5 marks]
(b) What is the entropy of the following distribution for a 6sided dice?
[4 marks]
1  2  3  4  5  6 
1/2  1/4 
1/16  1/16  1/16  1/16 
(c) For multistate probability distributions,
define the `Kullback Leibler' (KL) distance
from the "true" probability distribution P(.)
to another probability distribution Q(.).
[6 marks]
(d) Below are two probability distributions, P and Q, for dice.
Calculate the KL distance from P to Q.
[5 marks]
 1  2  3  4  5  6 
P:  1/6  1/6  1/6  1/6  1/6  1/6 
Q:  1/2  1/4  1/16  1/16  1/16  1/16 


3.
You are observing a multiplayer game that is played with
a silver coin, `S', and a gold coin, `G'.
Each turn involves throwing the two coins.
You notice that <H,H> and <T,T> seem to occur rather often
and <H,T> and <T,H> rather infrequently.
You watch the game, collect some data on the coins (right) and
decide to use MML to analyse it
(of course a real data set would be much larger).
(a) You model the data as a mixture of one or more classes.
Each class is defined by a bivariate model.
Each bivariate model consists of two 2state distributions, i.e.
one 2state distribution for S and another for G.
Carefully list everything that must be specified in a
message (i) for a oneclass "mixture" and the data,
and (ii) for a twoclass mixture and the data.
Estimate the message lengths (the parts and totals)
for the best oneclass and the best 2class mixtures. Show working.
(You may use "total assignment" of observations to classes
in order to simplify calculations.)
[15 marks]
(b) One could convert the data as follows:
<H,H> >0,
<H,T> >1,
<T,H> >2,
<T,T> >3
and could model the transformed
univariate data by a 4state distribution.
In this case, does it now make sense to talk about
a oneclass v. a twoclass model, or not?
If so, why and what are the consequences, otherwise why not?
[5 marks]
 
trial  S  G 
1  H  H 
2  T  T 
3  T  H 
4  H  H 
5  T  T 
6  H  H 
7  H  T 
8  T  T 


(a) Data nicely arranged for Snob:
`trial' is the `thing' number,
`S' is attribute 1, and
`G' is attribute 2.
You could try the data through Snob
(replicate the data a few thousand(?) times
to a realistic size).
The 1class model will have
<Pr(S=H)=0.5, Pr(G=H)=0.5>.
(The factorial method lets you estimate the cost of model and data.)
The 2class model can bias class1, say, towards <H,H> and
class2 towards <T,T>.
Put in some plausible numbers and . . .
(b) No: Note that a mixture of two or more
4state distributions is a 4state distribution.
They are indistinguishable.
(The same is not true of mixtures of Normal distributions, for example.)

4.
(a) Consider classification (decision) trees over multivariate data
with discrete and/or continuous attributes to be tested
and a discrete (`class') attribute to be predicted.
Describe an efficient scheme to encode
(i) a tree, and
(ii) the data given the tree.
[10 marks]
 A multivariate data set has the following attributes:
 att 1: M, F  gender.
 att 2: continuous  age measured to one decimal(!) place, e.g. 42.3.
 att 3: g, v, w, j  4 possible values.
 att 4: boolean
 att 5: boolean  the `class',
i.e. the attribute to be predicted.
 There are 661 data items.
(b) Describe the encoding of the following tree (without data)
in reasonable detail, and
estimate its message length. Show working. State any assumptions.
[10 marks]
(c) The flow of the data
through the tree is shown in the diagram below.
Estimate the message length of (attribute 5 of)
the data in each of the two leaf nodes
below the node that tests attribute 1. Show working.
[5 marks]
(d) Estimate the change in total message length
(both tree and datatree)
if the two leaves and the att 1 testnode,
as identified in part (c),
were collapsed to one leaf; show working.
Should the change be made?
Why or why not?
[5 marks]
© 6/2003, L. Allison,
School of Computer Science and Software Engineering,
Monash University, Australia 3800.
Created with "vi (Linux & IRIX)", charset=iso88591

Note that attribute 2 is continuous, i.e. it could be "reused",
that is tested multiple times in a path from the root to a leaf.
The cost of transmitting a cutpoint, such as `41.1', depends on
whether it is the median (1bit), a quartile (3bits),
an octile (5bits), etc.,
of the subset of the data reaching that test.
Note that attribute 3 is 4valued, and that using 1bit
for fork/leaf is only optimal for binary trees.
