
### Computer Science and Software Engineering, Monash University, Australia, CSE454 Learning and Prediction I: Data Mining, (semester 1), 13th June 2003

• Duration 1.5 hours + 10 minutes reading time.
• Attempt all questions.
• This paper counts for 60% of the subject's assessment. Maximum possible marks here: 90.
• This is a closed book exam.
• No calculators.
• The attached [tables] give sufficient accuracy for any estimates of probabilities and message lengths. Where appropriate, show a numerical result as a simplified expression and also as an estimated value, and state any assumptions.

1. (a) Briefly, define the following terms as they apply in `Learning and Prediction I':

1. Joint probability.
2. Independent (as used with probability).
3. Bayes's theorem.
4. Information.
[10 marks]

(b) Four tetrahedral dice are placed in a cloth bag. The dice are identical in appearance and in weight. Three of the dice are unbiased, Pr(1) = Pr(2) = Pr(3) = Pr(4) = 1/4 = 0.25.  One of the dice is biased, Pr(1) = 1/2 = 0.5, Pr(2) = 1/4 = 0.25, Pr(3) = Pr(4) = 1/8 = 0.125. The bag is shaken and one dice is drawn out. The selected dice is rolled four times with these results:  1, 4, 1, 2.

1. What is the prior probability of a fair dice versus a biased dice?
2. What is the likelihood of 1,4,1,2 for a fair dice, and for a biased dice?
3. What is the posterior probability of a fair dice versus a biased dice?
[10 marks]
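A quick numerical check of part (b) can be sketched with exact fractions; the priors, per-face probabilities, and rolls below are taken directly from the question, and the rest is just Bayes's theorem:

```python
from fractions import Fraction as F

# Priors: 3 fair dice and 1 biased dice drawn at random from the bag.
prior_fair, prior_biased = F(3, 4), F(1, 4)

# Per-face probabilities for each kind of dice.
fair   = {1: F(1, 4), 2: F(1, 4), 3: F(1, 4), 4: F(1, 4)}
biased = {1: F(1, 2), 2: F(1, 4), 3: F(1, 8), 4: F(1, 8)}

rolls = [1, 4, 1, 2]

# Likelihood of the observed rolls under each hypothesis.
like_fair, like_biased = F(1), F(1)
for r in rolls:
    like_fair *= fair[r]
    like_biased *= biased[r]

# Bayes's theorem: posterior is proportional to prior x likelihood.
evidence = prior_fair * like_fair + prior_biased * like_biased
post_fair = prior_fair * like_fair / evidence
post_biased = prior_biased * like_biased / evidence

print(like_fair, like_biased)    # 1/256 1/128
print(post_fair, post_biased)    # 3/5 2/5
```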


2. (a) For a multi-state probability distribution, P(.), define how to calculate the `Entropy', and give an explanation of what the quantity represents.

[5 marks]

(b) What is the entropy of the following distribution for a 6-sided dice?

[4 marks]
```
   value:  1    2    3    4    5    6
   Pr:    1/2  1/4  1/16  1/16  1/16  1/16
```
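As a sanity check on (b): the entropy of a multi-state distribution is -sum p_i log2 p_i bits, which for this distribution works out neatly because every probability is a power of two:

```python
from math import log2

# The 6-sided dice distribution from part (b).
p = [1/2, 1/4, 1/16, 1/16, 1/16, 1/16]

# Entropy in bits: -sum p_i * log2(p_i).
entropy = -sum(pi * log2(pi) for pi in p)
print(entropy)   # 2.0 bits
```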

(c) For multi-state probability distributions, define the `Kullback Leibler' (KL) distance from the "true" probability distribution P(.) to another probability distribution Q(.).

[6 marks]

(d) Below are two probability distributions, P and Q, for dice. Calculate the KL distance from P to Q.

[5 marks]
```
   value:  1    2    3    4    5    6
   P:     1/6  1/6  1/6   1/6   1/6   1/6
   Q:     1/2  1/4  1/16  1/16  1/16  1/16
```
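For (d), the KL distance from P to Q is sum over i of P(i) log2(P(i)/Q(i)); a short numerical sketch using the two distributions above:

```python
from math import log2

P = [1/6] * 6                                  # the uniform "true" distribution
Q = [1/2, 1/4, 1/16, 1/16, 1/16, 1/16]         # the other distribution

# KL distance from P to Q, in bits.
kl = sum(p * log2(p / q) for p, q in zip(P, Q))
print(kl)   # roughly 0.58 bits
```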

3. You are observing a multi-player game that is played with a silver coin, `S', and a gold coin, `G'. Each turn involves throwing the two coins. You notice that <H,H> and <T,T> seem to occur rather often and <H,T> and <T,H> rather infrequently. You watch the game, collect some data on the coins (below) and decide to use MML to analyse it (of course a real data set would be much larger).

(a) You model the data as a mixture of one or more classes. Each class is defined by a bivariate model. Each bivariate model consists of two 2-state distributions, i.e. one 2-state distribution for S and another for G.
Carefully list everything that must be specified in a message (i) for a one-class "mixture" and the data, and (ii) for a two-class mixture and the data.
Estimate the message lengths (the parts and totals) for the best one-class and the best 2-class mixtures. Show working. (You may use "total assignment" of observations to classes in order to simplify calculations.)

[15 marks]

(b) One could convert the data as follows: <H,H> -->0, <H,T> -->1, <T,H> -->2, <T,T> -->3 and could model the transformed univariate data by a 4-state distribution. In this case, does it now make sense to talk about a one-class v. a two-class model, or not? If so, why and what are the consequences, otherwise why not?

[5 marks]

```
   trial   S   G
     1     H   H
     2     T   T
     3     T   H
     4     H   H
     5     T   T
     6     H   H
     7     H   T
     8     T   T
```

(a) Data nicely arranged for Snob: `trial' is the `thing' number, `S' is attribute 1, and `G' is attribute 2.
You could try the data through Snob (replicate the data a few thousand(?) times to a realistic size).

The 1-class model will have <Pr(S=H)=0.5, Pr(G=H)=0.5>. (The factorial method lets you estimate the cost of model and data.)
The 2-class model can bias class-1, say, towards <H,H> and class-2 towards <T,T>.
Put in some plausible numbers and . . .
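One rough, illustrative version of "plausible numbers" for the data parts only (my assumptions, not the official answer: the (nH+1/2)/(n+1) parameter estimates, total assignment by the value of S, and the omission of the model-statement and class-assignment costs):

```python
from math import log2

# The eight observed trials; each string is the <S,G> pair.
data = ["HH", "TT", "TH", "HH", "TT", "HH", "HT", "TT"]

def nll_bits(items, pS_H, pG_H):
    """Cost in bits of stating the S,G pairs under one bivariate model."""
    bits = 0.0
    for s, g in items:
        bits += -log2(pS_H if s == "H" else 1.0 - pS_H)
        bits += -log2(pG_H if g == "H" else 1.0 - pG_H)
    return bits

# 1-class: Pr(S=H) = Pr(G=H) = 0.5, so every coin outcome costs 1 bit.
one_class = nll_bits(data, 0.5, 0.5)                      # 16.0 bits

# 2-class, total assignment: split on the value of S, and estimate each
# class's parameters as (nH + 1/2) / (n + 1).
c1 = [d for d in data if d[0] == "H"]                     # HH, HH, HH, HT
c2 = [d for d in data if d[0] == "T"]                     # TT, TH, TT, TT
two_class = nll_bits(c1, 4.5 / 5, 3.5 / 5) + nll_bits(c2, 0.5 / 5, 1.5 / 5)
print(one_class, two_class)
```

The 2-class data part comes out cheaper, but on only eight trials the extra parameters and the roughly 1 bit per item of class assignment can easily wipe out the saving; hence the suggestion to replicate the data to a realistic size.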

(b) No: Note that a mixture of two or more 4-state distributions is a 4-state distribution. They are indistinguishable. (The same is not true of mixtures of Normal distributions, for example.)
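The point in (b) can be seen directly: a weighted mixture of two 4-state distributions is just another vector of four non-negative probabilities summing to one, i.e. itself a 4-state distribution. A tiny sketch with made-up class distributions:

```python
# Two hypothetical 4-state class distributions and a mixing weight.
p1 = [0.7, 0.1, 0.1, 0.1]
p2 = [0.1, 0.1, 0.1, 0.7]
w = 0.5

# The mixture's probabilities: non-negative and summing to 1,
# so the mixture is indistinguishable from a single 4-state model.
mix = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
print(mix)
```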

4. (a) Consider classification- (decision-) trees over multivariate data with discrete and/or continuous attributes to be tested and a discrete (`class') attribute to be predicted.
Describe an efficient scheme to encode (i) a tree, and (ii) the data given the tree.

[10 marks]

A multivariate data set has the following attributes:
att 1: M, F   -- gender.
att 2: continuous   -- age measured to one decimal(!) place, e.g. 42.3.
att 3: g, v, w, j   -- 4 possible values.
att 4: boolean
att 5: boolean   -- the `class', i.e. the attribute to be predicted.
There are 661 data items.

(b) Describe the encoding of the following tree (without data) in reasonable detail, and estimate its message length. Show working. State any assumptions.

[10 marks]

(c) The flow of the data through the tree is shown in the diagram below. Estimate the message length of (attribute 5 of) the data in each of the two leaf nodes below the node that tests attribute 1. Show working.

[5 marks]

(d) Estimate the change in total message length (both tree and data|tree) if the two leaves and the att 1 test-node, as identified in part (c), were collapsed to one leaf; show working. Should the change be made? Why or why not?

[5 marks]

© 6/2003, L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.

Note that attribute 2 is continuous, i.e. it could be "reused", that is tested multiple times in a path from the root to a leaf.
The cost of transmitting a cut-point, such as `41.1', depends on whether it is the median (1 bit), a quartile (3 bits), an octile (5 bits), etc., of the subset of the data reaching that test.
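That cost grows by 2 bits per level of rank refinement; a small sketch of the pattern stated above (the level-numbering convention, with the median as level 1, is my assumption):

```python
def cutpoint_cost_bits(level):
    """Bits to state a cut-point at a given rank resolution:
    level 1 = the median (1 bit), level 2 = a quartile (3 bits),
    level 3 = an octile (5 bits), ... i.e. 2*level - 1 bits."""
    return 2 * level - 1

for level, name in [(1, "median"), (2, "quartile"), (3, "octile")]:
    print(name, cutpoint_cost_bits(level))
```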

Note that attribute 3 is 4-valued, and that using 1-bit for fork/leaf is only optimal for binary trees.