CSE454 2002

CSE454 Learning and Prediction I: Data Mining 29/4/2002

duration 1.5 hours + 10 minutes reading time.
Attempt all questions.
This paper counts for 60% of the subject's assessment.
Closed book.
No calculators.
The attached [tables] give sufficient accuracy for any estimates of probabilities and message lengths.

1. (a) Define the following terms as they apply to learning and prediction:

Prior probability.
Likelihood.
Posterior probability.
Bayes' theorem.

[10 marks]

(b) Three fair coins, Pr(H)=0.5, and one biased coin, Pr(H)=0.75, are placed in a cloth bag; all of the coins are identical in appearance and in weight. The bag is shaken and one coin is drawn out. The selected coin is thrown four times with these results: HHTH.

What is the prior probability of a fair coin versus a biased coin?
What is the likelihood of HHTH for a fair coin, and for the biased coin?
What is the posterior probability of a fair coin versus a biased coin?

[10 marks]

2. (a) For multi-state probability distributions, define the `Kullback Leibler' (KL) distance from the "true" probability distribution P₁ to another probability distribution P₂.

[8 marks]

(b) Below are two probability distributions, P and Q, over {A,C,G,T}. Calculate the KL distance from P to Q, and also from Q to P.

[12 marks]

-	A	C	G	T
P:	1/2	1/4	1/8	1/8
Q:	1/8	1/2	1/4	1/8

3. A library has a collection of 50 essays, all believed to have been written by author X. Professor Smith disagrees and believes that some of the essays were written by another person, author Y. Unfortunately no work exists that is absolutely certain to have been written by X, and no work exists that is absolutely certain to have been written by Y. It is assumed that each essay had a single author. It is a fact that different people tend to use words, including common words, with different frequencies. Smith gathers data from the essays, specifically the length of each essay and the number of times that each of the following common words appears in each essay: a, are, by, for, has, is, not, the, they, which.

(a) How would you apply Snob to this problem and this data? Why?

[10 marks]

(b) What signs would you look for in Snob's results to support the single-author hypothesis, or to support the two-author hypothesis, or other possible explanations of the data? Why?

[10 marks]

4. (a) Consider decision- (classification-) trees over multivariate data with discrete and/or continuous attributes to be tested and a discrete (`class') attribute to be predicted.
Describe an efficient scheme to encode (i) a tree, and (ii) the data given the tree.

[10 marks]

(b) A multivariate data set has the following attributes:: att 1: T, F i.e. a boolean attribute; att 2: T, F; att 3: T, F; att 4: T, F; att 5: a, b, c, d i.e. 4 possible values; att 6: +, - i.e. 2 possible values, positive and negative cases
att 6 is the `class', i.e. the attribute to be predicted.

If tree 1 and tree 2, below, are two possible models of the data, which tree is the better explanation of the data? Why?

[10 marks]

tree 1	tree 2

(The leaf nodes show the frequencies of positive and negative cases.)

Describe the encoding of this structure, including splits, and estimate its message length.

[10 marks]

(d) The tree models the new data as follows:

Describe how to encode the data given the tree, and give estimates (with working) for the data modelled by the left-most, [+:33,-:67], and the right-most, [+:1,-:9], of the four deepest leaves in the tree.

[10 marks]

Created with "vi (Linux & IRIX)", charset=iso-8859-1