[01]


A decision tree has *fork nodes*, which split the data and specify tests on input attributes, and *leaves*, which model the dependent attribute.

Expert systems: examples are given, e.g. case histories or expert judgements, and the learned tree mimics the expert's behaviour.

Consider binary attributes first; n-ary and continuous attributes come later.

This document is online at http://www.csse.monash.edu.au/~lloyd/Archive/2005-08-Tree01/index.shtml and contains hyper-links to other resources.

[02]

The engine turns over but will not run. Is it a problem with the ignition system (spark plugs, leads, coil, distributor or equivalent)?

We have a *training set* of cases, i.e. examples:

| case | 1: engine fires | 2: clicking noise | 3: petrol smell | 4: ignition problem | note |
|------|-----------------|-------------------|-----------------|---------------------|---------|
| 1    | y | n | n | n | fuel    |
| 2    | y | n | n | y | plugs   |
| 3    | y | y | n | y | leads   |
| 4    | n | y | n | y | leads   |
| 5    | y | n | y | n | flooded |
| 6    | n | n | y | n | leak    |
| 7    | n | n | y | y | plugs   |
| 8    | y | y | n | y |         |
| …    | … | … | … | … |         |

[03]

[04]

[05]

P(Data&Tree) = P(Tree) × P(Data|Tree)

MsgLen(Data&Tree) = MsgLen(Tree) + MsgLen(Data|Tree)

A simple tree has a high *prior* probability, i.e. a short message length.

A complex tree has a low prior probability, i.e. a long message length.
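The probability/message-length correspondence can be sketched in a few lines of Python (the function name is my own, for illustration):

```python
import math

def msg_len(p: float) -> float:
    """Optimal code length, in bits, for an event of probability p."""
    return -math.log2(p)

# Multiplying probabilities corresponds to adding message lengths,
# e.g. with assumed probabilities P(Tree) = 0.25 and P(Data|Tree) = 0.125:
p_tree, p_data_given_tree = 0.25, 0.125
assert math.isclose(msg_len(p_tree * p_data_given_tree),
                    msg_len(p_tree) + msg_len(p_data_given_tree))
```

This is why a prior probability over trees and the fit of the data to a tree can be traded off in the common currency of bits.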

These quantities are commensurable with MsgLen(Data|Tree), i.e. the fit.

[06]

Deal with this in three parts:

- topology,

- for each fork: split attribute,

- for each leaf: distribution parameters.

[07]

F F L F L L F L L

[08]

F F L F L L F L L

**NB.** It is well known that every binary tree has exactly one more leaf node than it has forks.

- Set P(F) = P(L) = 0.5,

- e.g. the given tree topology costs 9 bits.

- Can detect the end of the code when #L = #F + 1.
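A minimal sketch of this F/L prefix code (the helper names are mine, not from the slides):

```python
def topology_bits(code: str) -> int:
    """Each F or L costs -log2(0.5) = 1 bit when P(F) = P(L) = 0.5."""
    return len(code)

def prefix_complete(code: str) -> bool:
    """The code is self-terminating: a pre-order traversal is complete
    exactly when #L = #F + 1 (every binary tree has one more leaf than forks)."""
    forks = 0
    for i, c in enumerate(code):
        if c == 'F':
            forks += 1
        else:  # a leaf
            leaves = (i + 1) - forks
            if leaves == forks + 1:
                # complete here; valid only if no symbols follow
                return i == len(code) - 1
    return False

# The topology from the slide: F F L F L L F L L -> 9 symbols, 9 bits.
assert topology_bits("FFLFLLFLL") == 9
assert prefix_complete("FFLFLLFLL")
```

A decoder can therefore stop reading topology bits as soon as the leaf count first exceeds the fork count, without any explicit length field.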

[09]

- It takes log_2(#a) bits to specify one of #a attributes, assuming all are equally likely,

- e.g. log_2(3) bits for the root node of the given tree.

- A split attribute in a fork-node cannot be re-used in a subtree,

- e.g. log_2(2) = 1 bit for the next split attribute, and log_2(1) = 0 bits for a 3rd-level split.

- e.g. Total = log_2(3) + 2×log_2(2) + log_2(1) = log_2(3) + 2 bits.
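The split-attribute cost above can be checked with a small sketch (function name mine; it assumes equally likely attributes, as the slide does):

```python
import math

def split_cost_bits(n_available: int) -> float:
    """log2(#a) bits to name one of #a equally likely split attributes."""
    return math.log2(n_available)

# The given tree: the root chooses among 3 input attributes, each
# second-level fork among the remaining 2, and a third-level fork
# among the remaining 1 (which costs 0 bits):
total = split_cost_bits(3) + 2 * split_cost_bits(2) + split_cost_bits(1)
assert math.isclose(total, math.log2(3) + 2)
```

Note how the no-re-use rule shrinks the pool of candidate attributes, and hence the cost, at each level down the tree.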

[10]

- A binomial distribution is used for a binary attribute, @_predicted.

- Either state its one parameter, i.e. P(@_predicted = T), *to optimum, finite accuracy*,

- or do not code an explicit distribution in the leaf; instead use the *adaptive coding method* for the items arriving at the leaf,

- simpler, and the data message length is a fraction of a bit less per leaf.

[11]

NB. This only applies to the predicted attribute(s); input attributes are given, i.e. common knowledge.

(#@_predicted = T) × -log_2 P'(T) + (#@_predicted = F) × -log_2(1 - P'(T))

The MML estimator for a 2-state distribution is:

P'(T) = (#T_obs + 1/2) / (#T_obs + #F_obs + 1)

P'(F) = (#F_obs + 1/2) / (#T_obs + #F_obs + 1)
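Both leaf-coding options can be sketched as follows (a hedged illustration; the function names are mine, the estimator is the one above):

```python
import math

def mml_estimate(n_true: int, n_false: int) -> float:
    """MML estimator for a 2-state distribution:
    P'(T) = (#T_obs + 1/2) / (#T_obs + #F_obs + 1)."""
    return (n_true + 0.5) / (n_true + n_false + 1)

def data_msg_len(n_true: int, n_false: int, p_true: float) -> float:
    """Bits to state the predicted attribute's values given P'(T) = p_true."""
    return n_true * -math.log2(p_true) + n_false * -math.log2(1.0 - p_true)

def adaptive_msg_len(values) -> float:
    """Adaptive code: cost each arriving item with the estimate
    from the items seen so far; no parameter needs to be stated."""
    bits, n_true, n_false = 0.0, 0, 0
    for v in values:
        p = mml_estimate(n_true, n_false)
        bits += -math.log2(p if v else 1.0 - p)
        if v:
            n_true += 1
        else:
            n_false += 1
    return bits
```

For example, for a leaf receiving 5 ignition problems (T) and 3 non-problems (F), `data_msg_len(5, 3, mml_estimate(5, 3))` gives the data cost alone (the cost of stating P'(T) would be extra), while `adaptive_msg_len([True]*5 + [False]*3)` gives the complete adaptive cost; the adaptive total is the same whatever order the items arrive in.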

[12]

- Tree consists of fork-nodes (splits, tests) and leaf distributions

- Tree message length (cost)
- topology
- for each fork: split attribute
- for each leaf: distribution parameters

- Data message length (cost) given Tree
- cost of each item under the 2-state (or multi-state, etc.) distribution in its leaf

- allows simple trees and complex trees to be compared fairly.

- C. S. Wallace & J. D. Patrick, *Coding Decision Trees*, Machine Learning, 11, pp. 7-22, 1993.

- Introduced the tree code and message length calculations for decision trees; tests demonstrate good generalization and resistance to over-fitting. [Notes]

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.