<< [01] >>

Earlier material covered trees over binary attributes, e.g. T/F, Head/Tail, 0/1, M/F, ...; here we cover

- multi-state and

- continuous attributes, and

- some other generalizations.

This document is online at http://www.csse.monash.edu.au/~lloyd/Archive/2005-08-Tree02/index.shtml and contains hyper-links to other resources.

<< [02] >>

`F F L L F L L L`

<< [03] >>

`F F L L F L L L`

The old stopping rule, #L = #F + 1, only works for binary trees.

But *if* the arity of each fork is known
before its sub-trees are transmitted
then subtrees can be matched to parents and
the end of the message can be detected.
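For instance, a pre-order code like `F F L L F L L L` can be decoded as soon as each fork's arity is known at the moment its `F` is read. A minimal sketch (illustrative, not from the notes; every fork is assumed binary here):

```python
# Sketch (illustrative): decode a pre-order tree code when the arity of
# each fork is known as soon as its 'F' symbol is read.
# Here every fork is assumed to have arity 2.

def decode(symbols, arity=2):
    """Read one (sub)tree from the symbol iterator; return nested tuples."""
    s = next(symbols)
    if s == 'L':
        return 'L'
    # s == 'F': read exactly `arity` sub-trees, so sub-trees are matched
    # to parents and the end of the message is detectable.
    return ('F',) + tuple(decode(symbols, arity) for _ in range(arity))

print(decode(iter("FFLLFLLL")))
# -> ('F', ('F', 'L', 'L'), ('F', 'L', 'L'))
```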

- A *full* ternary tree with `n' forks has 2n+1 leaves - proof by induction
(consider replacing a leaf with a fork + children).

- A *full* tree of constant fan-out, with `n' forks, has (fan_out-1)×n+1 leaves.

So P(F) and P(L) are no longer equal, in general.
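The constant-fan-out leaf count is easy to check numerically (a quick sketch, not part of the notes):

```python
# Check (illustrative): a full tree of constant fan-out f with n forks
# has (f-1)*n + 1 leaves.
def leaves(fan_out, n_forks):
    return (fan_out - 1) * n_forks + 1

# Induction step: replacing a leaf by a fork with fan_out leaf children
# adds one fork and fan_out - 1 net leaves.
assert leaves(3, 4) == leaves(3, 3) + (3 - 1)

print(leaves(3, 4))   # a full ternary tree with 4 forks has 9 leaves
print(leaves(2, 3))   # fan-out 2 recovers the binary rule #L = #F + 1
```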

<< [04] >>

- When the arities of attributes *differ*, a practical solution is to:

- Assume P(F) = P(L) = 0.5 for the root node.

- Use the arity of a node to estimate P(F) & P(L) for its sub-trees.
(Necessary to give code-words for F and L.)

- P(F) = 1/arity, P(L) = (arity-1)/arity. *Use this rule recursively*.
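A sketch of the resulting structure message length (the function name and node representation are assumptions, not the lecture's notation):

```python
import math

# Sketch (names assumed): message length, in bits, of a tree's structure.
# Root: P(F) = P(L) = 0.5; below a fork of arity a, each sub-tree is a
# fork with probability 1/a and a leaf with probability (a-1)/a.

def tree_len(node, p_fork=0.5):
    """node is ('L',) for a leaf, or ('F', arity, sub_tree, ...)."""
    if node[0] == 'L':
        return -math.log2(1.0 - p_fork)
    arity = node[1]
    bits = -math.log2(p_fork)                 # code-word for 'F' here
    for sub in node[2:]:
        bits += tree_len(sub, 1.0 / arity)    # the rule, recursively
    return bits

# a root fork of arity 3 over three leaves: 1 + 3*log2(3/2) bits
t = ('F', 3, ('L',), ('L',), ('L',))
print(round(tree_len(t), 3))   # -> 2.755
```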

<< [05] >>

e.g. height, weight, temperatures, ...

The Wallace-Patrick solution is to use the values present in the attribute (common knowledge) as potential splitting values:
sort the values, and allocate code-words thus:

             quartile      median      quartile       etc.
    -----|------|------|------|------|------|------|-----
    P:      1/32    1/8   1/32    1/2   1/32    1/8   1/32
    code:  00100    011  00101      1  00110    010  00111
    msgLen:    5      3      5      1      5      3      5   bits

An utterly insignificant amount of lost probability
can be saved by renormalising so that the *finite* series sums to 1.0.
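The code-word lengths in the table follow a recursive-halving pattern, sketched below (the function name is made up): the median costs 1 bit, the quartiles 3 bits, the octiles 5 bits, and so on.

```python
# Sketch (function name made up): Wallace-Patrick-style code lengths for
# a continuous split value, using the n sorted data values as candidates.
# Depth d of a recursive halving of the sorted list costs 2*d + 1 bits.

def split_code_lengths(n):
    """msgLen in bits for each of n sorted candidate split values."""
    bits = [0] * n

    def assign(lo, hi, depth):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        bits[mid] = 2 * depth + 1      # median of this range
        assign(lo, mid - 1, depth + 1)
        assign(mid + 1, hi, depth + 1)

    assign(0, n - 1, 0)
    return bits

print(split_code_lengths(7))   # -> [5, 3, 5, 1, 5, 3, 5], as in the table
```

Note that the implied probabilities 2^-msgLen sum to 7/8 here, not 1.0, which is exactly the small lost probability that renormalising recovers.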

<< [06] >>

MML,
msgLen(T) + msgLen(D|T),
allows two trees to be *compared* fairly and
the better one chosen.

The *search problem* is to find
the best tree of all (hopefully), or
at least to find a *good* tree.

Greedy search with limited lookahead of `L' levels works well, where L is a small integer, L>=0.
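A toy sketch of such a search (all names, and the crude ~1-bit structure costs, are assumptions standing in for msgLen(T) + msgLen(D|T), not the lecture's actual costing):

```python
import math

# Toy sketch: greedy choice of split attribute with limited lookahead.

def leaf_len(labels):
    """Bits to state 'leaf' (~1 bit, crude) plus the labels in the leaf."""
    bits = 1.0
    for c in set(labels):
        p = labels.count(c) / len(labels)
        bits -= labels.count(c) * math.log2(p)
    return bits

def grow(rows, labels, attrs, lookahead):
    """Return (bits, best_attribute_or_None), scoring `lookahead` levels deep."""
    best_bits, best = leaf_len(labels), None
    if lookahead < 0 or not attrs:
        return best_bits, best
    for a in attrs:
        parts = {}
        for r, y in zip(rows, labels):
            parts.setdefault(r[a], []).append((r, y))
        bits = 1.0 + math.log2(len(attrs))   # ~'fork' + name the attribute
        rest = [b for b in attrs if b != a]
        for part in parts.values():
            bits += grow([r for r, _ in part],
                         [y for _, y in part], rest, lookahead - 1)[0]
        if bits < best_bits:
            best_bits, best = bits, a
    return best_bits, best

rows = [{0: 'a', 1: 'x'}, {0: 'a', 1: 'y'}, {0: 'b', 1: 'x'}, {0: 'b', 1: 'y'}]
labels = ['+', '+', '-', '-']
print(grow(rows, labels, [0, 1], 2)[1])   # -> 0: attribute 0 separates classes
```

Committing only the best top-level split and re-searching below it gives the full greedy algorithm.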

<< [07] >>

Split-nodes of a tree partition data by hyper-planes perpendicular to the axes of the data measurements.

Each leaf models data in one element of the partition.

- n-state data is modelled by an n-state distribution.

- Continuous data can be modelled by a continuous distribution.

- Multi-variate data can be modelled by a multi-variate distribution - in principle.

<< [08] >>

*Fragmentation:*
High-arity discrete splits quickly divide the training data
into small sets, stopping elaboration of the tree.

In principle, a binary split is possible on an n-state attribute
by specifying a *set* of values to go left;
the complement goes right.
The message length is n bits per test.

In principle, arbitrary functions can be used to send data to left or right (or more) sub-trees, but their parameters must be stated (cost).

<< [09] >>

- Greedy search, with

- limited lookahead of L levels, L>=0

<< [10] >>

- n-ary (split) attributes
- modified coding of the tree

- continuous (split) attributes
- coding of the split value(s)
- precision

- possible generalizations of trees

- C. S. Wallace & J. D. Patrick. *Coding Decision Trees.* Machine Learning 11, pp. 7-22, 1993.

- Introduced the tree code and message-length calculations for decision trees; tests demonstrate good generalization and resistance to over-fitting.

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.