<< [01] >>

# (Decision) Classification Trees II

We have covered trees over binary attributes, e.g. T/F, Head/Tail, 0/1, M/F, ...; here we cover

1. multi-state and

2. continuous attributes, and

3. some other generalizations.

This document is online at   http://www.csse.monash.edu.au/~lloyd/Archive/2005-08-Tree02/index.shtml   and contains hyper-links to other resources.

<< [02] >>

`F F L L F L L L`

<< [03] >>
```
F F L L F L L L
```

The old stopping rule that   #L=#F+1   only works for binary trees.

But if the arity of each fork is known before its sub-trees are transmitted then subtrees can be matched to parents and the end of the message can be detected.

A full ternary tree with `n' forks has 2n+1 leaves;
proof by induction (consider replacing a leaf with a fork and its three children).
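The end-of-message detection and the leaf-count formula can both be checked with a small sketch; `count_leaves` is a made-up name, and the ternary example string is mine, not the slide's.

```python
def count_leaves(code, arity):
    """Parse a preorder F/L tree code with known constant fork arity.
    The end of the message is detected when no sub-trees remain pending."""
    pending = 1                  # sub-trees still expected (start with the root)
    forks = leaves = used = 0
    for sym in code:
        used += 1
        pending -= 1             # this symbol accounts for one expected sub-tree
        if sym == 'F':
            forks += 1
            pending += arity     # a fork promises `arity` sub-trees
        else:
            leaves += 1
        if pending == 0:         # end of message detected
            break
    assert pending == 0 and used == len(code), "ill-formed tree code"
    return forks, leaves

# e.g. a full ternary tree: a root fork whose middle child is also a fork
f, l = count_leaves('F L F L L L L'.split(), arity=3)
# here f == 2 and l == 5, agreeing with leaves = (fan_out - 1) * n + 1
```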

A full tree of constant fan-out, with `n' forks has (fan_out-1)×n+1 leaves.

So P(F) and P(L) are no longer equal in general.   e.g. If all attributes are ternary, odds of F:L = 1:2,   i.e. P(F) = 1/3, P(L) = 2/3.

<< [04] >>
When arities of attributes differ, a practical solution is to:

Assume   P(F) = P(L) = 0.5   for the root node.

Use the arity of a node to estimate P(F) & P(L) for its sub-trees. (Necessary to give code-words for F and L.)

P(F) = 1/arity,   P(L) = (arity-1)/arity

Use this rule recursively.
(NB. No problem with n-ary attributes in leaves. Use the [multi-state] distribution.)
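The recursive rule above gives a message length for a tree's structure; a minimal sketch (tree representation and the name `structure_msg_len` are mine):

```python
from math import log2

def structure_msg_len(node, p_fork=0.5):
    """Message length (bits) of a tree's F/L structure.
    node = ('L',) for a leaf, or ('F', arity, [children]) for a fork.
    The root uses P(F) = P(L) = 0.5; each child of a fork of arity a
    uses P(F) = 1/a and P(L) = (a-1)/a, applied recursively."""
    if node[0] == 'L':
        return -log2(1.0 - p_fork)
    _, arity, children = node
    bits = -log2(p_fork)
    for child in children:
        bits += structure_msg_len(child, p_fork=1.0 / arity)
    return bits

# a binary fork with two leaf children: 1 bit for 'F' + 1 bit per 'L' = 3 bits
tree = ('F', 2, [('L',), ('L',)])
```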

<< [05] >>

# Continuous (Split) Attributes

e.g. height, weight, temperatures, ...

The Wallace-Patrick solution is to use the values present in the attribute's data (common knowledge) as potential splitting values:

sort the values,   and allocate code-words thus:

```              quartile    median      quartile      etc.
-----|-----|-----|-----|-----|-----|-----|-----

P:        1/32  1/8   1/32  1/2   1/32  1/8   1/32
code:    00100  011  00101   1   00110  010  00111

msgLen:    5     3     5     1     5     3     5    bits
```

An utterly insignificant amount of lost probability can be saved by renormalising so that the finite series sums to 1.0.

This method allows a split-value to be stated imprecisely and cheaply, e.g. just one bit for the median, or more precisely and expensively if justified.
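The code lengths in the table above (1 bit for the median, 3 for the quartiles, 5 for the next level, ...) follow a binary-search pattern over the sorted values; a sketch, where `split_code_lens` is a made-up name and only the lengths, not the particular code-words, are reproduced:

```python
def split_code_lens(n):
    """Code length (bits) for stating each of n sorted candidate split
    values: the median costs 1 bit, the quartiles 3 bits, and each
    further level of the implicit balanced search tree adds 2 bits,
    i.e. P = (1/2) * (1/4)**depth, as in the slide's table."""
    lens = [0] * n
    def assign(lo, hi, depth):
        if lo > hi:
            return
        mid = (lo + hi) // 2
        lens[mid] = 1 + 2 * depth
        assign(lo, mid - 1, depth + 1)
        assign(mid + 1, hi, depth + 1)
    assign(0, n - 1, 0)
    return lens

# for 7 candidate values this reproduces the msgLen row: 5 3 5 1 5 3 5
```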

<< [06] >>

# Search Problem

MML,   msgLen(T) + msgLen(D|T),   allows two trees to be compared fairly and the better one chosen.

The search problem is to find the best tree of all (hopefully), or at least to find a good tree.

Greedy search with limited lookahead of `L-levels' works well, where L is a small integer L>=0.
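A greedy, zero-lookahead search can be sketched as below. The leaf code used here, log2(n+1) + log2 C(n,k) bits for a binary class sequence, is an assumption for illustration, not the exact Wallace-Patrick leaf code, and `grow` and `leaf_len` are made-up names:

```python
from math import log2, comb

def leaf_len(classes):
    # combinatorial code for a binary class sequence (assumed leaf code):
    # state the number of successes, then which arrangement occurred
    n, k = len(classes), sum(classes)
    return log2(n + 1) + log2(comb(n, k))

def grow(rows, classes, attrs, arities, p_fork=0.5):
    """Greedy, 0-lookahead MML tree: split only if it shortens the total
    two-part message.  rows[i][a] is in range(arities[a])."""
    best_len = -log2(1 - p_fork) + leaf_len(classes)   # cost as a leaf
    best = ('L', classes)
    for a in attrs:
        parts = [[] for _ in range(arities[a])]
        for row, c in zip(rows, classes):
            parts[row[a]].append((row, c))
        if any(not p for p in parts):
            continue
        # fork cost: say 'F', name the attribute, then each child as a leaf
        length = -log2(p_fork) + log2(len(attrs))
        for p in parts:
            length += -log2(1 - 1 / arities[a]) + leaf_len([c for _, c in p])
        if length < best_len:
            best_len, best = length, ('F', a, parts)
    if best[0] == 'L':
        return best
    _, a, parts = best
    rest = [x for x in attrs if x != a]
    return ('F', a, [grow([r for r, _ in p], [c for _, c in p],
                          rest, arities, 1 / arities[a]) for p in parts])
```

With a perfectly predictive attribute the split pays for itself and `grow` forks; with uncorrelated classes the leaf is cheaper and it stops.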

NB. Generalizations of the model space make the search problem harder.

<< [07] >>

# Generalized Leaves

Split-nodes of a tree partition data by hyper-planes perpendicular to the axes of the data measurements.

Each leaf models data in one element of the partition.

• n-state data can be modelled by an n-state distribution,
• continuous data by a continuous distribution, and
• multi-variate data by a multi-variate distribution
- in principle.

<< [08] >>

# Generalized Splits

## n-state discrete

Fragmentation: High-arity discrete splits quickly divide training data into small sets, stopping elaboration of the tree.

In principle, a binary split is possible on an n-state attribute by specifying the set of values that go left; the complement goes right. Message length is n bits per test.
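Such a test can be chosen by trying every subset of values, charging n bits (one left/right bit per value) for stating it; a sketch, again using the assumed combinatorial leaf code, with `best_binary_split` a made-up name:

```python
from math import log2, comb
from itertools import combinations

def leaf_len(classes):
    # combinatorial code for a binary class sequence (assumed leaf code)
    n, k = len(classes), sum(classes)
    return log2(n + 1) + log2(comb(n, k))

def best_binary_split(values, classes, arity):
    """Binary split on an n-state attribute: choose the subset of values
    sent left; stating the subset costs `arity` bits.  (A subset and its
    complement give the same split, so half the trials are redundant.)"""
    best = None
    for r in range(1, arity):
        for left in combinations(range(arity), r):
            left = set(left)
            lc = [c for v, c in zip(values, classes) if v in left]
            rc = [c for v, c in zip(values, classes) if v not in left]
            if not lc or not rc:
                continue
            cost = arity + leaf_len(lc) + leaf_len(rc)
            if best is None or cost < best[0]:
                best = (cost, left)
    return best
```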

## other

In principle, arbitrary functions can be used to send data to left or right (or more) sub-trees, but their parameters must be stated (cost).

NB. Generalizations make the search problem harder.

<< [09] >>

# Search

Typically:
• Greedy search, with

• limited lookahead of  L  levels, L>=0

<< [10]

# Conclusions

Treated
• n-ary (split) attributes
• modified coding of the tree
• continuous (split) attributes
• coding of the split value(s)
• precision
• possible generalizations of trees

### References

• C. S. Wallace & J. D. Patrick, Coding Decision Trees, Machine Learning, 11, pp. 7-22, 1993.
- Introduced the tree code and message-length calculations for decision trees; tests demonstrated good generalization and resistance to over-fitting.

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.