^CSE454^ [01] >>

Classification- (Decision) Tree

See L. Allison, Models for Machine Learning and Data Mining in Functional Programming, J. Functional Programming (JFP), 15(1), pp.15-32, January 2005, and also [II].

CSE454 2005 : This document is online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   and contains hyper-links to other resources - Lloyd Allison ©.

^CSE454^ << [02] >>

Partitioning on input attributes

Discrete attribute:
k-way split where k is the `arity' of the attribute
Continuous attribute:
Try median, left-quartile, right-quartile, octiles, ...
NB. depends on current (part of) data set
Merge the splitting options for all attributes.
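
As a concrete sketch (the function name and the crude quantile rule are illustrative, not from the paper), candidate thresholds for one continuous attribute of the *current* data subset might be computed as:

```haskell
import Data.List (sort)

-- Crude quartile thresholds (incl. the median) of the current data
-- subset; a real `splits' merges such options over all attributes.
candidateThresholds :: [Double] -> [Double]
candidateThresholds xs =
  let s    = sort xs
      n    = length s
      at q = s !! min (n - 1) ((n * q) `div` 4)
  in map at [1, 2, 3]
```

E.g. on [3,1,4,1,5,9,2,6] this gives the thresholds [2.0,4.0,6.0].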
^CSE454^ << [03] >>
data CTreeType inSpace opSpace =

    CTleaf (ModelType opSpace)

  | CTfork MessageLength
           (Splitter inSpace)
           [CTreeType inSpace opSpace]

  | ...

  -- can also consider other non-traditional options

A `Splitter' can be used to partition the input space.
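
One possible concrete representation of a Splitter, consistent with the interface used on later slides (applySplitter, aritySplitter) but otherwise an assumption of this sketch:

```haskell
-- A Splitter maps an input datum to the number of the part it falls
-- in.  (Illustrative representation only; the paper's differs.)
data Splitter inSpace =
  Splitter { aritySplitter :: Int               -- number of parts
           , applySplitter :: inSpace -> Int }  -- part of a datum

-- e.g. a 2-way split of a continuous attribute at a threshold
thresholdSplitter :: Double -> Splitter Double
thresholdSplitter t = Splitter 2 (\x -> if x <= t then 0 else 1)
```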

^CSE454^ << [04] >>
instance SuperModel (CTreeType inSpace opSpace) where

 -- NB. For simplicity only, this costs the
 -- structure at 1-bit per node.  This is
 -- only optimal for binary trees.

 msg1 (CTleaf leafModel)
   = log2 + msg1 leafModel

 msg1 (CTfork fnLen f dts)
   = log2 + fnLen + sum (map msg1 dts)

Message length of a tree is that of the node plus those of the sub-trees, if any.
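
A self-contained toy version of this costing, with each leaf model's message length supplied directly as a number (an assumption made purely for illustration; real leaves cost their parameters):

```haskell
log2 :: Double
log2 = log 2

-- Simplified tree: leaves carry a given model cost (in nits),
-- forks a given split cost.
data Toy = Leaf Double | Fork Double [Toy]

treeMsg :: Toy -> Double
treeMsg (Leaf m)       = log2 + m   -- 1 bit says "leaf", plus model
treeMsg (Fork fLen ts) = log2 + fLen + sum (map treeMsg ts)
```

A fork over two leaves costing 1 and 2 nits, with a zero-cost split, totals 3 log 2 + 3 nits: three 1-bit node labels plus the leaf models.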

^CSE454^ << [05] >>

Make a Classification Tree an instance of a FunctionModel:

instance FunctionModel CTreeType where

  condModel (CTleaf leafModel) i = leafModel

  condModel (CTfork fnLen f dts) i
    = condModel (dts !! (applySplitter f i)) i
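
The routing behaviour of condModel can be seen in a stripped-down sketch (here the fork stores the split function directly, an assumption to keep the example self-contained):

```haskell
-- Simplified tree over Double inputs; leaves hold the prediction.
data Tree op = TLeaf op | TFork (Double -> Int) [Tree op]

-- follow the input down to the leaf of the part it falls in
predictAt :: Tree op -> Double -> op
predictAt (TLeaf m)    _ = m
predictAt (TFork f ts) i = predictAt (ts !! f i) i
```

E.g. with a 2-way fork at 5, `predictAt` sends input 7 to the second sub-tree and input 2 to the first.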
^CSE454^ << [06] >>

estCTree : Estimate a Classification Tree

It is convenient to have an inner, local ``search'' function:

estCTree  estLeafMdl splits  ipSet opSet =
 let  -- NB. estLeafMdl must be able to
      --     handle an empty data set

  search ipSet opSet =
       . . .
^CSE454^ << [07] >>

Simplest possible tree is a single leaf node:

leaf    = CTleaf leafMdl

leafMdl = estLeafMdl opSet

leafMsg = msg (functionModel2model leaf)
              (zip ipSet opSet)
^CSE454^ << [08] >>

A simple recursive search for the best (1- or) 2-level tree;

base case:

alternatives  []
 bestML bestCTree bestIpParts bestOpParts
 = (bestCTree, bestIpParts, bestOpParts)  -- done


^CSE454^ << [09] >>

...general case:

alternatives (sp:sps)
 bestML bestCTree bestIpParts bestOpParts =
  let
   -- NB. the uniform split cost, log (length splts),
   --     below is acceptable but not optimal
   theTree  = CTfork (log (fromIntegral (length splts)))
                     sp leaves
   leaves   = map CTleaf leafMdls     -- one leaf ...
   leafMdls = map estLeafMdl opParts  -- ... per part
   partNums = map (applySplitter sp) ipSet
   ipParts  = partition (aritySplitter sp) partNums ipSet
   opParts  = partition (aritySplitter sp) partNums opSet
   msgLen   = msg (functionModel2model theTree)
                  (zip ipSet opSet)
  in
   if msgLen < bestML
   then alternatives sps msgLen theTree ipParts opParts
   else alternatives sps bestML bestCTree bestIpParts bestOpParts
^CSE454^ << [10] >>
  splts = splits ipSet

  case  alternatives splts leafMsg leaf [ipSet] [opSet]  of

   -- subtrees?
   ((CTfork msgLen pf leaves), ipParts, opParts) ->
     CTfork msgLen pf (zipWith search ipParts opParts);

   -- the single leaf wins?
   (t, _, _) -> t

in search ipSet opSet

-- ----------9/2002--6/2003--L.Allison--CSSE--Monash--.au--
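
The three-argument `partition' used in the search above is not the Data.List function; a plausible reading (an assumption of this sketch) groups each datum by its pre-computed part number:

```haskell
-- partition arity partNums xs: the i-th result list holds those
-- elements of xs whose corresponding part number is i.
partition :: Int -> [Int] -> [a] -> [[a]]
partition arity partNums xs =
  [ [ x | (n, x) <- zip partNums xs, n == i ] | i <- [0 .. arity - 1] ]
```

E.g. `partition 2 [0,1,0,1] "abcd"` gives `["ac","bd"]`, so the input and output lists are split in parallel by applying it to the same part numbers.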
^CSE454^ << [11] >>


Can estimate (fit, infer, learn) a classification-tree given (i) ways of partitioning the data on input attributes and (ii) an estimator for leaf models.
NB. Any kind of leaf model can be used -- discrete, continuous, or multivariate;   even...

... FunctionModel- (Regression-) Trees

Use a conversion function estFunctionModel2estModel:

estFunctionModel2estModel estFn ipOpPairs =
  functionModel2model (uncurry estFn (unzip ipOpPairs))

ft = estCTree (estFunctionModel2estModel estFnMdl)
        splits trainIp trainOp
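
The unzip/uncurry plumbing of estFunctionModel2estModel, in isolation (with an arbitrary two-argument estimator standing in for estFn, purely for illustration):

```haskell
-- turn a two-argument estimator, taking inputs and outputs
-- separately, into one taking (input, output) pairs
onPairs :: ([i] -> [o] -> m) -> [(i, o)] -> m
onPairs estFn = uncurry estFn . unzip
```

E.g. `onPairs (\is os -> sum is + sum os) [(1,2),(3,4)]` is 10.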


© 2005 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.