Inductive Programming: Data

## Data

A type system probably cannot(?) do all of the following:

- Some variables have an origin, some do not, e.g.,
  - position has an origin;
  - length has a scale but does not have an origin.
  - A length may be multiplied, but a position may not(?) be multiplied.
  - Two positions may not be added, but the difference between two positions is a length.
- Units, e.g., feet and inches v. m, cm, and mm,
- and dimensions, e.g., length, mass, time, electric-current, temperature, amount-of-substance (mole), luminous-intensity (candela): the 7 base SI dimensions for physics.
  - acceleration = length.time^-2
  - mass x acceleration = mass.length.time^-2 = force = momentum/time
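Part of this list can be approximated in plain Haskell 98. The following is only a minimal sketch, and every name in it (`Length`, `Position`, `difference`, `Dim`, and so on) is hypothetical rather than taken from any library. Separate newtypes let the compiler reject adding two positions, simply because no such operation is provided. Dimensions, on the other hand, are checked here at run time as exponent triples, not by the type system, and only three of the seven base dimensions are modelled.

```haskell
-- Hypothetical names throughout; metres are assumed as the internal unit.
newtype Length   = Length Double   deriving (Eq, Ord, Show)  -- has a scale, no origin
newtype Position = Position Double deriving (Eq, Ord, Show)  -- measured from an origin

-- A length may be multiplied by a plain number.
scale :: Double -> Length -> Length
scale k (Length x) = Length (k * x)

-- The difference between two positions is a (signed) length.
difference :: Position -> Position -> Length
difference (Position a) (Position b) = Length (a - b)

-- A position may be displaced by a length.  No function adds two
-- positions, so that mistake cannot be written with this interface.
displace :: Position -> Length -> Position
displace (Position p) (Length d) = Position (p + d)

-- Units: convert on the way in (1 inch = 0.0254 m).
feetAndInches :: Double -> Double -> Length
feetAndInches ft inches = Length ((12 * ft + inches) * 0.0254)

-- Dimensions as exponents of length, mass and time, checked at run time.
data Dim = Dim Int Int Int deriving (Eq, Show)

mulDim, divDim :: Dim -> Dim -> Dim
mulDim (Dim l m t) (Dim l' m' t') = Dim (l + l') (m + m') (t + t')
divDim (Dim l m t) (Dim l' m' t') = Dim (l - l') (m - m') (t - t')

lengthD, massD, timeD, accelerationD, forceD, momentumD :: Dim
lengthD       = Dim 1 0 0
massD         = Dim 0 1 0
timeD         = Dim 0 0 1
accelerationD = lengthD `divDim` (timeD `mulDim` timeD)  -- length.time^-2
forceD        = massD `mulDim` accelerationD             -- mass.length.time^-2
momentumD     = massD `mulDim` (lengthD `divDim` timeD)  -- mass.length.time^-1
```

With these definitions `forceD == momentumD `divDim` timeD` holds, matching force = momentum/time, while an expression that tries to add two `Position` values has no function to call and is rejected at compile time.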

#### Some types, and classes, of data.

- atomic: `data T = C1 | C2 | ... | Cn deriving (Bounded, Enum)`
  - e.g. `data Boolean = True | False`
  - e.g. `data Gender = Male | Female`
  - e.g. `data DNA = A | C | G | T`
  - e.g. `data Party = Liberal | Labor | Democrat | Green | Indep`
  - NB. Something changes qualitatively for a "large" number of categories, maybe even for 7+ or 10+.
  - NB. Bounded and Enum do not imply any semantic (non-arbitrary) order on the values; see ordered, below.
- ordered: as above plus `deriving (..., Ord)`
  - e.g. `data Quality = Bad | Poor | Avg | Fair | Good`
  - e.g. `data Topography = Mountains | Foothills | Plain | Coastal`
  - See [missing persons].
- hierarchic, partially ordered: e.g. reptile | mammal( rodent | primate( chimp | gorilla))
  - One method: `data Animal = Reptile | Mammal (Maybe M)`, `data M = Rodent | Primate (Maybe P)`, `data P = Chimp | Gorilla`.
  - "Is a primate" ~ `Mammal (Just (Primate Nothing))`, which implies "is a mammal" ~ `Mammal Nothing :: Animal`.
  - Model by a suitable collection of multistate models.
  - Or set-based, Primate = {Chimp, Gorilla} etc., cf. DNA. Also see measurement accuracy, discrete.
- positive, >0; similarly non-negative, >=0.
- e.g. day of the week, month (cyclic).
- `Float`, `Double`: e.g. voltage, position (1D), velocity (1D).
- complex: (rl,im) or (r,θ); see vector.
- e.g. mass, length, speed.
- e.g. angle (cyclic).
- tuple: `(T1, T2, ..., Tm)`, or constructor: `data Person = Person String Int`, or: `data Person = Person{name::String, age::Int}`, or array (homogeneous), or list, `[t]`, (homogeneous).
- vector: array (homogeneous), e.g. m-dimensional position, velocity, force, etc.
- list: `[t]` --list of t
  - e.g. DNA sequence, annual weather data, daily stock exchange data, visits to doctor, etc.
  - Element type can be multivariate.
- set: list of members, e.g. [set of mutations], or vector (bit map); the two are equivalent in principle but not necessarily in practice, especially for sparse sets.
- The sky is the limit; new data-types are usually structured data, e.g. `data Person = Male | Female Int` --#pregnancies!
- optional: `Maybe t`, but with different semantics from missing data (below)! Also `Either t1 t2` --standard H98.
  - Really a special kind of structured data.
  - Model as discrete plus a suitable model for t.
  - Whether an optional t was in fact present or not could itself be missing, and if it was known to be present then the value could be missing or not!
- measurement accuracy: (a) fixed, ±δ, or (b) relative, ±x%, or (c) arbitrary, range (lo,hi), per datum.
  - NB. Omitting to deal with accuracy in a data transformation can affect inferences; it is safer to inverse-transform the model.
- measurement accuracy, discrete: e.g. DNA H={A,C,T}, ..., R={A,G}, Y={C,T}, K={G,T}, N={A,C,G,T} ~missing?!
  - (A 4-bit "set" representation works nicely for many purposes.)
  - Also see hierarchic, partially ordered, above.
- missing: `data Maybe t = Nothing | Just t` --H98 standard type
  - There was a value but it was either not measured or not recorded.
  - (a) Missingness is common knowledge; need not be coded at all.
  - (b) Missingness is of known probability; can code using a fixed, given probability.
  - (c) Missingness is to be estimated once, globally, for use in all sub-models.
  - (d) Missingness is to be estimated per sub-model, and so may influence global model structure.
  - See [modelMaybe].
- censored: either `data Cnsrd t = Cnsrd | Normal t`, or transform the model.
  - Related to missing, and optional, but with different semantics.
  - E.g. a "sticky" voltmeter measures [0.0 .. 1.0]v as 0.0v.
  - A reasonable, although not perfect, way to model censored data is similar to what can be done for missing data, cases (c) or (d), above. (As in ecological segmentation '05.)
- weights: (i) integral: compacting repetitive values, (ii) fractional: part membership of a class in a [mixture model].
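Two of the entries above may be easier to follow with code in hand: the hierarchic type built from nested `Maybe`, and the 4-bit set representation of ambiguous DNA observations. The sketch below is an illustration only. The helpers `refines`, `isA`, `isM`, `BaseSet`, `fromBases`, `member`, `size`, `both` and the `code…` constants are hypothetical names introduced here, and the bit layout (A=1, C=2, G=4, T=8) is an assumption.

```haskell
import Data.Bits (bit, popCount, (.&.), (.|.))
import Data.Word (Word8)

-- Hierarchic data: Nothing means "not specified any further at this level".
data Animal = Reptile | Mammal (Maybe M)  deriving (Eq, Show)
data M      = Rodent  | Primate (Maybe P) deriving (Eq, Show)
data P      = Chimp   | Gorilla           deriving (Eq, Show)

-- refines f x y: is x at least as specific as y?  (hypothetical helper)
refines :: (a -> a -> Bool) -> Maybe a -> Maybe a -> Bool
refines _ _        Nothing  = True   -- anything refines "unspecified"
refines f (Just x) (Just y) = f x y
refines _ Nothing  (Just _) = False

-- isA x y: does x fall within the (possibly more general) class y?
isA :: Animal -> Animal -> Bool
isA Reptile    Reptile    = True
isA (Mammal x) (Mammal y) = refines isM x y
isA _          _          = False

isM :: M -> M -> Bool
isM Rodent      Rodent      = True
isM (Primate x) (Primate y) = refines (==) x y
isM _           _           = False

-- Discrete measurement accuracy: a set of possible bases held in 4 bits.
data DNA = A | C | G | T deriving (Eq, Show, Bounded, Enum)

newtype BaseSet = BaseSet Word8 deriving (Eq, Show)

fromBases :: [DNA] -> BaseSet
fromBases = BaseSet . foldr (\b acc -> bit (fromEnum b) .|. acc) 0

member :: DNA -> BaseSet -> Bool
member b (BaseSet s) = s .&. bit (fromEnum b) /= 0

-- Number of bases still possible; 4 behaves much like a missing value.
size :: BaseSet -> Int
size (BaseSet s) = popCount s

-- Bases consistent with both of two uncertain observations.
both :: BaseSet -> BaseSet -> BaseSet
both (BaseSet x) (BaseSet y) = BaseSet (x .&. y)

-- The ambiguity codes quoted in the text.
codeH, codeR, codeY, codeK, codeN :: BaseSet
codeH = fromBases [A, C, T]
codeR = fromBases [A, G]
codeY = fromBases [C, T]
codeK = fromBases [G, T]
codeN = fromBases [A, C, G, T]
```

For instance, `isA (Mammal (Just (Primate (Just Chimp)))) (Mammal Nothing)` is `True` (a chimp is a mammal), and `both codeR codeK` equals `fromBases [G]`, the only base consistent with both R and K.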
25/5/2006, LA.
