Given: a data set on a certain protein, collected from various individuals.
The data set includes a variable (attribute, column), mSets,
which is set-valued.
For an individual,
the set contains the positions at which mutations are observed
in the protein compared to the "normal" protein.
ms is the union of all the individual sets;
100+ positions have been observed to mutate at least once.
Most of the sets are sparse and contain fewer than six mutations.
The mutations could be stored as 100+ Boolean variables but
a set-valued representation is convenient, and
is the way that the data was presented.
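The link between the two representations can be sketched in a few lines. This is only an illustration: toySets is made-up data, and allPositions and column are hypothetical helper names, not part of the original program.

```haskell
import Data.List (nub, sort)

-- Hypothetical toy data: each inner list is one individual's set of
-- mutated positions (the real data set is not reproduced here).
toySets :: [[Int]]
toySets = [[12, 40], [40], [], [12, 40, 97], [97]]

-- Union of all the individual sets: every position that mutates at least once.
allPositions :: [[Int]] -> [Int]
allPositions = sort . nub . concat

-- Boolean column for one position: True where the individual carries it.
column :: Int -> [[Int]] -> [Bool]
column pos = map (elem pos)
```

Here allPositions toySets plays the role of ms, and column 40 toySets is the Boolean variable that would otherwise be stored explicitly for position 40.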
We want to see if knowing whether or not a set contains one mutation
tells us anything about whether or not it is likely to contain another.
We choose pairs of different mutation positions,
mut1 and mut2, from ms.
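Compute the information content of
mut2, first independent of mut1 by
  indep = estMultiState mut2s,
  indepCost = msg2 indep mut2s,
and then dependent on mut1 by
  depd = estFiniteFunction mut1s mut2s,
  depdCost = msg2 (functionModel2model depd) mut12s.
Here mut1s and mut2s are the Boolean columns that record, for each
individual, whether mut1 (respectively mut2) is present,
and mut12s = zip mut1s mut2s pairs them up.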
The difference in these costs shows how much information, if any,
knowing about mut1 gives us on mut2.
(Note this quantity is not necessarily symmetric
between mut1 and mut2.)
The savings, save = indepCost - depdCost, are sorted in decreasing order
and those that are not positive are discarded.
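The comparison can be tried out without the library by swapping in a simple stand-in for the estimators and for msg2. The sketch below assumes an adaptive (Laplace) code for two-state data, which is not necessarily what estMultiState and estFiniteFunction do, so its numbers will generally differ from the library's. The names adaptiveCost, indepCostOf, depdCostOf and saving are hypothetical.

```haskell
-- Message length, in bits, of a Boolean sequence under an adaptive code:
-- each value is coded with probability (count so far + 1) / (seen so far + 2).
-- This is a stand-in, NOT the library's estMultiState / msg2.
adaptiveCost :: [Bool] -> Double
adaptiveCost = go 0 0
  where
    go :: Int -> Int -> [Bool] -> Double
    go _ _ [] = 0
    go t f (b:bs) =
      let n    = fromIntegral (t + f + 2)
          c    = fromIntegral (if b then t + 1 else f + 1)
          bits = negate (logBase 2 (c / n))
      in bits + (if b then go (t + 1) f bs else go t (f + 1) bs)

-- Cost of mut2's column, ignoring mut1.
indepCostOf :: [Bool] -> Double
indepCostOf = adaptiveCost

-- Cost of mut2's column given mut1's: one adaptive code per value of mut1.
depdCostOf :: [Bool] -> [Bool] -> Double
depdCostOf mut1s mut2s =
  adaptiveCost [m2 | (True,  m2) <- pairs] +
  adaptiveCost [m2 | (False, m2) <- pairs]
  where pairs = zip mut1s mut2s

-- Positive when knowing mut1 shortens the message for mut2.
saving :: [Bool] -> [Bool] -> Double
saving mut1s mut2s = indepCostOf mut2s - depdCostOf mut1s mut2s
```

With two identical columns of eight True and eight False values the saving is positive. With columns that are unrelated, such as T,F,T,F,T,F,T,F against T,T,F,F,T,T,F,F, it comes out slightly negative, because the dependent model pays for an extra parameter and gains nothing from it.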
import Data.List (sortBy)  -- module List in Haskell 98

pairwise mSets ms =
  ((sortBy (\((_, _, s1):_) -> \((_, _, s2):_) -> compare s2 s1))
  .(filter (not . null)))  -- drop any mut1 with no positive saving
  [ m1s |
    mut1 <- ms,                    -- for each mutation position, mut1
    let mut1s = map (elem mut1) mSets
        m1s = ((takeWhile (\(_, _, s) -> s > 0))
              .(sortBy (\(_, _, s1) -> \(_, _, s2) -> compare s2 s1)))
              [ (mut1, mut2, save) |
                mut2 <- ms,        -- for each mutation position, mut2
                mut1 /= mut2,      -- must differ
                let mut2s     = map (elem mut2) mSets
                    mut12s    = zip mut1s mut2s
                    indep     = estMultiState mut2s                     -- 2
                    depd      = estFiniteFunction mut1s mut2s           -- 2|1
                    indepCost = msg2 indep mut2s                        -- cost 2
                    depdCost  = msg2 (functionModel2model depd) mut12s  -- cost 2|1
                    save      = indepCost - depdCost
              ]                    -- mut2 within mut1
  ]                                -- mut1
map, zip, elem, takeWhile,
sortBy, and list comprehensions [exp| x<-s, ...]
are standard Haskell-98 features.
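As a small illustration of the sorting idiom built from these standard functions, the sketch below sorts some made-up (mut1, mut2, saving) triples by decreasing saving and keeps only the positive prefix. The data and the name positiveSavings are hypothetical. Note that sortBy lives in Data.List in current compilers (module List in Haskell 98).

```haskell
import Data.List (sortBy)

-- Hypothetical (mut1, mut2, saving) triples; the numbers are made up.
triples :: [(Int, Int, Double)]
triples = [(12, 40, 3.5), (12, 97, -0.4), (12, 63, 7.1), (12, 8, 0.0)]

-- Largest saving first, then keep only the strictly positive prefix.
positiveSavings :: [(Int, Int, Double)] -> [(Int, Int, Double)]
positiveSavings =
  takeWhile (\(_, _, s) -> s > 0)
    . sortBy (\(_, _, s1) (_, _, s2) -> compare s2 s1)
```

Because the list is already in decreasing order, takeWhile can stop at the first non-positive saving instead of filtering the whole list.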
estMultiState, estFiniteFunction, msg2, and functionModel2model
are part of the Inductive Programming machine-learning library.
(See TID for more information on this problem.)