
# (DNA) Sequence Complexity and Alignment.

### Lloyd Allison, Computer Science and Software Engineering, Monash University, Victoria, Australia 3168

#### http://www.csse.monash.edu.au/~lloyd/tildeStrings/

This document is online at users.monash.edu/~lloyd/Seminars/2001-WEHI/index.shtml and contains hyperlinks to more detailed notes on certain topics and to some of the references.

Also see Modelling v. Shuffling.

The story...

• (Probabilistic) Finite State Automata (PFSA)
= Hidden Markov Models (HMM)

• Model complexity

• Alignment Algorithms

• Cost functions (gaps)

• Compression

• Alignment and Compression.

A very (!) old problem:

Learn a language.
• Babies do it.

• Gaines 1976:
CAAAB/ BBAAB/ CAAB/ BBAB/ CAB/ BBB/ CB

• Wallace & Georgeff 1983:
a PFSA (figure in the original slides)

• Model complexity v. fit

Model Complexity:

• Bayes (1763)
• P(H|D) = P(H) . P(D|H) / P(D)

• Fisher
• Shannon (c1948)
• msgLen(E) = -log2(P(E)) in optimal code, so  . . .
• msgLen(H&D) = msgLen(H) + msgLen(D|H)
= msgLen(D) + msgLen(H|D)
• msgLen(H) = model complexity
• msgLen(D|H) = data complexity | H
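The two-part message-length identities above can be sketched numerically. The biased-coin hypothesis, the data, and the parameter-precision cost below are hypothetical illustrations, not from the slides:

```python
import math

def msg_len(p):
    """Message length in bits of an event with probability p (Shannon: -log2 P)."""
    return -math.log2(p)

# Hypothetical toy example: hypothesis H says P(heads) = 0.8;
# data D is 10 heads and 2 tails.
p_heads = 0.8
data = "HHHHHHHHHHTT"

# msgLen(D|H): sum of per-symbol code lengths under H.
len_D_given_H = sum(msg_len(p_heads if c == "H" else 1 - p_heads) for c in data)

# msgLen(H): cost of stating the hypothesis, here pricing p_heads at
# ~0.01 precision -- a crude stand-in for a proper MML parameter cost.
len_H = msg_len(0.01)

# Two-part message: msgLen(H & D) = msgLen(H) + msgLen(D|H).
total = len_H + len_D_given_H
```

A more complex hypothesis lowers msgLen(D|H) but raises msgLen(H); the two-part total is what trades model complexity against fit.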
Alignment, [c1990]:

Mutation Machine. Mutation instructions: copy, change, insert, delete.

Equivalent Generation Machine. Generation instructions: match, mismatch, insert1, insert2.
(A.K.A. Pair Hidden Markov Models (PHMMs).)
What matters most is the "control box" (model):
• e.g. a 3-state machine ~ linear gap costs;
• e.g. a 5-state machine (more complex) ~ piece-wise linear gap costs, etc.
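As a sketch of why a 3-state machine corresponds to linear gap costs: if the machine stays in an insert (or delete) state with a fixed probability, gap lengths are geometrically distributed, so each extra gap position costs a constant number of bits. The probabilities below are made-up illustrative values:

```python
import math

# Hypothetical self-loop probability for an insert/delete state.
p_stay, p_leave = 0.5, 0.5

def gap_cost_bits(k):
    # A length-k gap = k-1 self-loops then one exit:
    # P(gap of length k) = p_stay**(k-1) * p_leave  (geometric),
    # so the cost in bits is linear in k.
    return -math.log2(p_stay ** (k - 1) * p_leave)

costs = [gap_cost_bits(k) for k in range(1, 5)]
# each extra gap position adds a constant -log2(p_stay) bits
```

A 5-state machine has separate "open" and "extend" self-loop probabilities per gap state, giving the piece-wise linear costs mentioned above.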
A twist: sum over all alignments, forward+backward, giving an alignment density.
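The sum-over-all-alignments twist can be sketched as a forward dynamic-programming pass that adds, rather than maximises, alignment probabilities. This is a toy pair-model with made-up instruction probabilities; the end-of-sequence instruction is omitted for brevity:

```python
import math

def total_prob(s1, s2, p_match=0.9, p_sub=0.1 / 3, p_gap_char=0.25):
    """Sum the probabilities of ALL alignments of s1 and s2 (forward pass)."""
    p_m, p_i, p_d = 1 / 3, 1 / 3, 1 / 3   # match/mismatch, insert, delete
    n, m = len(s1), len(s2)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:   # diagonal: (mis)match emits s1[i-1], s2[j-1]
                e = p_match if s1[i - 1] == s2[j - 1] else p_sub
                F[i][j] += F[i - 1][j - 1] * p_m * e
            if j:         # insert: emit s2[j-1] only
                F[i][j] += F[i][j - 1] * p_i * p_gap_char
            if i:         # delete: emit s1[i-1] only
                F[i][j] += F[i - 1][j] * p_d * p_gap_char
    return F[n][m]
```

Running the same recurrence backward from the other corner and multiplying forward and backward values cell-by-cell gives the alignment density referred to above.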

Compression ~ Sequence Complexity. (Also reverse-complementary approximate repeats.)

What matters most is the control box, e.g.

2-D plot from the compression algorithm: P. falciparum, chr2, 0.95 Mbp, in bits/b.p. [colour, b+w and 1-D plot versions in the original]. NB. the same twist: sum over all explanations; but there is not enough space for fwd+rev for a long sequence.
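As a sketch of how a compression algorithm yields a bits-per-base plot: an adaptive order-0 model, a much simpler stand-in for the approximate-repeats model behind the P. falciparum plot, charges each base -log2 of its predicted probability given the bases seen so far:

```python
import math

def info_content_bits(seq):
    """Per-base information content under an adaptive order-0 model (+1 counts)."""
    counts = {b: 1 for b in "ACGT"}
    bits = []
    for b in seq:
        total = sum(counts.values())
        bits.append(-math.log2(counts[b] / total))   # code length of this base
        counts[b] += 1                               # then update the model
    return bits
```

Compressible (e.g. repetitive) regions come out well under 2 bits/b.p.; a plot of these values along a chromosome is the 1-D plot referred to above.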

# Compression and Alignment.

• A sequence is compressible if you can predict (to some extent) what is coming next.

• Two sequences are related if one gives you useful information about the other.

• When aligning non-random sequences we have two sources of information:
• a sequence tells about itself, and
• one sequence tells about the other.
• We must combine the sources.

[Diagram: the alignment edit-graph. Seq_1 (. . . S1[i] . . .) runs along the top and Seq_2 (. . . S2[j] . . .) down the left side; a diagonal step is a (mis)match, a vertical step an insert, and a horizontal step a delete.]
To combine the sources of information one can, e.g.,
(a) build a "direct product" of the sub-models, Powell et al (1998), or
(b) use this approximation: . . .

Instruction probabilities (costs = -log2( . )):
• P(match) · ( P(S1[i] | S1[1 .. i-1]) + P(S1[i] | S2[1 .. j-1]) ) / 2

• P(mismatch) · P(S1[i] & S2[j] | S1[1 .. i-1] & S2[1 .. j-1] & S1[i] != S2[j])

• P(insert) · P(S2[j] | S2[1 .. j-1])

• P(delete) · P(S1[i] | S1[1 .. i-1])
See [Comp. Jrnl. (1999)].
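Approximation (b) can be sketched directly from the instruction probabilities above; the probability values passed in below are placeholders for predictions from real sequence models:

```python
import math

def cost_bits(p):
    return -math.log2(p)

def match_cost(p_match, p_s1_self, p_s1_cross):
    # cost of a 'match' step emitting S1[i] == S2[j]:
    # -log2( P(match) * (P(S1[i] | S1[1..i-1]) + P(S1[i] | S2[1..j-1])) / 2 )
    return cost_bits(p_match * (p_s1_self + p_s1_cross) / 2)

def insert_cost(p_insert, p_s2_self):
    # cost of an 'insert' step emitting S2[j]:
    # -log2( P(insert) * P(S2[j] | S2[1..j-1]) )
    return cost_bits(p_insert * p_s2_self)
```

When the other sequence predicts the base well (large cross term), the match becomes cheaper, which is exactly how relatedness pays its way in the message length.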

# Results:

#### Some References

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.