
DNA Sequence Compression (Random) Example

> Compress: L.Allison, Computer Science, Monash University 4/1998
     1 GGATATCACG TAGTCCCTAG CTCTTGGCGC TGGATGGGGC GGACGGAAGG
    51 GAAACGACCG TTGAATTCCA AATTCGGTCG TATGGAAATA TTGCAATGGA  100

> order-0 Markov Model
>                                                                                                    |   4.0 +
>                                                                                                    |   3.5 b
>                                                                                                    |   3.0 b
>      . .     ...   . .    . .         .   .          .  ..        ..     .   .              .      |   2.5 b
>--....-.--..-.---..--.-..-----.--..-------.---..---...--.---..-....--.....---.--...--.......--...--.|-  2.0 b
>..       .  .      .     .. .  ..  .... ..  ..  ...    .   .  .            ..  .   ..       .    .. |   1.5 b
>                                                                                                    |   1.0 b
>                                                                                                    |   0.5 b
>                                                                                                    |   0.0 b
> compress: Sequence length=100, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =8.2 bits
> data:      (D|H) =197.4 bits, =1.9742 b/ch
> total: (H)+(D|H) =205.6 bits, =2.0564 b/ch
> ran 00/01/21  from 15:33:36  to 15:33:36  

> order-1 Markov Model
>                                                                                                    |   4.0 +
>                                                                                                    |   3.5 b
>                                                                                                    |   3.0 b
>.      .. . ..     ..      . .         .   .    .     .  .  .        .       .  .            ..     |   2.5 b
>--.-.-.----.--.....--....-----.--.--------.---.----.----.-.--.-.--...----..---.--.---.---.-.-------.|-  2.0 b
> . . .   .               .. .  .. ..... ..  .. . .. .. .   .  . ..    ...  ..  .  ... ... . .  .... |   1.5 b
>                                                                                                    |   1.0 b
>                                                                                                    |   0.5 b
>                                                                                                    |   0.0 b
> compress: Sequence length=100, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =21.1 bits
> data:      (D|H) =191.0 bits, =1.9097 b/ch
> total: (H)+(D|H) =212.1 bits, =2.1208 b/ch
> ran 00/01/21  from 15:33:36  to 15:33:36  

> AED fwd approx repeats
> [Frequencies B:99.1 R:0.3 C:0.6 E:0.3 =:0.8 ~:0.1 i:0.0 d:0.0 tot:101.3]
> [Frequencies B:98.5 R:0.7 C:0.9 E:0.7 =:0.7 ~:0.4 i:0.3 d:0.3 tot:102.7]
> [Frequencies B:97.9 R:1.1 C:1.2 E:1.1 =:0.8 ~:0.6 i:0.6 d:0.6 tot:104.0]
>                                                                                                    |   4.0 +
>                                                                                                    |   3.5 b
>                                                                                                    |   3.0 b
>.      .. . ..     ..      . .         .   .    .     .  .  .        .       .  .            ..     |   2.5 b
>--.-.-.----.--.....--....-----.--.--------.---.----.----.-.--.-.--...----..---.--.---.---.-.-------.|-  2.0 b
> . . .   .               .. .  .. ..... ..  .. . .. .. .   .  . ..    ...  ..  .  ... ... . .  .... |   1.5 b
>                                                                                                    |   1.0 b
>                                                                                                    |   0.5 b
>                                                                                                    |   0.0 b
> compress: Sequence length=100, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =31.6 bits
> data:      (D|H) =191.1 bits, =1.9110 b/ch
> total: (H)+(D|H) =222.7 bits, =2.2269 b/ch
> ran 00/01/21  from 15:33:36  to 15:33:38  
> --- end ---

------------------------------------------------------------------------------

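Note that the more complex models appear to compress better if the cost of stating their parameters is ignored: the data part, (D|H), falls from 197.4 bits for the order-0 model to about 191 bits for the order-1 and approximate-repeats models. Once the parameter cost, (H), is included, the simplest model gives the shortest total: 205.6 bits, against 212.1 and 222.7 bits. The 'uniform' model, which has no parameters to state and costs exactly 2 bits per base (200 bits in total), would be better still.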
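
The arithmetic behind this comparison can be checked with a short Python sketch. It is an illustration only, not the Compress program itself. It assumes that the order-0 data cost is the sequence's empirical entropy times its length, which gives about 197.4 bits here and agrees with the (D|H) line of the output. The hypothesis cost of 8.2 bits is copied from the output rather than recomputed, since the output does not show how it was derived. The function names are made up for this example.

```python
from collections import Counter
from math import log2

# The 100-base sequence from the listing, position numbers and spaces removed.
SEQ = (
    "GGATATCACGTAGTCCCTAGCTCTTGGCGCTGGATGGGGCGGACGGAAGG"
    "GAAACGACCGTTGAATTCCAAATTCGGTCGTATGGAAATATTGCAATGGA"
)

# (H) for the order-0 model, copied from the program output (not recomputed).
ORDER0_HYPOTHESIS_BITS = 8.2


def order0_data_bits(seq):
    """Bits needed for seq under an order-0 model fitted to its own base counts."""
    counts = Counter(seq)
    n = len(seq)
    # Each base with count c costs -log2(c / n) bits and occurs c times.
    return -sum(c * log2(c / n) for c in counts.values())


def uniform_bits(seq, alphabet_size=4):
    """Bits needed when every base costs log2(alphabet_size) and nothing is stated."""
    return len(seq) * log2(alphabet_size)


data_bits = order0_data_bits(SEQ)
total_bits = ORDER0_HYPOTHESIS_BITS + data_bits

print(f"order-0 data:  {data_bits:.1f} bits ({data_bits / len(SEQ):.4f} b/ch)")
print(f"order-0 total: {total_bits:.1f} bits")
print(f"uniform total: {uniform_bits(SEQ):.1f} bits")
```

The fitted order-0 model saves under 3 bits on the data but costs 8.2 bits to state, so its total is longer than the 200 bits of the uniform model.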

Also see [http://www.allisons.org/ll/Bioinformatics/Compress/].