Computational Molecular Biology

Computing for Molecular Biology.

Computer Science and Software Engineering, Monash University, Australia 3168.

[Image: DNA as spheres] [Image: DNA ball and stick]

DNA is a very long molecule indeed; these computer-generated pictures show just tiny sections of it. Sequencing DNA requires laboratory work and the assistance of computer programs. Once a sequence of the basic building blocks, or bases (A, T, C, G), has been worked out, other computer programs are used to analyse it in various ways.

Mapping

The human DNA sequence is approximately 3 billion bases long, but biologists can only sequence sections of about 300 bases at one time. To get around this problem, biologists first collect hundreds of thousands of random extracts from the DNA, each approximately 30,000 bases long, called clones.

The first stage of large-scale sequencing is to locate each clone in a physical map. Biologists cut each clone wherever a short restriction site (e.g. GAATTC) occurs and measure the lengths of the resulting fragments. If two clones overlap, then they will have some fragments of similar length in common.

a t a t a t c t c g c g c t a g a t g a c t G A A T T C g a c t g a c t g a c t g g c t c t t c g g c c c c t a t a t a a a t a g c g c g c t a t g a c t t a a a g a c t c t t a c t t t t a c t a t t a t c t a t a t c t a t a t a c t t a c t c t c t c t a t a t a G A A T T C c t a t c a t c a a a a a a a a t c t c t c t

This short region of DNA is cut in two places, leaving 3 fragments. If this region is contained in two overlapping clones, then we can discover this arrangement from the fragment lengths that the two clones have in common.
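The cutting step can be sketched in a few lines of Python. This is only an illustration: the function name `digest` and the `cut_offset` parameter are assumptions made for the example, and the default offset of 1 places the cut just after the leading G of the site, which is where the enzyme recognising GAATTC cuts. Only the fragment lengths are returned, because those are what the mapping stage measures.

```python
def digest(sequence, site="GAATTC", cut_offset=1):
    """Cut `sequence` at every occurrence of `site`; return the fragment lengths.

    `cut_offset` is the cut position measured from the start of the site
    (an assumption of this sketch, not a detail given in the text).
    """
    seq = "".join(sequence.split()).upper()   # drop spaces, ignore letter case
    site = site.upper()

    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)

    # Fragment boundaries: start of the sequence, each cut, end of the sequence.
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


print(digest("aagaattcccGAATTCt"))   # [3, 8, 6]
```

A region containing two sites yields three lengths, as in the example region above.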

Computer programs use the fragment lengths to calculate how likely it is that each pair of clones overlaps. These likelihoods form a matrix, which is used to order the clones: a computer program rearranges the clones so that good overlaps (shown in blue in the picture) are concentrated along the diagonal of the matrix.

Here is a small section from a real map:

7.64  3.18  4.21  1.91  1.07  3.66  2.50  6.47
0.65  3.18  4.75  1.91  1.07  3.66  2.45  2.26  XXXX  1.87  8.55
                        1.37  3.52  XXXX  2.25  0.50  1.84  8.65
                                                      1.53  2.45  2.26  XXXX  1.83  8.63
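As a rough illustration of how fragment lengths hint at overlap, the sketch below counts the fragments that two clones share to within a measurement tolerance. It is a simple stand-in, not the likelihood calculation that the mapping programs perform; the function name `shared_fragments`, the tolerance of 0.1 and the decision to leave out the XXXX entries are all assumptions made for the example.

```python
def shared_fragments(clone_a, clone_b, tolerance=0.1):
    """Count fragment lengths that the two clones have in common,
    treating lengths within `tolerance` of each other as equal."""
    a = sorted(clone_a)
    b = sorted(clone_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1        # pair these two fragments and move on
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1              # a[i] is too short to match anything left in b
        else:
            j += 1
    return matches


# First two rows of the map section above (XXXX entries left out).
row1 = [7.64, 3.18, 4.21, 1.91, 1.07, 3.66, 2.50, 6.47]
row2 = [0.65, 3.18, 4.75, 1.91, 1.07, 3.66, 2.45, 2.26, 1.87, 8.55]
print(shared_fragments(row1, row2))   # 5
```

A high count suggests an overlap, but unrelated fragments can have similar lengths by chance, which is one reason a probabilistic score is preferable to a plain count.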

A clone is still too big to sequence in one step. The clone fragments are broken, at random, into smaller overlapping pieces. Pieces that are about 300 bases long are sequenced individually. Another computer program works out how these short sequences overlap by looking for shared sub-sequences.
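The idea of joining short sequences through a shared sub-sequence can be sketched as follows. The names `overlap` and `merge` and the `min_len` threshold are assumptions made for the example, and the sketch demands an exact match, whereas real sequence data contain errors that an assembly program has to tolerate.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`.
    Returns 0 if no such overlap of at least `min_len` bases exists."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge(left, right, min_len=3):
    """Join two pieces through their overlap, or return None if they do not overlap."""
    n = overlap(left, right, min_len)
    return left + right[n:] if n else None


print(overlap("ACGTTGCA", "TGCAGGT"))   # 4  (the shared sub-sequence is TGCA)
print(merge("ACGTTGCA", "TGCAGGT"))     # ACGTTGCAGGT
```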

Sequence Alignment

[Image: sequence alignment probability density plot]

When a complete DNA sequence is known, it can be compared to databases of other sequences to understand how it evolved and how it works. DNA mutates through the insertion, deletion and change of bases. An alignment density plot (pictured) can show the many ways in which two sequences might be related.
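One simple way to compare two sequences under these three kinds of mutation is the edit distance: the smallest number of insertions, deletions and changes that turns one sequence into the other. The sketch below uses the standard dynamic-programming method with every mutation costing 1, which is an assumption made for the example. It scores a single best alignment only; it does not compute the probability density over all alignments that the plot displays.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and changes turning `a` into `b`."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                        # turning a[:i] into "" takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,              # delete x
                curr[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),   # keep x if equal, otherwise change it
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))    # 1  (one deletion)
print(edit_distance("ACGT", "ACCT"))   # 1  (one change)
```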

Molecular biology data are derived from laboratory experiments. Experimental error and natural variation introduce "noise" and uncertainty into the data. Minimum Message Length (MML) encoding has been used to deal with this difficulty.

Computer Science

Computer Science at Monash University has research projects in Computing for Molecular Biology. Computer Science units on Algorithms, Data Structures, and Parallel Computing provide the necessary computing background to do Computing for Molecular Biology.

DNA images: J. Suhnel, Image Library of Biological Macromolecules, Institute of Molecular Biotechnology, Jena, Germany.
© L. Allison, D. Platt / 1995-1998