Algorithms

Next: Scores Up: Concepts and conventions Previous: The NULL model Contents

Algorithms

The algorithms are then based around this central model, but have a variety of features removed from it progressively, either due to biological constraints (bacterial sequences have no introns, so there is no need to model them) or to speed up the the algorithm.

Algorithms are named in two parts, descriptive-word state-number:transition-number. The descriptive word indicates the biological model. At the moment there are 2 such biological models in the package

genewise: comparisons of protein information to genomic DNA
estwise: comparisons of protein information to cDNA/bacterial DNA (no introns)

There are many other models being worked on in development

sywise: comparisons of genomic DNA to genomic DNA
parawise: comparions of cDNA to cDNA

The state-number:transition-number is the number of states in the model followed by the number of transitions. GeneWise 21:93 is the most complicated model, with 21 states and 93 transitions. The number of states is directly proportional to the memory usage of the program. The number of transitions is roughly proportional to the CPU time of the algorithm. For comparison the standard smithwaterman algorithm is a 3:7 algorithm (3 states, 7 transitions). These numbers are per compared residue - so as genomic DNA is some 1,000 fold longer than protein sequences on average, there is an additional massive CPU load.

Finally the algorithms can be looping or not. A Looping algorithm is one in which the protein information can be repeated in the DNA target sequence. This could either be due to mutliple copies of the gene in the DNA sequence or multiple copies of a domain in a single gene. Looping algorithms are given a 'L' tag. By default, when you use profile-HMMs you use a looping model

For the genewise family the following algorithms are available.

genewise 21:93: The largest genewise algorithm which also contains a complex flanking model to prevent inappropiate gene predictions
genewise 21:93L: The same algorithm with a looping mode. This allows a protein HMM (nearly always a HMM) to match multiple times a DNA sequence. This could be due to multiple domains in a single gene or multiple genes in a DNA sequence with the domain. The algorithm doesn't distinguish between these possibilities.
genewise 6:23: This is a smaller, (and so faster) algorithm. The approximations made compared to genewise 21:93 are that there is no poly-pyrimidine tract in the intron, and that introns from match states are not distinct from introns in insert states.
A side effect of these approximations is that 6:23 is much more robust with respect to unmasked repeats and strange composition effects found in the DNA sequences.
genewise 6:23L: The same algorithm as 6:23 but in looping mode
genewise 4:21: The smallest algorithm in the genewise family, with an additional approximation of not distinguishing between introns of different phases. This has been compiled for short protein sequences only - effectively only profile-HMMs.

For the estwise family the following algorithms are available

estwise 3:33: The largest estwise algorithm, modelling potential insertion or deletions throughout the alignment of the protein information to the DNA sequence.
estwise 3:33L: The same algorithm but in looping mode.
estwise 3:12: A slimmer algorithm designed for faster db searching. The algorithm models enough insertions or deletions of DNA bases to 'ride through' a indel region without too much penalty, even if it doesn't model the most correct one.

Next: Scores Up: Concepts and conventions Previous: The NULL model Contents

Eric DEVEAUD 2015-02-27