Next: Scores
Up: Concepts and conventions
Previous: The NULL model
Contents
Algorithms
The algorithms are then based around this central model, but
have a variety of features removed from it progressively, either
due to biological constraints (bacterial sequences have no introns,
so there is no need to model them) or to speed up the the algorithm.
Algorithms are named in two parts, descriptive-word state-number:transition-number.
The descriptive word indicates the biological model. At the moment
there are 2 such biological models in the package
- genewise
- comparisons of protein information to genomic DNA
- estwise
- comparisons of protein information to cDNA/bacterial DNA (no
introns)
There are many other models being worked on in development
- sywise
- comparisons of genomic DNA to genomic DNA
- parawise
- comparions of cDNA to cDNA
The state-number:transition-number is the number of states in the model
followed by the number of transitions. GeneWise 21:93 is the most complicated
model, with 21 states and 93 transitions. The number of states is directly
proportional to the memory usage of the program. The number of transitions
is roughly proportional to the CPU time of the algorithm. For comparison the
standard smithwaterman algorithm is a 3:7 algorithm (3 states, 7 transitions).
These numbers are per compared residue - so as genomic DNA is some 1,000 fold
longer than protein sequences on average, there is an additional massive
CPU load.
Finally the algorithms can be looping or not. A Looping algorithm is one in
which the protein information can be repeated in the DNA target sequence.
This could either be due to mutliple copies of the gene in the DNA sequence
or multiple copies of a domain in a single gene. Looping algorithms are
given a 'L' tag. By default, when you use profile-HMMs you use a looping model
For the genewise family the following algorithms are available.
- genewise 21:93
- The largest genewise algorithm which also contains
a complex flanking model to prevent inappropiate gene predictions
- genewise 21:93L
- The same algorithm with a looping mode. This allows
a protein HMM (nearly always a HMM) to match multiple times a DNA sequence.
This could be due to multiple domains in a single gene or multiple genes
in a DNA sequence with the domain. The algorithm doesn't distinguish between
these possibilities.
- genewise 6:23
- This is a smaller, (and so faster) algorithm. The
approximations made compared to genewise 21:93 are that there is no
poly-pyrimidine tract in the intron, and that introns from match states
are not distinct from introns in insert states.
A side effect of these approximations is that 6:23 is much more robust
with respect to unmasked repeats and strange composition effects found
in the DNA sequences.
- genewise 6:23L
- The same algorithm as 6:23 but in looping mode
- genewise 4:21
- The smallest algorithm in the genewise family,
with an additional approximation of not distinguishing between introns
of different phases. This has been compiled for short protein sequences
only - effectively only profile-HMMs.
For the estwise family the following algorithms are available
- estwise 3:33
- The largest estwise algorithm, modelling potential
insertion or deletions throughout the alignment of the protein
information to the DNA sequence.
- estwise 3:33L
- The same algorithm but in looping mode.
- estwise 3:12
- A slimmer algorithm designed for faster db searching.
The algorithm models enough insertions or deletions of DNA bases to
'ride through' a indel region without too much penalty, even if it
doesn't model the most correct one.
Next: Scores
Up: Concepts and conventions
Previous: The NULL model
Contents
Eric DEVEAUD
2015-02-27