next up previous contents
Next: Algorithms Up: Model Previous: The gene model   Contents

The NULL model

The probability of the model has to compared to an alternative model (in fact to all alternative models which are possible) to allow proper Bayesian inference. This causes considerable difficulty in these algorithms because from a algorithmical point of view we would probably like to use an alternative model which is a single state, like the random model in profile-HMMs, where we can simply 'log-odd' the scored model, whereas from a biological point of view we probably want to use a full gene predicting alternative model.

In addition we need to account for the fact that the protein HMM or protein homolog probably does not extend over all the gene sequence, nor in fact does the gene have to be the only gene in the DNA sequence. This means that there are very good splice sites/poly-pyrimidine tracts outside of the 'matched' alignment can severely de-rail the alignment.

Basically we are in trouble with the random model parts of this problem.

The solutions is different in the genewise21:93 compared to the genewise 6:23 algorithms. Genewise 6:23 is shown in figure 2

Figure 2: GeneWise6:23
\begin{figure}\begin{center}
\leavevmode\epsfxsize 300pt
\epsfbox{genewise6.eps}
\newline\end{center}
\end{figure}

However this still does not solve the problem about what to compare it to.

There are two approaches to the comparison

flat
The homology model is scored against a single state 0.25 emission model. This is effectively 'how likely is this DNA segement has any genes some with this homologous protein/HMM in it' for 21:93. It is, unsurprisingly, a massive 'yes' for nearly all biological DNA, and though a valid number in terms in bayesian inference pretty biologically uninteresing. There is also no decent interpretation of partial scores (ie, scores per domain).

syn
For synchronous model pretends that there is an alternative model of a complete gene which is dragged into the coding part of the gene when the homology model is in the coding part. This is not probabilistically valid, but gives better results and interpretable scores for partial regions, ie domain by domain. (in fact, very similar scores to protein sequences). However I'm worried about what I am doing It would be much better to get some mathematically justification for this.


next up previous contents
Next: Algorithms Up: Model Previous: The gene model   Contents
Eric DEVEAUD 2015-02-27