The gene model

Next: The NULL model Up: Model Previous: Start End points Contents

The gene model

For the emissions of the gene model we had to do more work. What we did was to make a database of known genes, with annotated gene structure. These genes then provided a raw set of counts for particular parts of the gene structure. It is these raw counts which are stored in the .gf files. (we store the raw counts because one might want to do something clever for deriving the probabilities of certain things using these counts. Counts are the basis for the probability derivations, not frequencies).

The only issue here is what to do with the splice sites. We were well aware that the information in the splice sites is considerably more than just the simple position matrix. We chose to use a single branching (biased) decision tree, in which each branch either carried along the main trunk of the tree or ended in a leaf, each leaf representing a consensus build from A,T,G,C or N for any character. This decision tree could be easily constructed by chosing the most common consensus (where N is allowed where a position is better represented by N than any specific residue), and then removing that consensus from the list of observed consensi, and then repeating the process. This also gave us the same basis (counts) for each consensus used in the splice sites.

One additional twist came about in the splice site development. The splice sites overlap between their consensi and the coding sequence region. These overlaps need to be treated correctly: the problem is that probabilistically we have two processes wanting to account for the same DNA bases. This was solved by assumming conditional independence between the two processes. A more formal mathematicall approach can be found in the documented called 'probappendix'.

Next: The NULL model Up: Model Previous: Start End points Contents

Eric DEVEAUD 2015-02-27