The protein model

For the emissions of the underlying amino acids when we have a profile HMM, we are lucky - we can take the probabilities defined in the HMMer2 models. This is completely natural and means I don't have to worry about deriving probabilities for the profile HMMs.

In the case where we have a protein sequence, I somehow have to get to a profile HMM type representation. Thankfully the Smith-Waterman algorithm is, in terms of architecture, very close to a profile HMM, so the only problem is mapping the usual scores used in the Smith-Waterman algorithm to probabilities. This is quite hard to do correctly, but I've hacked it by knowing that the BLOSUM62 matrix is given in half bits, in other words using a 2*log2 mapping from probability space to the scores given in the matrix. By reversing this process one can get pretty good emission probabilities for the amino acids. I then assume that the gap penalties are also written in half bits. A certain amount of normalisation is required to make sure things add up to one, and voilà - one profile HMM from a single sequence.
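The half-bit reversal can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by the program. The function names, the three-letter toy alphabet, the toy scores, the uniform background frequencies and the way the gap probabilities are split across transitions are all hypothetical. Note that substitution scores are strictly log-odds values, so the sketch multiplies the recovered odds by a background frequency before normalising.

```python
def half_bits_to_odds(score):
    """Invert score = 2*log2(odds), giving odds = 2**(score/2)."""
    return 2.0 ** (score / 2.0)


def emission_row(scores, background):
    """Emission distribution for one query residue.

    scores:     residue -> half-bit score against the query residue
    background: residue -> assumed background frequency
    """
    raw = {res: background[res] * half_bits_to_odds(s)
           for res, s in scores.items()}
    total = sum(raw.values())
    # Normalise so the emissions add up to one.
    return {res: value / total for res, value in raw.items()}


def transitions(gap_open, gap_extend):
    """Treat gap penalties as half bits: probability = 2**(-penalty/2).

    Assumed layout (hypothetical): the open probability is used for both
    match->insert and match->delete, and match->match takes the remainder.
    """
    p_open = 2.0 ** (-gap_open / 2.0)
    p_extend = 2.0 ** (-gap_extend / 2.0)
    return {
        "match->match": 1.0 - 2.0 * p_open,
        "match->insert": p_open,
        "match->delete": p_open,
        "insert->insert": p_extend,
        "insert->match": 1.0 - p_extend,
    }


# Toy example: a made-up three-residue alphabet with made-up scores.
toy_scores = {"A": 4, "R": -1, "N": -2}
toy_background = {"A": 1 / 3, "R": 1 / 3, "N": 1 / 3}

row = emission_row(toy_scores, toy_background)
trans = transitions(gap_open=12, gap_extend=2)
```

With these toy values, a gap-open penalty of 12 half bits corresponds to a probability of 2^-6, and the emission row favours the residue with the highest score. A real conversion would use the full 20-letter matrix and sensible background frequencies.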

Eric DEVEAUD 2015-02-27