Scores



The scoring system for the algorithms, as alluded to earlier, is a Bayesian score. This score is related to the probability that the model provided to the algorithm exists in the sequence (often called the posterior). Rather than expressing this probability directly, I report a log-odds ratio of the likelihoods of the model compared to a random model of DNA sequence. This ratio (often called the bits score because the log is base 2) should be such that a score of 0 means that the two alternatives, "it has this homology" and "it is a random DNA sequence", are equally likely. However, there are two features of the scoring scheme that are not worked into the score, which means that some extra calculations are required:

The score is reported as a likelihood ratio of the models, and to convert this to a posterior probability you need to factor in the ratio of the prior probabilities for a match. Because you expect a far greater number of sequences to be random than not, this prior knowledge needs to be worked in. Offhand, sensible priors would put the odds against a match at roughly the order of the database size.
The posterior probability should not merely favour the homology model over the random model but also be confident in it. In other words, you would want probabilities in the 0.95 or 0.99 range before being confident that the match is correct.
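
As a minimal sketch of the first point, the conversion from a bits score to a posterior probability can be written out in Python. The function name `posterior_from_bits` and the parameter `prior_match` are illustrative assumptions and are not part of the package; the only thing taken from the text is that the score is a base-2 log-odds ratio and that the prior odds must be added in.

```python
import math


def posterior_from_bits(bits: float, prior_match: float) -> float:
    """Posterior probability of the homology model, given a bits score.

    bits        -- reported log2 likelihood ratio (model vs. random DNA)
    prior_match -- assumed prior probability of a true match (hypothetical)
    """
    if not 0.0 < prior_match < 1.0:
        raise ValueError("prior_match must lie strictly between 0 and 1")

    # Bayes in log2-odds form: posterior odds = likelihood ratio * prior odds.
    log2_odds = bits + math.log2(prior_match / (1.0 - prior_match))

    # Convert log2 odds back to a probability, avoiding overflow of 2**x.
    if log2_odds >= 0:
        return 1.0 / (1.0 + 2.0 ** (-log2_odds))
    odds = 2.0 ** log2_odds
    return odds / (1.0 + odds)


# With equal priors, a score of 0 bits leaves the two models equally likely.
print(posterior_from_bits(0.0, 0.5))
```

With a small prior the same score gives a much lower posterior, which is why a positive bits score on its own is not enough.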

These two features mean that the reported bits score needs to be above some threshold which combines the effect of the prior probabilities and the need to have confidence in the posterior probability. In this field people do not tend to work the threshold out rigorously using the above technique because, in fact, deficiencies in the model mean that you end up choosing some arbitrary number for a cutoff. In my experience, the following things hold true: bit scores above 35 nearly always mean that there is something there, bit scores between 25 and 35 are generally true, and bit scores between 18 and 25 are true in some families but definitely noise in others. I don't trust anything with a bit score of less than 15 bits for these DNA-based searches. For protein-HMM to protein searches there are a number of cases where very negative bit scores are still 'real' (this is best shown by a classical statistical method, usually given as E-values, which is available from the HMMer2 package), but this doesn't seem to occur in the DNA searches.
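
For illustration only, the rigorous threshold described at the start of the paragraph above can be sketched as the sum of two terms: the log2 prior odds against a match, and the log2 odds corresponding to the posterior confidence you want. The function `bits_threshold` and the choice of a prior of one over a database of a million sequences are assumptions made for this example, not values used by the package, and the result is not a substitute for the empirical cutoffs given above.

```python
import math


def bits_threshold(prior_match: float, confidence: float) -> float:
    """Bits score needed for the posterior to reach `confidence`.

    prior_match -- assumed prior probability of a true match (hypothetical)
    confidence  -- required posterior probability, e.g. 0.95 or 0.99
    """
    for name, p in (("prior_match", prior_match), ("confidence", confidence)):
        if not 0.0 < p < 1.0:
            raise ValueError(f"{name} must lie strictly between 0 and 1")

    # Bits needed just to cancel out the prior odds against a match.
    prior_penalty = math.log2((1.0 - prior_match) / prior_match)
    # Extra bits needed to push the posterior from 0.5 up to `confidence`.
    confidence_margin = math.log2(confidence / (1.0 - confidence))
    return prior_penalty + confidence_margin


database_size = 1_000_000  # hypothetical database size, for illustration
print(round(bits_threshold(1.0 / database_size, 0.99), 1))
```

Under these assumed numbers the threshold comes out at roughly 26.6 bits; different priors or confidence levels shift it accordingly.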

I have been thinking about using a classical statistical method on top of the bit score, assuming the score distribution is an extreme value distribution (EVD), but for DNA it is difficult to know how to handle the different lengths of the DNA sequences, which can vary wildly. Currently a single HMM compared to a DNA database can produce E-values using Sean Eddy's EVD fitting code, but I am not completely confident that I am doing the correct thing. Please use it, but keep in mind that it is an experimental feature.
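
To make the idea of an EVD-based E-value concrete, here is a minimal sketch using the standard Gumbel (type I extreme value) tail formula. It is not the fitting code mentioned above: the location `mu` and scale `lam` are assumed to have been fitted already, all names are hypothetical, and the sketch does nothing about the sequence-length problem.

```python
import math


def evd_pvalue(score: float, mu: float, lam: float) -> float:
    """P(S >= score) under a Gumbel EVD with location mu and scale lam."""
    if lam <= 0.0:
        raise ValueError("lam must be positive")
    exponent = -lam * (score - mu)
    if exponent > 700.0:
        # Score far below mu: the tail probability is 1 to machine precision,
        # and math.exp would overflow.
        return 1.0
    # 1 - exp(-exp(-lam * (score - mu))), computed stably for small tails.
    return -math.expm1(-math.exp(exponent))


def evd_evalue(score: float, mu: float, lam: float, n_sequences: int) -> float:
    """Expected number of random hits scoring >= score in n_sequences."""
    return n_sequences * evd_pvalue(score, mu, lam)


# Hypothetical fitted parameters and database size, for illustration only.
print(evd_evalue(30.0, mu=5.0, lam=0.7, n_sequences=100_000))
```

The E-value is simply the tail probability multiplied by the number of sequences searched, so it grows with database size for a fixed score.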


Eric DEVEAUD 2015-02-27