Next: , Previous: Advantages of qod, Up: Overview


2.3 Maximal Common Intervals and Segmentation of the Central Genome

The algorithm description requires a formal statement of the problem. We are given a central genome T, which we need to compare with k other compared genomes: G_1, \cdots, G_k. For each compared genome G_j with 1 \leq j \leq k, we compute between T and G_j all local pairwise similarities whose statistical significance lies above a user-defined threshold. Each local alignment represents a pair of genomic intervals, one from T and one from G_j, that are aligned with each other. We consider the intervals on T of all those local pairwise similarities: they can be disjoint, overlap, or even include each other on T. The intervals corresponding to the T versus G_j comparison form the \cal C_j collection of base intervals on T. We assume that the n_j intervals of \cal C_j are ordered first by increasing beginning position in T, second by decreasing length. For 1 \leq i \leq n_j, the i^th interval of \cal C_j is denoted I_i^j (the superscript indicates the collection, and the subscript the interval index). The beginning and end positions of an interval I are denoted by b(I) and e(I) respectively. From the k input collections, \cal C_j with 1 \leq j \leq k, we want to compute all Maximal Common Intervals (mci), which we define below.

Definitions

An interval J is common to all \cal C_1 \leq j \leq k if and only if for any collection \cal C_j there exists an interval say I_i^j from \cal C_j such that J \subseteq I_i^j. Assume J = [p,q] with p<q is a common interval. J is said to be a maximal if neither [p-1,q] nor [p,q+1] are common intervals.

Basic properties

Several properties of mci underlie our algorithm.

  1. It can be shown that an mci beginning, respectively end, position is the beginning, resp. end, position of some base interval.
  2. As mci are intersections of base intervals, which can be determined from the interval endpoints, it follows that one only needs to consider the base interval endpoints, not the base intervals themselves, to compute mci.
  3. An mci is the intersection of some base intervals, one from each collection. First, the mci beginning (resp. end) position will be the largest beginning (resp. the smallest end) position of the base intervals overlapping the beginning position.
  4. Moreover, for a given position, if at least an interval of each collection includes it, there exists an mci covering this position.
  5. Last, because of maximality any two mci can be disjoint or overlap, but cannot include each other. As a corrollary, all mci are totally ordered by increasing beginning position.