A wealth of computational methods have been designed to meet the needs of comparative genomics; most attempt to find shared and specific regions in the compared genomes. The two mostly used bioinformatic solutions for this sake rely on different information levels: the genome or proteome levels. The first involves comparing the orfeomes/proteomes, as sets of proteins, to predict orthology with the reciprocal blast-hit approach, which is time consuming. Moreover, this requires the proteins to be correctly annotated. The second solution, whole genome alignment, is a computationnaly difficult optimisation problem. Hence, heuristic alignment tools usually build a highest scoring chain of local alignments and try to infer the evolution of each genome in term of rearrangements (duplication, inversion, transposition), another np-hard problem. The output alignment is sensitive to the method's parameters, the setting of which requires trained users. Consequently, even with a whole genome alignment at hand, which could be long and complex to obtain, it is not straightforward to determine the core genome of a bacterial species.
Note that neither the chaining, nor the rearrangement inference steps solved during a whole genome alignment (and other methods) are necessary to determine the shared and specific parts of the genomes under consideration. Hence, whole genome alignment may be too involved for some goals in comparative genomics. This is the rationale that led us to propose a new formulation of multiple genome comparison to meet only this goal. We introduce a novel concept, Maximum Common Intervals: a genome region that cannot be extended and is shared (i.e., alignable) across all genomes1. The formulation is, given the set of pairwise local similarities between a central genome and each other genome as input, compute all mci of the center. Apart from avoiding the optimisation of a numerical criterion (which often turns to be np-hard), and having few parameters, this formulation has another nice property: it can be solved exactly with a fast algorithm, which moreover yields a unique solution. The number of mci covering a region also indicates whether several possible alignments exist for that region. Hence, the central genome segmentation induced by the mci allows to partition regions into: unshared, shared with only one putative alignment, shared with several putative alignments, where the second category indicates possible orthology relationships.
Hence, qod is an almost linear time algorithm to compute all mci and the corresponding partition; furthermore, it is equiped with a Graphical User Interface (gui) which provides a graphical overview of the multiple similarities on the central genome. If provided with the center's annotations, qod intersects the partition with annotations and deduces from the mci the pairwise alignment of each annotated feature. In potentially orthologous regions, qod automatically selects well conserved features and proposes them as potential annotation transfers to the user. qod, which runs on all major computer platforms, is further equipped with many user friendly options like word search, graphical/textual results or annotation transfers export, etc.
[1] mci: not to be confounded with Common Intervals as subset of shared genes that colocalised in a region