As the tree of life is populated ever more densely with sequenced genomes, the new challenge is the accurate and consistent annotation of entire clades of genomes. Researchers from the University of Greifswald address this problem with a new approach to comparative gene finding. It takes a multiple genome alignment of closely related species and simultaneously predicts the location and structure of protein-coding genes in all input genomes, thereby exploiting negative selection and sequence conservation. The model prefers candidate gene structures that agree with each other across the genomes or, where they differ, whose exon gains and losses are plausible given the species tree. The researchers formulate the multi-species gene finding problem as a binary labeling problem on a graph. The resulting optimization problem is NP-hard, but can be efficiently approximated using a subgradient-based dual decomposition approach.
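To make the dual decomposition idea concrete, here is a minimal sketch on a hypothetical toy instance. It is not the Augustus implementation: the species names, exon names, scores and the function names `tree_score` and `dual_decomposition` are all assumptions made for illustration, and both subproblems are solved by brute-force enumeration rather than by the graph algorithms a real gene finder would use. The sketch only shows the general pattern: per-species path problems and a per-exon-group tree problem are solved separately, and Lagrange multipliers are adjusted by subgradient steps until the two agree on which homologous exons are used.

```python
from itertools import product

# Hypothetical toy instance: two species, two candidate gene structures each.
# Every entry is (candidate exons on the path, path score).
PATHS = {
    "A": [(("a1",), 3.0), (("a2",), 1.0)],
    "B": [(("b1",), 1.0), (("b2",), 2.0)],
}
LEAVES = ("a1", "b1")  # one group of homologous candidate exons


def tree_score(labels):
    """Toy phylogenetic score: 0 if all homologs agree, -2 otherwise."""
    return 0.0 if len(set(labels)) == 1 else -2.0


def dual_decomposition(iterations=20):
    lam = {e: 0.0 for e in LEAVES}  # one Lagrange multiplier per shared exon
    best_primal, best_dual = float("-inf"), float("inf")
    for t in range(iterations):
        # Path subproblems: one per species, each solved on its own.
        dual, base, used = 0.0, 0.0, set()
        for paths in PATHS.values():
            exons, score = max(
                paths, key=lambda p: p[1] + sum(lam.get(e, 0.0) for e in p[0])
            )
            dual += score + sum(lam.get(e, 0.0) for e in exons)
            base += score
            used.update(exons)
        x = tuple(int(e in used) for e in LEAVES)

        # Tree subproblem: label the homologous exons, paying -lam per 1-label.
        def tree_value(lab):
            return tree_score(lab) - sum(lam[e] * v for e, v in zip(LEAVES, lab))

        y = max(product((0, 1), repeat=len(LEAVES)), key=tree_value)
        dual += tree_value(y)
        best_dual = min(best_dual, dual)  # every dual value is an upper bound

        # Feasible joint solution: keep the paths, force the labels to match.
        best_primal = max(best_primal, base + tree_score(x))
        if x == y:  # subproblems agree, so the bound is tight
            break
        step = 1.0 / (t + 1)  # diminishing step size
        for e, xe, ye in zip(LEAVES, x, y):
            lam[e] -= step * (xe - ye)  # subgradient step on the dual
    return best_primal, best_dual


if __name__ == "__main__":
    print(dual_decomposition())
```

In this toy case each species on its own would pick non-homologous exons, while the joint optimum uses the homologous pair `a1` and `b1`. The returned pair brackets the optimal joint score between the best feasible solution found and the smallest dual bound.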
The joint gene structure graph G for a set of k homologous sequences
Nodes represent candidate exons. Green edges represent candidate introns or intergenic regions. Each path from the source s_i to the sink ℓ_i is a possible gene structure in sequence i. Homologous candidate exons are at the same time leaf nodes of phylogenetic trees (red edges and nodes). A joint gene structure is sought: a binary labeling of all nodes in G whose restriction to the extant nodes in green defines a collection of k paths s_i ⇝ ℓ_i, i = 1, …, k (highlighted in yellow).
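The idea that every source-to-sink path is one possible gene structure can be illustrated for a single sequence with a short sketch. The node names, scores and the function `best_path` below are hypothetical and chosen only for illustration; the sketch assumes the nodes are given in topological order and finds the highest-scoring path by dynamic programming, which is a standard technique for directed acyclic graphs rather than a description of the tool's internals.

```python
# Hypothetical single-sequence graph: "s" is the source, "t" the sink,
# and e1..e3 are candidate exons listed in topological order.
NODES = ["s", "e1", "e2", "e3", "t"]
EXON_SCORE = {"e1": 2.0, "e2": -1.0, "e3": 1.5}
# Edges stand for candidate introns or intergenic regions: u -> [(v, weight)].
EDGES = {
    "s": [("e1", 0.0), ("e2", 0.0), ("t", 0.0)],
    "e1": [("e3", -0.5), ("t", 0.0)],
    "e2": [("e3", 0.0)],
    "e3": [("t", 0.0)],
}


def best_path(nodes, exon_score, edges, source, sink):
    """Return (score, path) of the highest-scoring source-to-sink path."""
    neg_inf = float("-inf")
    best = {v: neg_inf for v in nodes}
    prev = {}
    best[source] = exon_score.get(source, 0.0)
    for u in nodes:  # topological order guarantees best[u] is final here
        if best[u] == neg_inf:
            continue  # u cannot be reached from the source
        for v, weight in edges.get(u, []):
            candidate = best[u] + weight + exon_score.get(v, 0.0)
            if candidate > best[v]:
                best[v] = candidate
                prev[v] = u
    if best[sink] == neg_inf:
        raise ValueError("sink is not reachable from source")
    path = [sink]
    while path[-1] != source:  # walk the back-pointers to the source
        path.append(prev[path[-1]])
    path.reverse()
    return best[sink], path


if __name__ == "__main__":
    print(best_path(NODES, EXON_SCORE, EDGES, "s", "t"))
```

The exons on the returned path are the ones that would be labeled 1 for this sequence; all other candidate exons would be labeled 0.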
The proposed method was tested on whole-genome alignments of 12 vertebrate and 12 Drosophila species. The accuracy was evaluated for human, mouse and D. melanogaster and compared to competing methods. The results suggest that the method is well suited for annotating large numbers of genomes of closely related species within a clade, in particular when RNA-Seq data is available for many of the genomes. When the genomes are at close to medium distances, transferring existing annotations from one genome to another via the genome alignment is more accurate than previous approaches that are based on protein-spliced alignments.
Availability – The method is implemented in C++ as part of Augustus and available open source at http://bioinf.uni-greifswald.de/augustus/