Next-generation sequencing methods, such as RNA-seq, have permitted the exploration of gene expression in a range of organisms that have been studied in ecological contexts but lack a sequenced genome. However, the efficacy and accuracy of RNA-seq annotation methods using reference genomes from related species have yet to be robustly characterised. Here, researchers from the University of Bath conduct a comprehensive power analysis employing RNA-seq data from Drosophila melanogaster in conjunction with 11 additional genomes from related Drosophila species to compare annotation methods and quantify the impact of evolutionary divergence between the transcriptome and the reference genome. Their analyses demonstrate that, regardless of the level of sequence divergence, direct genome mapping (DGM), where transcript short reads are aligned directly to the reference genome, significantly outperforms the widely used de novo and guided assembly-based methods in both the quantity and accuracy of gene detection. Their analysis also reveals that DGM recovers a more representative profile of Gene Ontology functional categories, which are often used to interpret emergent patterns in genome-wide expression analyses. Lastly, analysis of available primate RNA-seq data demonstrates the applicability of these observations across diverse taxa. The researchers' quantification of annotation accuracy and of the reduced gene detection associated with sequence divergence thus provides empirically derived guidelines for the design of future gene expression studies in species without sequenced genomes.
Flow chart outlining pipelines for transcriptome annotation. De novo and reference sequence-guided transcriptome assembly strategies are shown alongside a simpler direct read-to-genome mapping approach, where quality-controlled short transcriptome reads are aligned directly against the closest available annotated reference sequence. *Reference sequences used to guide transcriptome assembly, or onto which reads are mapped directly, may or may not be annotated. If they are not annotated, further information providing the coordinates of genomic features of interest is required. Boxes with squared corners indicate processes; boxes with rounded corners indicate data sets.
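As a rough illustration of why direct mapping needs the coordinates of genomic features, the sketch below counts aligned reads that overlap each annotated gene and reports which genes are detected. It is a minimal example, not the researchers' pipeline and not the interface of any particular aligner: the function names, the tuple layout, the 0-based half-open coordinates and the `min_reads` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap each annotated gene.

    alignments: iterable of (chrom, start, end) tuples (hypothetical layout),
                using 0-based, half-open coordinates.
    genes:      iterable of (gene_id, chrom, start, end) tuples in the same
                coordinate system, taken from the reference annotation.
    """
    genes = list(genes)

    # Group gene intervals by chromosome so each read is only compared
    # against genes on the sequence it was aligned to.
    genes_by_chrom = defaultdict(list)
    for gene_id, chrom, start, end in genes:
        genes_by_chrom[chrom].append((start, end, gene_id))

    counts = {gene_id: 0 for gene_id, _, _, _ in genes}
    for chrom, read_start, read_end in alignments:
        for gene_start, gene_end, gene_id in genes_by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before
            # the other ends.
            if read_start < gene_end and gene_start < read_end:
                counts[gene_id] += 1
    return counts


def detected_genes(counts, min_reads=1):
    """Return the sorted IDs of genes supported by at least min_reads reads.

    min_reads is an assumed, illustrative threshold.
    """
    return sorted(gene_id for gene_id, n in counts.items() if n >= min_reads)
```

The linear scan over genes keeps the sketch short; for genome-scale annotations an interval tree or sorted sweep would be the more usual choice.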
SHRiMP – SHort Read Mapping Package is available at: http://compbio.cs.toronto.edu/shrimp/