Published reference genomes should be re-annotated before use as references for RNA-Seq experiments

RNA-seq based on short reads generated by next-generation sequencing technologies has become the main approach for studying differential gene expression. Until now, the technique has mainly been used to study variation in gene expression in a whole organism, tissue or cell type under different conditions or at different developmental stages. However, RNA-seq also has great potential in evolutionary studies that investigate gene expression divergence between closely related species.

Researchers from the Göttingen Center for Molecular Biosciences show that the published genomes and annotations of the three closely related Drosophila species D. melanogaster, D. simulans and D. mauritiana have limitations for inter-specific gene expression studies. These limitations stem from missing gene models in at least one of the genome annotations, unclear orthology assignments and significant gene length differences between the species. A comprehensive evaluation of four statistical frameworks (DESeq2, DESeq2 with length correction, RPKM-limma and RPKM-voom-limma) shows that none of these methods sufficiently accounts for inter-specific gene length differences, which inevitably results in false positive candidate genes. The researchers propose that published reference genomes should be re-annotated before they are used as references for RNA-seq experiments, both to include as many genes as possible and to account for a potential length bias. They have developed a straightforward reciprocal re-annotation pipeline that allows reliable comparison of expression for nearly all genes annotated in D. melanogaster.
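The three limitations named above (missing gene models, and length differences between orthologs) can be screened for before any expression analysis. The following is a minimal sketch, not the authors' method: it assumes that orthology has already been resolved so that the same gene identifier is used in every species, and the function name, the species keys and the 10% length tolerance (`max_ratio`) are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def flag_problem_genes(
    lengths_by_species: Dict[str, Dict[str, int]],
    max_ratio: float = 1.1,  # hypothetical tolerance, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Return (missing, length_biased) gene lists.

    lengths_by_species maps a species name to {gene_id: transcript length in bp}.
    Gene ids are assumed to be shared across species (orthology already assigned)
    and all lengths are assumed to be positive.
    """
    all_genes = set()
    for gene_lengths in lengths_by_species.values():
        all_genes.update(gene_lengths)

    missing: List[str] = []
    length_biased: List[str] = []
    for gene in sorted(all_genes):
        lengths = [d.get(gene) for d in lengths_by_species.values()]
        if any(length is None for length in lengths):
            # No gene model for this gene in at least one annotation.
            missing.append(gene)
        elif max(lengths) / min(lengths) > max_ratio:
            # Longest ortholog exceeds the shortest by more than the tolerance.
            length_biased.append(gene)
    return missing, length_biased
```

Genes in either list are the ones for which a naive cross-species comparison is most likely to mislead, which is the motivation for re-annotating the references rather than relying on the statistical framework alone.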

Schematic representation of length bias in inter-species differential expression analysis and a reciprocal re-annotation strategy to correct it


a Length bias in the analysis of a non-differentially expressed gene. Coloured rectangles represent the part of the transcript that is included in the reference for RNA-seq reads to map to, while unfilled rectangles are regions of the transcript that are omitted and to which RNA-seq reads cannot be mapped. Red “N”s represent sequencing errors that prevent the complete annotation of a transcript. Mapped reads are shown as thin black lines and the number below indicates the total number of reads mapped. (upper panel) If a transcript is shorter in one of the references than its orthologs, fewer reads will map to it at the same expression level. This can result in false positives in the analysis of differential expression. (lower panel) The strategy to correct this bias is to shorten the orthologs in the other references to match the length of the shortest sequence.
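The arithmetic behind panel a can be illustrated with a toy model. This sketch assumes that reads are spread uniformly along a transcript, so the expected read count is simply coverage density times reference length; real coverage is not uniform, and the function names and numbers below are hypothetical.

```python
import math
from typing import Dict


def expected_reads(density_per_bp: float, ref_length: int) -> float:
    """Expected read count under the uniform-coverage assumption."""
    return density_per_bp * ref_length


def trim_to_shortest(lengths: Dict[str, int]) -> Dict[str, int]:
    """Shorten every ortholog to the length of the shortest one.

    This mimics the lower panel: only the region that can be annotated
    in every species is kept as mapping reference.
    """
    shortest = min(lengths.values())
    return {species: shortest for species in lengths}


def log2_fold_change(reads_a: float, reads_b: float) -> float:
    """log2 ratio of two read counts (both assumed to be > 0)."""
    return math.log2(reads_a / reads_b)


# Same expression level in both species, but one reference is truncated.
density = 0.05  # reads per bp, identical in both species (toy value)
lengths = {"species_a": 2000, "species_b": 1000}

before = log2_fold_change(
    expected_reads(density, lengths["species_a"]),
    expected_reads(density, lengths["species_b"]),
)
trimmed = trim_to_shortest(lengths)
after = log2_fold_change(
    expected_reads(density, trimmed["species_a"]),
    expected_reads(density, trimmed["species_b"]),
)
```

In this toy case the untrimmed references give an apparent two-fold difference for a gene that is not differentially expressed, and the difference disappears once both references cover the same region.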

b Pipeline of the reciprocal transcriptome re-annotation method. Black numbers in white circles represent genome annotation steps using the “est2genome” command of Exonerate. Grey numbers in grey circles represent conversion of the resulting GFF file into a new transcript set. Filled horizontal bars represent the annotated set of transcripts; unfilled horizontal bars at the start or end of a transcript represent parts that cannot be correctly annotated in one reference and are therefore removed from the transcript set. The boxes with red frames indicate the transcript sets that will be used as references for RNA-seq read mapping (after confirmation by reciprocal BLAST).

Step 1: the transcript set of the best-annotated genome (D. melanogaster in this study) is used to annotate one of the other genomes (D. simulans in this study) and generate a new transcript set for this species. Due to sequencing errors, some transcripts will be shorter.

Step 2: the new transcript set from D. simulans is used to annotate the last genome (D. mauritiana in this study). The resulting gene set contains transcripts shortened by sequencing errors in D. mauritiana as well as in D. simulans.

Step 3: the transcript set from D. mauritiana is used to re-annotate the previously generated set from D. simulans to integrate the information from the D. mauritiana assembly.

Step 4: the second transcript set from D. simulans is used to re-annotate the D. melanogaster set in order to integrate the information from D. simulans and D. mauritiana.
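The order of the four steps can be written down as a small plan in which each step's output becomes the next step's query. This is a hedged sketch rather than the authors' pipeline: all file names are hypothetical, the target of every step is assumed to be a genome assembly, and the GFF-to-transcript conversion between steps is not implemented here. The command uses Exonerate's `--model est2genome` with `--showtargetgff`; any further options the authors used are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotationStep:
    number: int
    query_transcripts: str  # FASTA of transcripts used as query
    target_genome: str      # FASTA of the genome being annotated (assumed)
    output_fasta: str       # transcript set produced after GFF conversion


def build_plan() -> List[AnnotationStep]:
    """Steps 1-4 with hypothetical file names."""
    return [
        AnnotationStep(1, "mel_transcripts.fa", "sim_genome.fa", "sim_set1.fa"),
        AnnotationStep(2, "sim_set1.fa", "mau_genome.fa", "mau_set.fa"),
        AnnotationStep(3, "mau_set.fa", "sim_genome.fa", "sim_set2.fa"),
        AnnotationStep(4, "sim_set2.fa", "mel_genome.fa", "mel_set.fa"),
    ]


def exonerate_command(step: AnnotationStep) -> List[str]:
    """Build (but do not run) an Exonerate est2genome command for one step.

    Exonerate writes its alignments and GFF to standard output, so the
    caller would redirect that output to a file and convert it to FASTA.
    """
    return [
        "exonerate",
        "--model", "est2genome",
        "--query", step.query_transcripts,
        "--target", step.target_genome,
        "--showtargetgff", "yes",
    ]


if __name__ == "__main__":
    for step in build_plan():
        print(f"Step {step.number}:", " ".join(exonerate_command(step)))
```

Keeping the plan as data makes the chaining explicit: a region lost to sequencing errors in any species is absent from every later transcript set, which is how the final sets end up with matching lengths.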

The researchers conclude that their reciprocal re-annotation of previously published genomes facilitates the analysis of significantly more genes in an inter-specific differential gene expression study. They propose that the established pipeline can easily be applied to re-annotate other genomes of closely related animals and plants to improve comparative expression analyses.

Torres-Oliva M, Almudi I, McGregor AP, Posnien N. (2016) A robust (re-)annotation approach to generate unbiased mapping references for RNA-seq-based analyses of differential expression across closely related species. BMC Genomics 17(1):392.
